This workshop equips newcomers with a foundation for applying computational text analysis methods in their work. The focus is on high-level descriptions of what existing methods do and on user-friendly implementations. It is drawn from the first day of a four-day workshop on computational text analysis held at the D-Lab. The following days cover regular expressions, unsupervised methods, and supervised methods.
- Provide a general roadmap of computational text analysis (CTA)
- Build intuitions about using text as data
- Gain practice with preprocessing and more
- Understand at a high level:
  - how a few primary CTA methods work
  - what kinds of questions they answer
  - how to design and implement a CTA project
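To give a flavor of what "text as data" and preprocessing can mean in practice, here is a minimal sketch using only the Python standard library. The sample sentence and the `preprocess` helper are illustrative assumptions, not taken from the workshop notebooks, which may use other tools and steps.

```python
import re
from collections import Counter

# Hypothetical sample text, not from the workshop materials.
text = "The cat sat on the mat. The mat was flat!"


def preprocess(raw):
    """Lowercase the text, replace punctuation with spaces, and split on whitespace."""
    lowered = raw.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return no_punct.split()


tokens = preprocess(text)   # a list of word tokens
counts = Counter(tokens)    # word frequencies: text turned into numbers

print(tokens)
print(counts.most_common(2))
```

Once text is reduced to tokens and counts like this, it can be compared, sorted, and modeled in the same way as other quantitative data.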
We will get our hands dirty implementing some of the methods in Python. If you would like to follow along with the implementation details, you will need some familiarity with Python. If you haven't programmed in Python, or at all, you are still welcome to participate and learn the big ideas behind the methods.
For simplicity, just click the "Launch Binder" button to create a virtual environment ready for this workshop.
If you want to run the code on your own computer, you have two options. You can download Anaconda, which makes installation easy. Or, if you already have Python 3.x installed along with the libraries listed in requirements.txt, you're welcome to clone this repository and follow along on your own machine. You can also install all the necessary packages like so:
pip3 install -r requirements.txt
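If you install locally, one way to check the result is a short script like the sketch below. It assumes Python 3.8 or later and that `requirements.txt` sits in the current directory with one package per line. The `missing_requirements` helper is hypothetical and not part of this repository, and it only checks that each package is installed, not that its version matches.

```python
import re
import sys
from importlib import metadata
from pathlib import Path


def missing_requirements(path="requirements.txt"):
    """Return package names listed in a requirements file that are not installed."""
    missing = []
    req_file = Path(path)
    if not req_file.exists():
        # Nothing to check if the file is absent.
        return missing
    for line in req_file.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and pip options such as "-r other.txt".
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        # Keep only the package name, dropping version specifiers such as "==1.0".
        name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Missing packages:", missing_requirements())
```

An empty list of missing packages suggests the environment is ready. Otherwise, rerunning the pip command above should install what is missing.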
It's OK Not To Know! That's our motto at D-Lab. D-Lab is open to researchers and professionals from all disciplines and levels of experience.
- CTAWG (Computational Text Analysis Working Group) website
- Lectures from Stanford's NLP class
- Workshops on NLTK and SpaCy at the D-Lab
- Computational Text Analysis 4-day workshop at the D-Lab
- Info 256 Spring 2019 - Applied NLP class by David Bamman
If you spot a problem with these materials, please open an issue describing the problem.
These materials have evolved over a number of years. They were first developed for the D-Lab by Laura Nelson & Teddy Roland, with contributions and revisions made by Ben Gebre-Medhin, Geoff Bacon, and most recently by Caroline Le Pennec-Caldichoury.