Requirements:
- input is either a stdin pipe or a filename
- if the input file is not plain text, convert it to plain text
- the input file can be PDF, Unicode, or ASCII
- output is a histogram of noun and verb phrases, complete with adjective and adverbial modifiers, contained in the input
This script uses TextBlob for the heavy lifting.
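As a rough illustration of the histogram requirement, the sketch below counts phrases and writes them as CSV rows. It is not the code in nouns.py or verbs.py: the function names and the "phrase,count" column layout are assumptions made for this example. The only TextBlob feature used is the `noun_phrases` property, and TextBlob is imported lazily so the counting helpers work without it.

```python
import csv
import sys
from collections import Counter


def phrase_histogram(phrases):
    """Count phrases case-insensitively, ignoring surrounding whitespace."""
    cleaned = (p.strip().lower() for p in phrases)
    return Counter(p for p in cleaned if p)


def write_histogram(hist, out):
    """Write one 'phrase,count' row per phrase, most frequent first.

    The column layout is an assumption, not the scripts' documented format.
    """
    writer = csv.writer(out)
    for phrase, count in hist.most_common():
        writer.writerow([phrase, count])


def noun_phrases(text):
    """Extract noun phrases with TextBlob (imported only when called)."""
    from textblob import TextBlob
    return TextBlob(text).noun_phrases
```

With TextBlob and its corpora installed, `write_histogram(phrase_histogram(noun_phrases(text)), sys.stdout)` would print the histogram for a string `text`.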
This tool depends upon Python2 and a few C and Python libraries. See the first step below.
Note that one must be careful about macOS installations, which no longer
include Python 2. We recommend using Homebrew's pyenv (and, indeed, these
instructions assume a working Homebrew installation).
- Install distribution-level dependencies
- Ubuntu/Debian:
$ sudo apt install build-essential libpoppler-cpp-dev libmagic-dev pkg-config python3-venv
- macOS:
$ brew install poppler libmagic
$ brew install pyenv pyenv-virtualenv
(v2.4.10 is latest as of this writing)
$ pyenv install 3.12.5
(v3.12.5 is the latest release of Python3)
$ pyenv install 2.7.18
(v2.7.18 is the final release of Python2)
$ pyenv global system 3.12.5 2.7.18
(puts both versions into the global environment)
- Run
$ eval "$(pyenv init -)"
and consider adding it to your shell startup.
- Create a Python virtual environment
$ python3 -m venv env
makes one named env
$ source env/bin/activate
lets you work in that environment
$ deactivate
gets you back to your normal environment
- Install Python package dependencies, making sure you use Python2's pip:
$ pip2 install -r requirements.txt
- Install Pattern locally
$ pip2 install Pattern==2.6
- Download necessary NLTK data
$ python2 -c 'import nltk; nltk.download("brown"); nltk.download("punkt")'
$ python2 -m textblob.download_corpora
The provided Makefile has two rules that run the extraction commands on
this README. If those commands run with no output beyond printing the
selftest commands, the installation is working.
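If the self-test fails, a quick import check can help narrow down which dependency is missing. The sketch below is not part of the Makefile or the scripts. The helper name is hypothetical, and the import names `textblob`, `nltk`, and `pattern` are assumed from the packages this README mentions; requirements.txt may list others. Run it with the same interpreter used for the tool.

```python
from __future__ import print_function

import importlib


def missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for TextBlob, NLTK, and Pattern.
REQUIRED = ["textblob", "nltk", "pattern"]


def report(names=REQUIRED):
    """Print a one-line summary of the modules that failed to import."""
    missing = missing_modules(names)
    if missing:
        print("missing modules: " + ", ".join(missing))
    else:
        print("all listed modules import cleanly")
    return missing
```

Calling `report()` prints the names that failed to import, if any.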
$ ./nouns.py $PDF_OR_TEXT_DOCUMENT.txt > out.csv
or
$ ./nouns.py $PDF_OR_TEXT_DOCUMENT.pdf > out.csv
or
$ cat $TEXT_DOCUMENT | ./nouns.py - > out.csv
$ ./verbs.py $PDF_OR_TEXT_DOCUMENT.txt > out.csv
or
$ ./verbs.py $PDF_OR_TEXT_DOCUMENT.pdf > out.csv
or
$ cat $TEXT_DOCUMENT | ./verbs.py - > out.csv
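The three invocation forms above suggest input handling along the following lines: read from stdin when the argument is `-`, otherwise from the named file, then decide whether the bytes are a PDF. This is a minimal sketch, not the scripts' actual code. The function names are hypothetical, checking for the `%PDF-` header is a simplification that may differ from how the scripts detect PDFs, and treating "Unicode" input as UTF-8 is an assumption. PDF-to-text conversion itself is left out.

```python
import sys

# Signature that begins a PDF file's header line.
PDF_MAGIC = b"%PDF-"


def read_bytes(arg):
    """Read raw bytes from stdin when arg is '-', else from the named file."""
    if arg == "-":
        # Python 3 exposes bytes via sys.stdin.buffer; Python 2 reads bytes directly.
        stream = getattr(sys.stdin, "buffer", sys.stdin)
        return stream.read()
    with open(arg, "rb") as handle:
        return handle.read()


def looks_like_pdf(data):
    """Return True when the data starts with the PDF header signature."""
    return data.startswith(PDF_MAGIC)


def decode_text(data):
    """Decode plain-text input as UTF-8 (assumed), replacing invalid bytes.

    ASCII input decodes unchanged because ASCII is a subset of UTF-8.
    """
    return data.decode("utf-8", "replace")
```

A caller would route the bytes to a PDF-to-text step when `looks_like_pdf` returns True and to `decode_text` otherwise.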