phrases

Extracting noun/verb phrases from documents

Requirements:

  • input is either a stdin pipe or a filename
  • if the input file is not plain text, it is converted to plain text
  • the input file can be a PDF, Unicode text, or ASCII text
  • output is a histogram of noun and verb phrases, complete with adjective and adverbial modifiers, contained in the input

This script uses TextBlob for the heavy lifting.
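The histogram step in the requirements can be sketched with the standard library alone. This is a minimal illustration, not the code in this repository: the function names `phrase_histogram` and `histogram_rows`, the case-folding, and the `phrase,count` CSV layout are all assumptions. In the real scripts the phrase list would come from TextBlob (for example, the `noun_phrases` property of a `TextBlob` object); here a hard-coded sample stands in for it.

```python
import csv
import sys
from collections import Counter


def phrase_histogram(phrases):
    """Count occurrences of each phrase, ignoring case and outer whitespace."""
    return Counter(p.strip().lower() for p in phrases if p.strip())


def histogram_rows(histogram):
    """Return (phrase, count) pairs, most frequent first, ties alphabetical."""
    return sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample; TextBlob would normally supply these phrases.
sample = ["virtual environment", "input file", "Input File", "python package"]

# Assumed output layout: one "phrase,count" row per distinct phrase.
writer = csv.writer(sys.stdout, lineterminator="\n")
for phrase, count in histogram_rows(phrase_histogram(sample)):
    writer.writerow([phrase, count])
# prints:
# input file,2
# python package,1
# virtual environment,1
```

Sorting by descending count keeps the most common phrases at the top of out.csv, which is usually what one wants to read first.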

Installation

This tool depends upon Python2 and a few C and Python libraries. See the first step below.

Note that one must be careful about macOS installations, which no longer include Python 2. We recommend using Homebrew's pyenv (and, indeed, these instructions assume a working Homebrew installation).

  1. Install distribution-level dependencies
     • Ubuntu/Debian: $ sudo apt install build-essential libpoppler-cpp-dev libmagic-dev pkg-config python3-venv
     • macOS: $ brew install poppler libmagic
       1. brew install pyenv pyenv-virtualenv (v2.4.10 is latest as of this writing)
       2. pyenv install 3.12.5 (v3.12.5 is the latest release of Python3)
       3. pyenv install 2.7.18 (v2.7.18 is the final release of Python2)
       4. pyenv global system 3.12.5 2.7.18 (puts both versions into the global environment)
       5. Run eval "$(pyenv init -)" and consider adding it to your shell startup.
  2. Create a Python virtual environment
     • $ python3 -m venv env makes one named env
     • $ source env/bin/activate lets you work in that environment
     • $ deactivate gets you back to your normal environment
  3. Install Python package dependencies, making sure you use Python2's pip:
     • $ pip2 install -r requirements.txt
  4. Install Pattern locally
     • $ pip2 install Pattern==2.6
  5. Download necessary NLTK data
     • $ python2 -c 'import nltk; nltk.download("brown"); nltk.download("punkt")'
     • $ python2 -m textblob.download_corpora
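After the download step, a quick way to confirm that the data landed where NLTK can find it is to ask `nltk.data.find` for each resource. The helper below is a sketch and not part of this repository; the name `missing_nltk_resources` is made up, and it checks only the two resources named above (`brown` and `punkt`), not everything that `textblob.download_corpora` fetches. It is written to run under either Python 2 or Python 3.

```python
def missing_nltk_resources(resources=("corpora/brown", "tokenizers/punkt")):
    """Return the subset of `resources` that NLTK cannot locate locally."""
    try:
        import nltk
    except ImportError:
        # NLTK itself is not installed, so every resource counts as missing.
        return list(resources)
    missing = []
    for path in resources:
        try:
            nltk.data.find(path)  # raises LookupError when the resource is absent
        except LookupError:
            missing.append(path)
    return missing


if __name__ == "__main__":
    # An empty list means both resources were found.
    print(missing_nltk_resources())
```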

Testing the Installation

The provided Makefile has two rules that run the extraction commands on this README. If those commands run with no output beyond printing the selftest commands, the installation is working.

Usage

Extracting Noun Phrases

  • $ ./nouns.py $PDF_OR_TEXT_DOCUMENT.txt > out.csv or
  • $ ./nouns.py $PDF_OR_TEXT_DOCUMENT.pdf > out.csv or
  • $ cat $TEXT_DOCUMENT | ./nouns.py - > out.csv

Extracting Verb Phrases

  • $ ./verbs.py $PDF_OR_TEXT_DOCUMENT.txt > out.csv or
  • $ ./verbs.py $PDF_OR_TEXT_DOCUMENT.pdf > out.csv or
  • $ cat $TEXT_DOCUMENT | ./verbs.py - > out.csv
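Both scripts accept either a filename or `-` for stdin, and the input may be a PDF or plain text. One way such input handling could look is sketched below. This is an illustration under assumptions, not the scripts' actual code: the helper names are hypothetical, PDF detection is done by checking for the `%PDF-` marker that PDF files usually begin with (the real scripts may rely on libmagic instead), and the PDF-to-text conversion itself is left out.

```python
import sys


def read_input_bytes(argument):
    """Return raw bytes from stdin when `argument` is "-", else from the named file."""
    if argument == "-":
        # Python 3 exposes the binary stream as sys.stdin.buffer;
        # on Python 2, sys.stdin already yields bytes.
        return getattr(sys.stdin, "buffer", sys.stdin).read()
    with open(argument, "rb") as handle:
        return handle.read()


def looks_like_pdf(data):
    """Guess whether `data` is a PDF by its leading marker (no libmagic needed)."""
    return data.startswith(b"%PDF-")


def decode_text(data):
    """Decode plain text as UTF-8, which also covers ASCII input."""
    return data.decode("utf-8", "replace")


if __name__ == "__main__":
    raw = read_input_bytes(sys.argv[1] if len(sys.argv) > 1 else "-")
    if looks_like_pdf(raw):
        sys.stderr.write("PDF input: would be converted to plain text first\n")
    else:
        sys.stdout.write(decode_text(raw))
```

Reading bytes first and deciding afterwards means the same code path serves files and pipes, since a pipe has no filename extension to inspect.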

Useful Links