Python library to extract text from any file type compatible with Apache Tika. It falls back to OCR when text extraction from a PDF file fails.
- Download tika-server-1.7.jar from Apache Tika
- Install Ghostscript. Mac:
brew install ghostscript
Ubuntu: sudo apt-get install ghostscript
- Install Tesseract. Mac:
brew install tesseract
Ubuntu: sudo apt-get install tesseract-ocr
- Install xpdf (Mac) or poppler-utils (Ubuntu). Mac:
brew tap homebrew/x11
brew install xpdf
Ubuntu: sudo apt-get install poppler-utils
- Install Python dependencies with
pip install -r requirements.txt
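Before going further, it can help to confirm that the external tools installed above are actually on the PATH. The following is a minimal sketch, not part of this library. The executable names are assumptions: `gs` for Ghostscript, `tesseract` for Tesseract, `pdftotext` (shipped by both xpdf and poppler-utils), and `java` for running the Tika server jar.

```python
import shutil

# Assumed executable names, not taken from this project's code:
# gs (Ghostscript), tesseract (Tesseract), pdftotext (xpdf or
# poppler-utils), java (needed to run the Tika server jar).
REQUIRED_TOOLS = ("gs", "tesseract", "pdftotext", "java")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables: " + ", ".join(missing))
    else:
        print("All external tools found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing the check itself.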
These scripts assume that an instance of the Tika server is running.
Starting the Tika server
java -jar tika-server-1.7.jar --port 9998
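Since extraction depends on the server being up, a script may want to check the port before it starts. The sketch below only tests whether something accepts TCP connections on port 9998 (the port used in the command above). It does not verify that the listener is actually Tika, and the function name is hypothetical.

```python
import socket


def tika_is_listening(host="localhost", port=9998, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    This is only a reachability check; it does not confirm that the
    process behind the port is a Tika server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    if tika_is_listening():
        print("Port 9998 is accepting connections.")
    else:
        print("Nothing is listening on port 9998; start the Tika server first.")
```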
In a Python script
from textextraction.extractors import text_extractor
doc_path = "path/to/document.pdf"  # placeholder: path of the file to extract
text_extractor(doc_path=doc_path, force_convert=False)
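The description at the top says the library falls back to OCR when PDF text extraction fails. The control flow can be sketched roughly as below. This is an illustration under stated assumptions, not the library's actual implementation: `extract_text` and `ocr_text` are hypothetical callables standing in for the Tika and Tesseract steps, "failure" is assumed to mean an exception or empty output, and `force_convert=True` is assumed to skip straight to OCR.

```python
def extract_with_fallback(doc_path, extract_text, ocr_text, force_convert=False):
    """Try direct text extraction first and fall back to OCR.

    extract_text and ocr_text are hypothetical callables that take a
    path and return a string.
    """
    if not force_convert:
        try:
            text = extract_text(doc_path)
        except Exception:
            # Assumption: any extraction error counts as a failure.
            text = ""
        # Assumption: empty or whitespace-only output also counts as failure.
        if text and text.strip():
            return text
    return ocr_text(doc_path)


if __name__ == "__main__":
    # Stub callables simulating a scanned PDF with no embedded text.
    result = extract_with_fallback(
        "scan.pdf",
        extract_text=lambda path: "",
        ocr_text=lambda path: "text recovered by OCR",
    )
    print(result)
```

Passing the two steps in as callables keeps the fallback logic testable without a running Tika server or an installed Tesseract.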
In order to run tests:
- All requirements must be installed
- The Tika server must be running
Tests are run with nose
Installation
pip install -r test-requirements.txt
Running tests
nosetests
Documents are converted to grayscale PNGs at 300 DPI using Ghostscript and then OCRed with Tesseract. The OCR settings are adapted from OPTIMAL IMAGE CONVERSION SETTINGS FOR TESSERACT OCR and the Free Law Project's CourtListener.
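The conversion described above can be sketched as two command lines, built here as argument lists. This is an approximation based on the stated settings (grayscale PNG, 300 DPI), not the exact invocation this library uses. The function names and the output file pattern are hypothetical; the Ghostscript options shown (`-sDEVICE=pnggray`, `-r`, `-sOutputFile`, `-dNOPAUSE`, `-dBATCH`, `-dSAFER`) are standard ones.

```python
def ghostscript_command(pdf_path, output_pattern="page-%03d.png", dpi=300):
    """Build a Ghostscript command that renders each PDF page to a gray PNG."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",      # run non-interactively
        "-sDEVICE=pnggray",                     # 8-bit grayscale PNG output
        "-r{}".format(dpi),                     # rendering resolution in DPI
        "-sOutputFile={}".format(output_pattern),
        pdf_path,
    ]


def tesseract_command(image_path, output_base):
    """Build a Tesseract command; the text is written to <output_base>.txt."""
    return ["tesseract", image_path, output_base]


if __name__ == "__main__":
    print(" ".join(ghostscript_command("document.pdf")))
    print(" ".join(tesseract_command("page-001.png", "page-001")))
```

Each list can be passed to `subprocess.check_call` once Ghostscript and Tesseract are installed.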