This repository has been archived by the owner on Nov 7, 2018. It is now read-only.

Python library to extract text from PDF, and default to OCR when text extraction fails.

18F/doc_processing_toolkit

Document Processing Toolkit

About

Python library to extract text from any file type compatible with Apache Tika. When text extraction from a PDF fails, it falls back to OCR.
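
The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: `extract_with_fallback`, `extract_text`, and `run_ocr` are hypothetical names, and the reading of `force_convert` as "skip plain extraction and go straight to OCR" is an assumption.

```python
from typing import Callable


def extract_with_fallback(
    doc_path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    force_convert: bool = False,
) -> str:
    """Return extracted text, falling back to OCR when extraction yields nothing.

    extract_text and run_ocr are hypothetical callables supplied by the caller;
    they stand in for a Tika request and a Ghostscript + Tesseract pipeline.
    """
    if not force_convert:
        text = extract_text(doc_path)
        # Treat empty or whitespace-only output as a failed extraction.
        if text and text.strip():
            return text
    # Either OCR was forced or plain extraction produced no usable text.
    return run_ocr(doc_path)
```

Passing the two steps in as callables keeps the decision logic separate from the external tools, so it can be exercised without a running Tika server.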

Dependencies
Installation
  1. Download tika-server-1.7.jar from Apache Tika
  2. Install Ghostscript (Mac: brew install ghostscript, Ubuntu: sudo apt-get install ghostscript)
  3. Install Tesseract (Mac: brew install tesseract, Ubuntu: sudo apt-get install tesseract-ocr)
  4. Install xpdf on Mac (brew tap homebrew/x11, then brew install xpdf) or poppler-utils on Ubuntu (sudo apt-get install poppler-utils)
  5. Install Python dependencies with pip install -r requirements.txt
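
Once the steps above are done, a quick check that the command-line tools are on the PATH can save some debugging. The sketch below is an assumption-laden helper, not part of the toolkit: the executable names (`java`, `gs`, `tesseract`, `pdftotext`) are the ones these packages usually install, and `missing_tools` is a hypothetical name.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; the toolkit may invoke the tools differently.
REQUIRED_TOOLS = ("java", "gs", "tesseract", "pdftotext")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```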
Usage

These scripts assume that an instance of Tika server is running.

Starting Tika servers:

java -jar tika-server-1.7.jar --port 9998
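
Because extraction depends on the server being reachable, it can help to check the port before processing documents. The following is a minimal sketch that only tests whether something accepts TCP connections on the port; it does not verify that the listener is actually Tika. The function name `tika_server_is_up` is hypothetical, and the default port matches the command shown above.

```python
import socket


def tika_server_is_up(host: str = "localhost", port: int = 9998, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```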

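In a Python script:

from textextraction.extractors import text_extractor

# Hypothetical path; replace with the document you want to process
doc_path = "path/to/document.pdf"
text_extractor(doc_path=doc_path, force_convert=False)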
Tests

In order to run tests:

  1. All requirements must be installed
  2. Both Tika servers need to be running

Tests are run with nose.

Installation:

pip install -r test-requirements.txt

Running tests:

nosetests

OCR methodology

Documents are converted to grayscale PNGs at 300 DPI using Ghostscript and then OCRed with Tesseract. The OCR settings are adapted from "Optimal Image Conversion Settings for Tesseract OCR" and the Free Law Project's CourtListener.
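
The two-step pipeline described above can be sketched as a pair of command builders. This is an illustration under stated assumptions, not the toolkit's exact invocation: the function names are hypothetical, and the flags are standard Ghostscript and Tesseract options chosen to match the description (grayscale PNG device, 300 DPI), not the tuned settings from the cited sources.

```python
from typing import List


def ghostscript_command(pdf_path: str, out_pattern: str, dpi: int = 300) -> List[str]:
    """Build a Ghostscript call that renders each PDF page to a grayscale PNG."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",  # run non-interactively and quietly
        "-sDEVICE=pnggray",                 # 8-bit grayscale PNG output device
        f"-r{dpi}",                         # rendering resolution in DPI
        f"-sOutputFile={out_pattern}",      # e.g. "page_%03d.png", one file per page
        pdf_path,
    ]


def tesseract_command(png_path: str, out_base: str, lang: str = "eng") -> List[str]:
    """Build a Tesseract call; Tesseract writes the text to out_base + '.txt'."""
    return ["tesseract", png_path, out_base, "-l", lang]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Building the arguments as lists avoids shell quoting problems with file names that contain spaces.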
