Parse Optical Character Recognition (OCR) based PDFs

This project identifies tables in OCR-converted PDFs and extracts them. The source documents date from the 1920s and 1930s, and the OCR output is not very accurate: the same word is often recognized as several different words. Many approximations were therefore made, and some of the data was prepared manually. Manual supervision is also needed in many places to verify and set the approximation thresholds. Tables are extracted from the inaccurate OCR data and merged using a fuzzy string search algorithm to find the best approximation of each word.
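As a rough illustration of the fuzzy matching and threshold idea described above, here is a minimal sketch. It uses the standard library's `difflib` as a stand-in for fuzzywuzzy so that it runs without extra installs. The vocabulary, the sample OCR readings, the helper names `similarity` and `best_match`, and the threshold of 80 are all assumptions made for the example and are not taken from the notebooks.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a case-insensitive similarity score between 0 and 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(word, vocabulary, threshold=80):
    """Return (closest vocabulary entry, score).

    The entry is None when even the best score is below the threshold,
    which is the point where manual review would be needed.
    """
    score, candidate = max((similarity(word, c), c) for c in vocabulary)
    return (candidate, score) if score >= threshold else (None, score)


# Hypothetical known-good spellings and OCR readings (illustration only).
vocabulary = ["Republican", "Democrat", "Socialist"]
for ocr_word in ["Repubiican", "Dem0crat", "Total"]:
    print(ocr_word, "->", best_match(ocr_word, vocabulary))
```

Raising the threshold reduces false matches but leaves more words for manual review; lowering it does the opposite, which is why the thresholds need to be checked by hand.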

Libraries required:

pdfquery_utils
PyPDF2
fuzzywuzzy
pandas
tabula
logging

How to run it?

Step 1:

Run

Step1_Identify_Tables.ipynb

to identify and verify the table locations present in the pdfs.

Step 2:

Once the verification is complete, run

Step2_Split_PDFS_with_tables.ipynb

to extract all the tables identified in step 1.

Step 3:

Manually prepare unstructured CSV files by copying the table contents from the PDFs into them. Then run the following notebook on those CSV files:

Step3_Create_table.ipynb

This creates a table for each file, in the structured format of the original PDF.

Step 4:

Run

Step3_Create_table.ipynb

to merge all the tables together to create the final output.
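The notebook's actual merge logic is not reproduced here. The sketch below only illustrates one way that tables with slightly different OCR spellings of the same row label could be merged. It assumes each table is a plain dict of row label to value, and uses `difflib.get_close_matches` from the standard library in place of fuzzywuzzy. The function name `merge_tables`, the sample data, and the cutoff of 0.8 are hypothetical.

```python
from difflib import get_close_matches


def merge_tables(tables, cutoff=0.8):
    """Merge {row_label: value} tables into {label: [value per table]}.

    A label is mapped to an already-seen label when the two are similar
    enough; otherwise it starts a new row. Missing values stay None.
    If two labels in one table map to the same row, the later one wins.
    """
    merged = {}
    for index, table in enumerate(tables):
        for label, value in table.items():
            close = get_close_matches(label, list(merged), n=1, cutoff=cutoff)
            key = close[0] if close else label
            row = merged.setdefault(key, [None] * len(tables))
            row[index] = value
    return merged


# Hypothetical per-file tables; "Andersan" is an OCR misreading of "Anderson".
tables = [
    {"Anderson": 120, "Baker": 75},
    {"Andersan": 98, "Baker": 60, "Clark": 12},
]
print(merge_tables(tables))
```

The first spelling seen becomes the row label, so it helps to process the cleanest table first.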

Helper Files

The following IPython notebook helps extract the distinct words from all the words:

Cleanup_votes.ipynb
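One possible way to reduce a list of OCR readings to its distinct words is to group near-identical spellings together. The sketch below is an assumption about the general approach, not the contents of the notebook. It uses the standard library's `difflib` rather than fuzzywuzzy, and the function name `distinct_words`, the sample words, and the threshold of 0.85 are made up for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def distinct_words(words, threshold=0.85):
    """Greedily group similar spellings; return {representative: [variants]}.

    A word joins the first group whose representative it resembles closely
    enough, otherwise it becomes the representative of a new group.
    """
    groups = {}
    # Visit frequent spellings first so they become the representatives.
    for word, _count in Counter(words).most_common():
        for rep in groups:
            ratio = SequenceMatcher(None, word.lower(), rep.lower()).ratio()
            if ratio >= threshold:
                groups[rep].append(word)
                break
        else:
            groups[word] = [word]
    return groups


# Hypothetical OCR readings of three names.
words = ["Washington", "Washingtan", "Washington",
         "Jefferson", "Jeffersom", "Jefferson", "Lincoln"]
for representative, variants in distinct_words(words).items():
    print(representative, variants)
```

The grouping is greedy and depends on the threshold, so the resulting groups still need a manual check before they are used.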

The following IPython notebook creates an individual CSV file from each worksheet of an Excel workbook.

SplitExcelWorksheets.ipynb
