Text Extraction from PDFs

In this repo, I provide a guide to extracting text data from PDF files in Python. The approach covers text extraction for the different components a PDF can contain:

Plain text
Tables
Images in the PDF

For the full guide you can read my article on Medium: https://bit.ly/3RtPuCw

To achieve this, we use the PDFMiner library to perform an initial analysis of the page layout and identify which kind of component each element is. Then, depending on the component found, we apply the appropriate function and Python library.
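The layout analysis and dispatch described above could be sketched roughly as follows. This is a minimal illustration and not the notebook's code: `dispatch` and `analyse_layout` are hypothetical names, and treating `LTFigure` as an image and `LTRect` as a possible table border is an assumption. It expects the pdfminer.six package.

```python
def dispatch(element, handlers):
    """Call the first handler whose type matches the element; return None if none match."""
    for element_type, handler in handlers:
        if isinstance(element, element_type):
            return handler(element)
    return None


def analyse_layout(pdf_path):
    """Yield (page_number, component_kind) for each top-level layout element."""
    # Imported lazily so dispatch() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTFigure, LTRect, LTTextContainer

    # Assumption: figures hold images and rectangles hint at table borders.
    handlers = [
        (LTTextContainer, lambda el: "text"),
        (LTFigure, lambda el: "image"),
        (LTRect, lambda el: "table candidate"),
    ]
    for page_number, page in enumerate(extract_pages(pdf_path)):
        # PDF y coordinates grow upwards, so a larger y1 means higher on the page.
        for element in sorted(page, key=lambda el: el.y1, reverse=True):
            kind = dispatch(element, handlers)
            if kind is not None:
                yield page_number, kind
```

Calling `list(analyse_layout("Example PDF.pdf"))` would then list the kind of each recognised element, page by page.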

The output of this process is a Python dictionary containing the information extracted from each page of the PDF file. Each key in this dictionary represents a page number of the document, and its corresponding value is a list of the following 5 nested lists:

The text extracted per text block of the corpus
The format of the text in each text block in terms of font family and size
The text extracted from the images on the page
The text extracted from tables in a structured format
The complete text content of the page
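To make the shape of this dictionary concrete, here is a small sketch. The key format `"Page_0"`, the helper name `new_page_record`, and the sample values are all made up for illustration; only the order of the five lists follows the description above.

```python
def new_page_record():
    """Return an empty per-page value: five lists in the order listed above."""
    block_texts = []    # text extracted per text block
    block_formats = []  # font family and size per text block
    image_texts = []    # text extracted from images on the page
    table_texts = []    # tables rendered in a structured format
    page_content = []   # complete text content of the page
    return [block_texts, block_formats, image_texts, table_texts, page_content]


extracted = {}
extracted["Page_0"] = new_page_record()  # hypothetical key format

# Unpack the five lists and fill them with made-up sample values.
block_texts, block_formats, image_texts, table_texts, page_content = extracted["Page_0"]
block_texts.append("Hello world\n")
block_formats.append([("Helvetica", 12.0)])
page_content.append("Hello world\n")
```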

A flowchart of the process is included in the repo as Text Extraction Flowchart.png.

To extract text from Plain Corpus:

We use the get_text() method of the LTTextContainer element provided by PDFMiner to extract the text in the container.
We iterate through the LTTextContainer object to access each LTTextLine object, and then access each individual character element as an LTChar, collecting the metadata for its format (font name and size).
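The two steps above might look like the sketch below. The helper names `unique_formats` and `extract_text_block` are hypothetical, and rounding the font size to one decimal is an assumption made to keep near-identical sizes together. It relies on the `fontname` and `size` attributes of pdfminer.six's `LTChar`.

```python
def unique_formats(chars):
    """Return the distinct (fontname, size) pairs of the characters, in first-seen order."""
    seen = {}
    for char in chars:
        # Assumption: round sizes so 11.98 and 12.02 count as the same format.
        seen.setdefault((char.fontname, round(char.size, 1)), None)
    return list(seen)


def extract_text_block(container):
    """Return (text, formats) for a pdfminer LTTextContainer."""
    # Imported lazily so unique_formats() works without pdfminer.six installed.
    from pdfminer.layout import LTChar, LTTextLine

    chars = [
        char
        for line in container
        if isinstance(line, LTTextLine)
        for char in line
        if isinstance(char, LTChar)  # skips LTAnno items such as inferred spaces
    ]
    return container.get_text(), unique_formats(chars)
```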

To extract text from Images:

We use the crop_image() function to take the coordinates of the image box detected by PDFMiner, crop the page to that box, and save it as a new PDF in our directory using the PyPDF2 library.
We employ the convert_from_path() function from the pdf2image library to convert the PDF files in the directory into a list of images, saving them in PNG format.
We use the Image module of the PIL package and the image_to_string() function of pytesseract to extract text from the images using the Tesseract OCR engine.
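Chained together, the three image steps could be sketched as below. This is an illustration under assumptions rather than the notebook's implementation: `crop_corners`, `ocr_image_element` and the temporary file names are hypothetical, the PyPDF2 calls assume the 3.x snake_case API, and pdf2image and pytesseract need the Poppler and Tesseract binaries installed.

```python
from pathlib import Path


def crop_corners(element):
    """Return the (lower_left, upper_right) corners of a layout element's bounding box."""
    return (element.x0, element.y0), (element.x1, element.y1)


def ocr_image_element(pdf_path, page_index, element, work_dir="."):
    """Crop one page to the element's box, rasterise the crop, and OCR it."""
    # Imported lazily so crop_corners() works without the third-party packages.
    import pytesseract
    from pdf2image import convert_from_path
    from PIL import Image
    from PyPDF2 import PdfReader, PdfWriter

    work_dir = Path(work_dir)
    cropped_pdf = work_dir / "cropped_image.pdf"  # hypothetical file name
    cropped_png = work_dir / "cropped_image.png"  # hypothetical file name

    # Step 1: shrink the page's media box to the image box and save it as a new PDF.
    page = PdfReader(str(pdf_path)).pages[page_index]
    lower_left, upper_right = crop_corners(element)
    page.mediabox.lower_left = lower_left
    page.mediabox.upper_right = upper_right
    writer = PdfWriter()
    writer.add_page(page)
    with open(cropped_pdf, "wb") as handle:
        writer.write(handle)

    # Step 2: convert the one-page PDF into an image and save it as PNG.
    images = convert_from_path(str(cropped_pdf))
    images[0].save(cropped_png, "PNG")

    # Step 3: run Tesseract on the saved PNG.
    with Image.open(cropped_png) as image:
        return pytesseract.image_to_string(image)
```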

To extract text from Tables:

We employ the extract_table() function of the pdfplumber library to extract the contents of the table into a list of lists.
We use table_converter() to join the contents of those lists into a table-like string.
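As a rough sketch of these two steps: `table_converter` below is one possible shape for the function named above, not necessarily the notebook's version, and the pipe-delimited row format and the helper name `extract_first_table` are assumptions. pdfplumber's `Page.extract_table()` returns a list of rows, or `None` when no table is found.

```python
def table_converter(table):
    """Join a list of rows (lists of cells) into a pipe-delimited, table-like string."""
    rows = []
    for row in table:
        # Empty cells come back as None; wrapped cell text contains newlines.
        cells = ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        rows.append("|" + "|".join(cells) + "|")
    return "\n".join(rows)


def extract_first_table(pdf_path, page_index):
    """Return the first table on a page as a string, or None if the page has no table."""
    # Imported lazily so table_converter() works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        table = pdf.pages[page_index].extract_table()
    return None if table is None else table_converter(table)
```

For example, `table_converter([["Name", "Qty"], ["Apple", None]])` gives two lines, `|Name|Qty|` and `|Apple||`.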


