Extract Tables and Data from old PDFs

Project Overview

Extracting tables from old, image-based PDF files can be a messy process: it often takes repeated adjustments to accommodate unpredictable changes in the format of the PDF files.

This repository contains scripts for downloading and processing PDF files from the Pakistan census as an example. The process involves extracting data from downloaded PDFs and converting this information into structured formats suitable for analysis.

The main components include:

download_files.py - This script downloads PDF files containing census data and converts them to JPEG images for easier processing.
extract_text.py - Uses Optical Character Recognition (OCR) to extract text from the converted JPEG images and saves the data into an Excel file.

Prerequisites

Before running these scripts, ensure you have the following installed:

Python 3.8 or higher
Libraries: requests, pdf2image, Pillow, opencv-python, pytesseract, and pandas.
Tesseract-OCR: This project uses Pytesseract, which is a wrapper for Google’s Tesseract-OCR Engine. It must be installed separately from the Python packages.
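Because Tesseract is installed separately from the Python packages, it can help to confirm that everything is visible before running the scripts. The check below is a minimal sketch and not part of the repository: the function name `missing_prerequisites` is hypothetical, and the `pdftoppm` lookup assumes pdf2image is backed by the Poppler utilities, which it normally needs in order to render pages.

```python
import importlib.util
import shutil

# pip package name -> importable module name (they differ for some packages).
REQUIRED_MODULES = {
    "requests": "requests",
    "pdf2image": "pdf2image",
    "Pillow": "PIL",
    "opencv-python": "cv2",
    "pytesseract": "pytesseract",
    "pandas": "pandas",
}

# System executables that must be on PATH: the Tesseract engine itself and
# Poppler's pdftoppm, which pdf2image calls to render PDF pages (assumption).
REQUIRED_BINARIES = ["tesseract", "pdftoppm"]


def missing_prerequisites():
    """Return a list of human-readable names for anything that cannot be found."""
    missing = [
        pip_name
        for pip_name, module_name in REQUIRED_MODULES.items()
        if importlib.util.find_spec(module_name) is None
    ]
    missing += [
        f"{binary} (system executable)"
        for binary in REQUIRED_BINARIES
        if shutil.which(binary) is None
    ]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All prerequisites found.")
```

If Tesseract is installed but not on `PATH`, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.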

Installation

Clone the repository:

git clone https://github.com/wali-reheman/Extract-data-from-old-pdf.git

Navigate to the cloned directory:
```
cd Extract-data-from-old-pdf
```
Install the required Python libraries:
```
pip install -r requirements.txt
```

Usage

Running the download script:

python download_files.py

This script will download all PDFs specified within the script and convert each to JPEG format, storing them in the designated directory.
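The download-and-convert step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of download_files.py: the function names, the output file naming scheme, and the 300 dpi default are all hypothetical. Third-party imports are placed inside the functions so the naming helper can be used without them.

```python
from pathlib import Path
from typing import List


def download_pdf(url: str, dest_dir: Path) -> Path:
    """Download one PDF into dest_dir and return the local path."""
    import requests  # third-party; imported lazily

    dest_dir.mkdir(parents=True, exist_ok=True)
    pdf_path = dest_dir / url.rsplit("/", 1)[-1]
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    pdf_path.write_bytes(response.content)
    return pdf_path


def page_image_path(pdf_path: Path, page_number: int, out_dir: Path) -> Path:
    """Build the JPEG path for one page (hypothetical naming scheme)."""
    return out_dir / f"{pdf_path.stem}_page{page_number:03d}.jpg"


def pdf_to_jpegs(pdf_path: Path, out_dir: Path, dpi: int = 300) -> List[Path]:
    """Render every page of a PDF to a JPEG file and return the saved paths."""
    from pdf2image import convert_from_path  # third-party; needs Poppler

    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.convert("RGB").save(target, "JPEG")
        saved.append(target)
    return saved
```

Calling `pdf_to_jpegs(download_pdf(url, Path("pdfs")), Path("images"))` for each URL would reproduce the overall flow. A higher `dpi` generally gives the OCR step more detail to work with, at the cost of larger files.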

Running the extraction script:

python extract_text.py

Run this script after the PDFs have been converted to images. It performs OCR on the images to extract text and saves the result in an Excel file named Pakistan_religion.xlsx.
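A minimal sketch of the OCR-to-Excel step is shown below. It is not the code in extract_text.py: the function names are hypothetical, and the row-splitting rule (cells separated by two or more spaces) is an assumption that relies on asking Tesseract to preserve inter-word spacing. Real census tables will likely need table-specific parsing on top of this.

```python
import re
from pathlib import Path
from typing import List


def rows_from_text(text: str) -> List[List[str]]:
    """Split OCR output into rows of cells.

    One row per non-empty line; cells are separated by runs of 2+ whitespace
    characters (an assumed convention, not the repository's).
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def ocr_image(image_path: Path) -> str:
    """Run Tesseract on one image and return the recognised text."""
    import cv2  # third-party; imported lazily
    import pytesseract

    image = cv2.imread(str(image_path))
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # --psm 6 treats the page as one uniform block of text; preserving
    # inter-word spaces keeps the column gaps that rows_from_text relies on.
    config = "--psm 6 -c preserve_interword_spaces=1"
    return pytesseract.image_to_string(gray, config=config)


def images_to_excel(image_dir: Path, output_file: str = "Pakistan_religion.xlsx") -> None:
    """OCR every JPEG in image_dir and write all rows to one Excel sheet."""
    import pandas as pd  # writing .xlsx also needs an engine such as openpyxl

    records = []
    for image_path in sorted(image_dir.glob("*.jpg")):
        for row in rows_from_text(ocr_image(image_path)):
            # Keep the source file name so rows can be traced back to a page.
            records.append([image_path.name] + row)
    pd.DataFrame(records).to_excel(output_file, index=False, header=False)
```

Keeping the source image name in the first column makes it easier to check suspicious values against the original page.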

Configuration

You may need to adjust the paths and specific URLs in the scripts to match your directory structure or to point to different data sources.
OCR settings can be tuned in extract_text.py for better accuracy depending on the quality of the images.
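As an illustration of the kind of tuning this refers to, the sketch below builds a Tesseract configuration string and binarizes an image before OCR. The helper names and default values are hypothetical and are not taken from extract_text.py. The flags themselves (`--oem`, `--psm`, `tessedit_char_whitelist`) are standard Tesseract options, though how strictly the whitelist is honoured can vary between Tesseract versions.

```python
def build_tesseract_config(psm: int = 6, oem: int = 3, whitelist: str = "") -> str:
    """Assemble a config string to pass as pytesseract's `config` argument.

    psm: page segmentation mode (6 = assume a single uniform block of text).
    oem: OCR engine mode (3 = default, based on what is available).
    whitelist: if non-empty, restrict recognition to these characters.
    """
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def binarize(image_path: str):
    """Return a black-and-white version of a scanned page.

    Light denoising followed by Otsu thresholding often helps on faded scans,
    but whether it improves accuracy depends on the image.
    """
    import cv2  # third-party; imported lazily

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    blurred = cv2.medianBlur(gray, 3)
    _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

For example, `pytesseract.image_to_string(binarize("images/page.jpg"), config=build_tesseract_config(whitelist="0123456789,."))` would restrict recognition to numeric characters, which may suit columns of counts; the image path here is a placeholder.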

Contributing

Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Contact

For support or queries, contact [email protected].

wali-reheman/Extract-data-from-old-pdf
