JonathanOppenheimer/peru-elections-ocr

OCR Signature Detection Project

This project detects and counts handwritten signatures in specific areas of standardized election records from Peru's ONPE. Each PDF contains several signature boxes spread across its pages, and the project applies a sequence of image processing techniques to decide which boxes are signed.

Installation

To set up this project locally, follow these steps:

1. Clone the Repository

First, clone the repository to your local machine:

git clone https://github.com/JonathanOppenheimer/peru-elections-ocr.git
cd peru-elections-ocr

2. Set Up a Virtual Environment (Optional but recommended)

Create a virtual environment to isolate the project’s dependencies:

python3 -m venv .venv
source .venv/bin/activate  # On Windows, use .venv\Scripts\activate

3. Install Dependencies

Install the required Python packages:

pip install -r requirements.txt

4. Install Tesseract OCR

This project requires Tesseract for optical character recognition (OCR). Install Tesseract as follows:

  • macOS:

    brew install tesseract
  • Ubuntu/Debian:

    sudo apt install tesseract-ocr
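  • Windows: Download and run a Tesseract installer for Windows (installers are linked from the Tesseract documentation), then make sure the tesseract executable is on your PATH.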

5. Verify Installation

To verify that everything is installed correctly, try running the example script:

python ocr_signature_detection.py

Usage

This project processes two-page PDF documents, detects signatures in predefined signature boxes on each page, and saves the signature count to a CSV file. Here's how to run the program:

  1. Place your input PDFs in the ./data/ directory.

  2. Run the signature detection script with:

    python ocr_signature_detection.py
  3. The results, including the table number and the count of signatures from the different boxes (numobs1, numobs2, numobs3), will be saved to a CSV file in the ./data/ directory.
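As a rough illustration of the output described above, the following sketch writes one row per PDF with Python's standard csv module. The column name table_number, the helper write_counts, and the sample values are assumptions made for this example; only numobs1, numobs2, and numobs3 come from this README, and the actual script may lay out its CSV differently.

```python
import csv
import io

# Hypothetical column layout: "table_number" is an assumed name.
FIELDS = ["table_number", "numobs1", "numobs2", "numobs3"]

def write_counts(rows, fh):
    """Write a header plus one row of signature counts per processed PDF."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Made-up sample row, used only to show the format.
rows = [{"table_number": "000123", "numobs1": 2, "numobs2": 0, "numobs3": 1}]

buf = io.StringIO()  # stands in for a file opened under ./data/
write_counts(rows, buf)
print(buf.getvalue())
```

To write a real file instead of an in-memory buffer, open it with newline="" so the csv module controls the line endings.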

Directory Structure

The key directories used by this project are:

  • ./data/ - For input PDFs and output CSV files.
  • ./templates/ - For storing empty signature box templates.
  • ./debug/ - Optional; holds intermediate images saved while debugging.

Image Processing Techniques

This project leverages several image processing techniques to reliably detect signatures within predefined regions of interest (bounding boxes). Below is a brief explanation of the techniques used:

1. Grayscale Conversion

Images are converted from RGB to grayscale, which reduces three color channels to a single intensity channel. Thresholding and the other intensity-based steps operate on that single channel, so the conversion simplifies all later processing.
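To make the conversion concrete, here is a dependency-free sketch. It uses the ITU-R BT.601 luma weights (0.299, 0.587, 0.114), a common choice for RGB-to-gray conversion; whether the project's script uses these exact weights is an assumption, and rgb_to_gray is a name invented for this example.

```python
def rgb_to_gray(pixels):
    """Convert rows of (R, G, B) tuples into rows of 0-255 grayscale ints."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]

# Tiny 2x2 test image: white, black, pure red, pure blue.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = rgb_to_gray(image)
print(gray)
```

Note that red and blue map to different gray levels: the weights reflect how bright each channel appears, not a plain average.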

2. Median Blurring

We apply a median blur to the image, which is particularly useful for reducing noise (like ink spots or specks of dust). This type of blur helps remove salt-and-pepper noise while preserving the edges of the content (like handwritten signatures).
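The effect on an isolated speck can be sketched in plain Python. This is a simplified illustration, not the project's code: median_blur is a hypothetical helper, and at the image border it simply uses the part of the window that lies inside the image, which differs from how image libraries typically pad borders.

```python
from statistics import median

def median_blur(gray, ksize=3):
    """Replace each pixel with the median of its ksize x ksize neighbourhood."""
    h, w = len(gray), len(gray[0])
    r = ksize // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the neighbours that fall inside the image.
            window = [
                gray[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = int(median(window))
    return out

# White paper with a single dark speck in the middle.
noisy = [
    [255, 255, 255, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255, 255, 255],
]
clean = median_blur(noisy)
print(clean)
```

Because the speck is always outvoted by its white neighbours, it disappears entirely, whereas an averaging blur would smear it into a gray blotch.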

3. Otsu’s Thresholding

Otsu's method is an automatic thresholding technique that determines the optimal threshold value to convert the grayscale image into a binary image (black and white). This is critical for differentiating between the background and the actual signatures.
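The idea behind Otsu's method can be shown in a few lines: try every possible threshold and keep the one that maximizes the variance between the two resulting pixel classes. The sketch below is a plain-Python illustration under that definition; otsu_threshold and binarize are hypothetical names, and mapping ink to 1 and paper to 0 is a convention chosen for this example.

```python
def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximises between-class variance.

    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of intensities in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so no split to evaluate
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray, t):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [[1 if v <= t else 0 for v in row] for row in gray]

# Light paper with three dark ink pixels.
gray = [
    [250, 245, 30, 248],
    [252, 25, 35, 251],
]
t = otsu_threshold(gray)
binary = binarize(gray, t)
print(t, binary)
```

No threshold has to be chosen by hand, which matters when scans differ in brightness from one PDF to the next.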

4. Structural Similarity Index (SSIM)

For signature detection, we use the SSIM algorithm to compare each signature box with a predefined empty template. SSIM measures the similarity between two images and helps us identify whether a signature is present in a given box. If the similarity score falls below a certain threshold, we assume a signature is present.
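The comparison can be illustrated with a simplified, global form of SSIM that is computed once over the whole box. Library implementations usually average SSIM over small sliding windows instead, so their scores will differ from this sketch. The names global_ssim and has_signature and the 0.8 threshold are assumptions for illustration; they are not taken from the project's code.

```python
def global_ssim(a, b, data_range=255):
    """SSIM computed once over two equally sized grayscale images."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same shape")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

def has_signature(box, empty_template, threshold=0.8):
    """Hypothetical rule: low similarity to the empty box means 'signed'."""
    return global_ssim(box, empty_template) < threshold

empty = [[255, 255, 255, 255], [255, 255, 255, 255]]
signed = [[255, 0, 0, 255], [255, 255, 0, 255]]

print(has_signature(empty, empty))   # an empty box matches its template
print(has_signature(signed, empty))  # ink lowers the similarity score
```

Comparing against an empty template rather than counting dark pixels means printed borders and labels that appear in every box do not count as ink.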


File Structure

.
├── data/                        # Input PDFs and output CSV files
│   └── example.pdf              # Example PDF for testing
├── templates/                   # Empty signature box templates
│   └── empty_numobs1.png        # Example template for numobs1
├── ocr_signature_detection.py   # Main Python script for signature detection
├── README.md                    # Project README
├── requirements.txt             # Project dependencies
└── signature_counts.csv         # Output CSV file with signature counts

Contributing

Contributions are welcome! If you have suggestions or find issues, feel free to open a pull request or issue on the repository.


License

This project is licensed under the MIT License.


Additional Notes

  • Templates: To improve signature detection, it's important to have accurate templates of the empty signature boxes for numobs1, numobs2, and numobs3. These templates can be manually selected using the post-processed images generated by the system.
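  • Performance Considerations: If you see false positives or missed signatures, try adjusting the SSIM threshold in the template_match_signature_area function. Because a box counts as signed when its score falls below the threshold, lowering the threshold reduces false positives and raising it reduces missed signatures.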
