
LLM-Aided OCR Project

Introduction

The LLM-Aided OCR Project is a system designed to significantly improve the quality of Optical Character Recognition (OCR) output. By post-processing raw OCR text with large language models (LLMs), it produces accurate, well-formatted, and readable documents.

Example Outputs

To see what the LLM-Aided OCR Project can do, check out these example outputs:

Features

  • PDF to image conversion
  • OCR using Tesseract
  • Advanced error correction using LLMs (local or API-based)
  • Smart text chunking for efficient processing
  • Markdown formatting option
  • Header and page number suppression (optional)
  • Quality assessment of the final output
  • Support for both local LLMs and cloud-based API providers (OpenAI, Anthropic)
  • Asynchronous processing for improved performance
  • Detailed logging for process tracking and debugging
  • GPU acceleration for local LLM inference

Detailed Technical Overview

PDF Processing and OCR

  1. PDF to Image Conversion

    • Function: convert_pdf_to_images()
    • Uses pdf2image library to convert PDF pages into images
    • Supports processing a subset of pages with max_pages and skip_first_n_pages parameters
  2. OCR Processing

    • Function: ocr_image()
    • Utilizes pytesseract for text extraction
    • Includes image preprocessing with preprocess_image() function (see the sketch after this list):
      • Converts image to grayscale
      • Applies binary thresholding using Otsu's method
      • Performs dilation to enhance text clarity
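
A minimal sketch of this path, assuming opencv-python (cv2), numpy, pdf2image (which requires poppler to be installed), and pytesseract; the function bodies and the page-range mapping are illustrative rather than copied from the source:

```python
# Minimal sketch of the preprocessing + OCR path (illustrative, not verbatim source).
import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

def preprocess_image(image: Image.Image) -> Image.Image:
    gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)  # grayscale
    # Binary thresholding with Otsu's method (threshold picked automatically)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Dilation pass, as described in the pipeline above
    dilated = cv2.dilate(thresh, np.ones((2, 2), np.uint8), iterations=1)
    return Image.fromarray(dilated)

def ocr_image(image: Image.Image) -> str:
    return pytesseract.image_to_string(preprocess_image(image))

# skip_first_n_pages=2 with max_pages=5 would map to first_page=3, last_page=7 here:
pages = convert_from_path("input.pdf", first_page=3, last_page=7)
full_text = "\n".join(ocr_image(page) for page in pages)
```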

Text Processing Pipeline

  1. Chunk Creation

    • The process_document() function splits the full text into manageable chunks
    • Uses sentence boundaries for natural splits
    • Implements an overlap between chunks to maintain context (see the sketch after this list)
  2. Error Correction and Formatting

    • Core function: process_chunk()
    • Two-step process:
      a. OCR Correction:
        • Uses LLM to fix OCR-induced errors
        • Maintains original structure and content
      b. Markdown Formatting (optional):
        • Converts text to proper markdown format
        • Handles headings, lists, emphasis, and more
  3. Duplicate Content Removal

    • Implemented within the markdown formatting step
    • Identifies and removes exact or near-exact repeated paragraphs
    • Preserves unique content and ensures text flow
  4. Header and Page Number Suppression (Optional)

    • Can be configured to remove or distinctly format headers, footers, and page numbers
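
A minimal sketch of the sentence-boundary chunking with overlap; the splitting regex, chunk size, and overlap amount here are hypothetical, not the project's actual values:

```python
import re

# Hypothetical chunker: split on sentence boundaries, then carry the last few
# sentences of each chunk into the next one so the LLM keeps local context.
def chunk_text(text: str, max_chars: int = 4000, overlap_sentences: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        if current and length + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # overlap for context
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```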

LLM Integration

  1. Flexible LLM Support

    • Supports both local LLMs and cloud-based API providers (OpenAI, Anthropic)
    • Configurable through environment variables
  2. Local LLM Handling

    • Function: generate_completion_from_local_llm()
    • Uses llama_cpp library for local LLM inference
    • Supports custom grammars for structured output
  3. API-based LLM Handling

    • Functions: generate_completion_from_claude() and generate_completion_from_openai()
    • Implements proper error handling and retry logic
    • Manages token limits and adjusts request sizes dynamically
  4. Asynchronous Processing

    • Uses asyncio for concurrent processing of chunks when using API-based LLMs
    • Maintains order of processed chunks for coherent final output (see the sketch below)
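
A sketch of that pattern, with a stand-in coroutine in place of the project's real LLM call; asyncio.gather returns results in submission order, and carrying an explicit index makes reassembly robust even if you later change the scheduling:

```python
import asyncio

# process_chunk here is a stand-in for the project's LLM call; the (index, text)
# pairing shows how chunk order is preserved through concurrent execution.
async def process_chunk(index: int, chunk: str) -> tuple[int, str]:
    corrected = chunk  # placeholder for the real LLM correction step
    return index, corrected

async def process_document(chunks: list[str]) -> str:
    tasks = (process_chunk(i, c) for i, c in enumerate(chunks))
    results = await asyncio.gather(*tasks)  # gather preserves submission order
    return "".join(text for _, text in sorted(results))

print(asyncio.run(process_document(["First chunk. ", "Second chunk."])))
```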

Token Management

  1. Token Estimation

    • Function: estimate_tokens()
    • Uses model-specific tokenizers when available
    • Falls back to approximate_tokens() for quick estimation
  2. Dynamic Token Adjustment

    • Adjusts max_tokens parameter based on prompt length and model limits
    • Implements TOKEN_BUFFER and TOKEN_CUSHION for safe token management (sketched below)
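
A sketch of this budgeting logic under assumed constants; the real TOKEN_BUFFER and TOKEN_CUSHION values, and the fallback heuristic, may differ from those shown:

```python
# Assumed buffer constants; the project's actual values may differ.
TOKEN_BUFFER = 500
TOKEN_CUSHION = 300

def approximate_tokens(text: str) -> int:
    # Crude fallback when no model-specific tokenizer is available (~4 chars/token)
    return max(1, len(text) // 4)

def adjust_max_tokens(prompt: str, model_context_limit: int = 4096) -> int:
    prompt_tokens = approximate_tokens(prompt)
    available = model_context_limit - prompt_tokens - TOKEN_BUFFER
    return max(TOKEN_CUSHION, available)  # never request fewer than the cushion
```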

Quality Assessment

  1. Output Quality Evaluation
    • Function: assess_output_quality()
    • Compares original OCR text with processed output
    • Uses LLM to provide a quality score and explanation (see the sketch below)
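
A hypothetical version of the prompt such an evaluation might use; the actual wording inside assess_output_quality() may differ:

```python
# Hypothetical prompt builder for the quality check; the real wording in
# assess_output_quality() may differ.
def build_quality_prompt(raw_ocr: str, corrected: str) -> str:
    return (
        "Compare the corrected text against the raw OCR output below. "
        "Rate how much the correction improved accuracy and readability "
        "on a scale of 0-100, then explain your score in one or two sentences.\n\n"
        f"RAW OCR:\n{raw_ocr}\n\nCORRECTED:\n{corrected}\n\n"
        "Respond as:\nSCORE: <number>\nEXPLANATION: <text>"
    )
```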

Logging and Error Handling

  • Comprehensive logging throughout the codebase
  • Detailed error messages and stack traces for debugging
  • Suppresses HTTP request logs to reduce noise (see the sketch below)
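
A sketch of that logging setup; which HTTP client loggers need silencing depends on the installed SDKs, so httpx and urllib3 are assumptions here:

```python
import logging

# Verbose application logs, quieter HTTP client logs. Which client loggers
# exist depends on the installed SDKs; httpx and urllib3 are assumed here.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
for noisy_logger in ("httpx", "urllib3"):
    logging.getLogger(noisy_logger).setLevel(logging.WARNING)
```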

Configuration and Customization

The project uses a .env file for easy configuration. Key settings include:

  • LLM selection (local or API-based)
  • API provider selection
  • Model selection for different providers
  • Token limits and buffer sizes
  • Markdown formatting options

Output and File Handling

  1. Raw OCR Output: Saved as {base_name}__raw_ocr_output.txt
  2. LLM Corrected Output: Saved as {base_name}_llm_corrected.md or .txt

The script generates detailed logs of the entire process, including timing information and quality assessments.

Requirements

  • Python 3.12+
  • Tesseract OCR engine
  • pdf2image library
  • pytesseract
  • OpenAI API (optional)
  • Anthropic API (optional)
  • Local LLM support (optional, requires compatible GGUF model)

Installation

  1. Install Pyenv and Python 3.12 (if needed):
```bash
# Install Pyenv and Python 3.12 if needed, then use it to create a venv:
if ! command -v pyenv &> /dev/null; then
    sudo apt-get update
    sudo apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev \
    libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
    xz-utils tk-dev libffi-dev liblzma-dev python3-openssl git

    git clone https://github.com/pyenv/pyenv.git ~/.pyenv
    echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
    echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
    echo 'eval "$(pyenv init --path)"' >> ~/.zshrc
    source ~/.zshrc
fi
cd ~/.pyenv && git pull && cd -
pyenv install 3.12
```
  2. Set up the project:
```bash
# Use pyenv to create and activate a virtual environment:
git clone https://github.com/Dicklesworthstone/llm_aided_ocr
cd llm_aided_ocr
pyenv local 3.12
python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
```
  3. Install Tesseract OCR engine (if not already installed):

    • For Ubuntu: sudo apt-get install tesseract-ocr
    • For macOS: brew install tesseract
    • For Windows: Download and install from GitHub
  4. Set up your environment variables in a .env file:

```
USE_LOCAL_LLM=False
API_PROVIDER=OPENAI
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
```

Usage

  1. Place your PDF file in the project directory.

  2. Update the input_pdf_file_path variable in the main() function with your PDF filename.

  3. Run the script:

```bash
python llm_aided_ocr.py
```
  4. The script will generate several output files, including the final post-processed text.

How It Works

The LLM-Aided OCR project employs a multi-step process to transform raw OCR output into high-quality, readable text:

  1. PDF Conversion: Converts input PDF into images using pdf2image.

  2. OCR: Applies Tesseract OCR to extract text from images.

  3. Text Chunking: Splits the raw OCR output into manageable chunks for processing.

  4. Error Correction: Each chunk undergoes LLM-based processing to correct OCR errors and improve readability.

  5. Markdown Formatting (Optional): Reformats the corrected text into clean, consistent Markdown.

  6. Quality Assessment: An LLM-based evaluation compares the final output quality to the original OCR text.

Code Optimization

  • Concurrent Processing: When using API-based models, chunks are processed concurrently to improve speed.
  • Context Preservation: Each chunk includes a small overlap with the previous chunk to maintain context.
  • Adaptive Token Management: The system dynamically adjusts the number of tokens used for LLM requests based on input size and model constraints.

Configuration

The project uses a .env file for configuration. Key settings include (an illustrative example follows the list):

  • USE_LOCAL_LLM: Set to True to use a local LLM, False for API-based LLMs.
  • API_PROVIDER: Choose between "OPENAI" or "CLAUDE".
  • OPENAI_API_KEY, ANTHROPIC_API_KEY: API keys for respective services.
  • CLAUDE_MODEL_STRING, OPENAI_COMPLETION_MODEL: Specify the model to use for each provider.
  • LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS: Set the context size for local LLMs.
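
Putting those keys together, an illustrative .env for API-based processing might look like the following; the model names and values shown are examples only, not the project's defaults:

```
USE_LOCAL_LLM=False
API_PROVIDER=CLAUDE
ANTHROPIC_API_KEY=your_anthropic_api_key
CLAUDE_MODEL_STRING=claude-3-haiku-20240307
OPENAI_COMPLETION_MODEL=gpt-4o-mini
LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS=8192
```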

Output Files

The script generates several output files:

  1. {base_name}__raw_ocr_output.txt: Raw OCR output from Tesseract.
  2. {base_name}_llm_corrected.md: Final LLM-corrected and formatted text.

Limitations and Future Improvements

  • The system's performance is heavily dependent on the quality of the LLM used.
  • Processing very large documents can be time-consuming and may require significant computational resources.

Contributing

Contributions to this project are welcome! Please fork the repository and submit a pull request with your proposed changes.

License

This project is licensed under the MIT License.