This repository contains the official code of OHR-Bench, a benchmark designed to evaluate the cascading impact of OCR on RAG.
- PDF, gt structured data and Q&A datasets: [🤗 Hugging Face] `pdfs`, `gt_and_qas`. It includes 4000+ unstructured PDF pages from various domains, including Textbook, Law, Finance, Newspaper, Manual and Academia, and Q&A datasets sourced from multimodal document elements. Each PDF page comes with human-verified ground truth structured data.
- Perturbed data with OCR errors: [🤗 Hugging Face] `retrieval_base/formatting_noise_[mild/moderate/severe]` and `retrieval_base/semantic_noise_[mild/moderate/severe]`. To enable in-depth analysis of OCR's impact on RAG, OHR-Bench identifies Semantic Noise and Formatting Noise and introduces them at mild, moderate and severe perturbation levels based on real-world OCR errors.
- Evaluation framework: [GitHub opendatalab/OHR-Bench]. We provide a RAG evaluation framework to assess the impact of OCR-processed structured data and our perturbed data on RAG, including retrieval, generation and overall performance.
| OCR | Edit Distance ↓ | Retrieval LCS@1 ↑ | Retrieval LCS@5 ↑ | Generation EM ↑ | Generation F1 ↑ | Overall EM@1 ↑ | Overall F1@1 ↑ |
|---|---|---|---|---|---|---|---|
| Ground Truth | - | 63.53 | 86.22 | 33.54 | 50.19 | 26.42 | 39.77 |
| **Pipeline-based OCR** | | | | | | | |
| MinerU | 0.2328 | 52.53 | 73.61 | 30.50 | 46.08 | 24.52 | 36.84 |
| Marker | 0.2621 | 56.94 | 78.53 | 30.08 | 46.02 | 23.89 | 36.51 |
| DeepDoc | 0.2839 | 48.37 | 68.94 | 28.93 | 44.12 | 22.72 | 34.55 |
| **End-to-end OCR** | | | | | | | |
| GOT | 0.2884 | 45.80 | 67.06 | 26.36 | 40.62 | 21.51 | 32.69 |
| Nougat | 0.3303 | 44.77 | 61.46 | 24.81 | 37.94 | 20.40 | 30.89 |
| **Vision-Language Model for OCR** | | | | | | | |
| Qwen2-VL-72B | 0.2564 | 53.16 | 72.97 | 26.72 | 41.23 | 23.45 | 35.91 |
| InternVL2-Llama3-76B | 0.4450 | 42.43 | 57.51 | 20.74 | 32.89 | 20.58 | 31.23 |
We evaluate the suitability of current OCR solutions for real-world RAG applications by conducting comprehensive experiments with our OHR-Bench. We derive conclusions as follows:
- Pipeline-based OCR demonstrates the best performance. Marker achieves the best retrieval performance among all OCR solutions, while MinerU leads the generation and overall evaluations.
- All OCR solutions suffer performance degradation. Even the best solution shows a decrease of 1.9 in EM@1 and 2.93 in F1@1 in the overall evaluation, with greater losses in the retrieval and generation stages.
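The degradation figures quoted above can be reproduced from the results table. The following is a minimal sketch of that arithmetic, with the overall-stage scores copied from the table; the variable names are illustrative and not part of the repository.

```python
# Overall-stage scores copied from the results table.
ground_truth = {"EM@1": 26.42, "F1@1": 39.77}
mineru = {"EM@1": 24.52, "F1@1": 36.84}  # best OCR solution in the overall evaluation

# Absolute drop relative to ground truth, rounded to two decimals.
drop = {metric: round(ground_truth[metric] - mineru[metric], 2) for metric in ground_truth}
print(drop)
```

The same subtraction applied to the retrieval and generation columns gives the larger per-stage losses mentioned above.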
pip install -r requirements.txt
To evaluate your RAG system on our benchmark, follow these steps:
- Download Perturbed Data: Get the data with formatting and semantic noise from the zip file on Hugging Face and unzip it.
- Organize the Data: Place the folders `retrieval_base/formatting_noise_[mild/moderate/severe]` and `retrieval_base/semantic_noise_[mild/moderate/severe]` in the `data/retrieval_base` directory of this project.
- Run Evaluation: Follow the instructions in Run Evaluation.
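Before running the evaluation, it can help to confirm that the perturbed data landed where the steps above expect it. This is a small sketch, assuming the six noise folders sit directly under `data/retrieval_base`; the helper names are hypothetical and not part of the repository.

```python
from pathlib import Path

NOISE_TYPES = ("formatting_noise", "semantic_noise")
LEVELS = ("mild", "moderate", "severe")


def expected_noise_dirs(base="data/retrieval_base"):
    """Return the six perturbed-data folders described in the steps above."""
    return [Path(base) / f"{noise}_{level}" for noise in NOISE_TYPES for level in LEVELS]


def missing_noise_dirs(base="data/retrieval_base"):
    """List the expected folders that are not present on disk."""
    return [d for d in expected_noise_dirs(base) if not d.is_dir()]
```

Calling `missing_noise_dirs()` from the project root returns an empty list when all six folders are in place.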
To evaluate your OCR results using this benchmark:
- Organize the Data: Run your OCR models on the PDFs (available on Hugging Face) and place the OCR-processed structured data in the `data/retrieval_base` directory. Use the ground truth data (`data/retrieval_base/gt`) as an example. The sub-folder names indicate the domain of the parsed results, and each JSON file, named after its corresponding PDF file, should contain that file's parsed results.
- Run Evaluation: Follow the instructions in Run Evaluation.
Directory Structure
retrieval_base/gt/ # We provide gt and MinerU processed structured data as illustration here
├── finance # Domain
│ ├── 3M_2023Q2_10Q.json # Parsed results
│ ├── ...
├── textbook
...
OCR Processed Data
[
    {
        "page_idx": 0, // Page index
        "text": "..." // OCR processed structured data
    },
    ...
]
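As an illustration of the layout and record format above, the sketch below writes one document's OCR output to disk. It assumes page indices are zero-based and contiguous, as the example suggests, and that your results live in a folder next to `gt`; `save_parsed_pages` and the `ocr_name` folder are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def save_parsed_pages(page_texts, ocr_name, domain, pdf_stem, base="data/retrieval_base"):
    """Write one document's OCR output as a list of {"page_idx", "text"} records.

    `ocr_name` is a hypothetical folder name for your OCR system, placed next
    to `gt` under `base`; `page_texts` holds one string per page, in page order.
    """
    # Assumption: page_idx is zero-based and follows page order.
    records = [{"page_idx": i, "text": text} for i, text in enumerate(page_texts)]
    out_path = Path(base) / ocr_name / domain / f"{pdf_stem}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return out_path
```

For example, `save_parsed_pages(pages, "my_ocr", "finance", "3M_2023Q2_10Q")` would produce `data/retrieval_base/my_ocr/finance/3M_2023Q2_10Q.json`.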
The Q&A data is placed in `data/qas.json`. The file should be structured as follows:
Q&A JSON
[
    {
        "doc_name": "finance/JPMORGAN_2021Q1_10Q", // Document source
        "ID": "00073cc2-c801-467c-9039-fca63c78c6a9", // Unique ID
        "questions": "What was the total amount of nonaccrual loans retained as of March 31, 2021?",
        "answers": "842",
        "doc_type": "finance", // Q&A domain.
        "answer_form": "Numeric", // Answer format.
        "evidence_source": "table", // Evidence source.
        "evidence_context": "Nonaccrual loans retained $^{(\\mathrm{a})}$ & \\$ & 842 & \\$ & 689 & $22 \\%$", // Evidence.
        "evidence_page_no": 24
    },
    ...
]
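To get a quick overview of the Q&A set, the records can be loaded and grouped by any of the fields shown above. This is a minimal sketch, assuming the file on disk is plain JSON (the `//` comments above are annotations only); `load_qas` and `count_by` are hypothetical helpers, not part of the repository.

```python
import json
from collections import Counter


def load_qas(path="data/qas.json"):
    """Load the list of Q&A records from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_by(qas, field):
    """Count records per value of `field`, e.g. "doc_type" or "evidence_source"."""
    return Counter(qa[field] for qa in qas)
```

For instance, `count_by(load_qas(), "evidence_source")` shows how many questions rely on each evidence source.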
In `src/configs`, configure your local LLM path or GPT API.
GPT_api_key = 'Your KEY Here' # openai.api_key
...
Qwen2_7B_local_path = 'Qwen/Qwen2-7B-Instruct' # download from Hugging Face or your local path
To evaluate your OCR results, follow the instructions in the Dataset Preparation section to organize your OCR data.
# The first argument specifies which OCR results to use for evaluation.
# The second argument specifies the retrievers or LLMs.
# Args: Document source, LLM
# Generation with gt
bash shell/generation.sh gt qwen2_7b
# Generation with mild semantic noise
bash shell/generation.sh semantic_noise_mild qwen2_7b
# Args: Document source, retriever
# Retrieval with gt
bash shell/retrieval.sh gt bge-m3
# Retrieval with mild semantic noise
bash shell/retrieval.sh semantic_noise_mild bge-m3
# Args: Document source, retriever, LLM
# End-to-end with gt
bash shell/end2end.sh gt bge-m3 qwen2_7b
# End-to-end with mild semantic noise
bash shell/end2end.sh semantic_noise_mild bge-m3 qwen2_7b
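To cover every perturbation level in one go, the commands above can be generated programmatically. The sketch below assumes that each noise folder name (for example `formatting_noise_severe`) is accepted as a document source in the same way as `gt` and `semantic_noise_mild`; the helper names are hypothetical and not part of the repository.

```python
import subprocess

NOISE_TYPES = ("formatting_noise", "semantic_noise")
LEVELS = ("mild", "moderate", "severe")


def end2end_commands(retriever="bge-m3", llm="qwen2_7b"):
    """Build one end-to-end command for gt plus each noise type and level."""
    sources = ["gt"] + [f"{noise}_{level}" for noise in NOISE_TYPES for level in LEVELS]
    return [["bash", "shell/end2end.sh", source, retriever, llm] for source in sources]


def run_all(commands):
    """Run each command in order; stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Running `run_all(end2end_commands())` from the repository root would execute the seven evaluations sequentially.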
The evaluation framework is based on CRUD. Thanks so much for this brilliant project.
@article{zhang2024ocr,
title={OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation},
author={Junyuan Zhang and Qintong Zhang and Bin Wang and Linke Ouyang and Zichen Wen and Ying Li and Ka-Ho Chow and Conghui He and Wentao Zhang},
journal={arXiv preprint arXiv:2412.02592},
year={2024}
}
The PDFs are collected from public online channels and community user contributions. Content that is not allowed for distribution has been removed. The dataset is for research purposes only and not for commercial use. If there are any copyright concerns, please contact [email protected].