Comparison Test Between Surya and Pytesseract OCR Capabilities #303

Open
Gautam-Rajeev opened this issue Mar 27, 2024 · 2 comments
Gautam-Rajeev commented Mar 27, 2024

Goal

Conduct a comprehensive comparison between the OCR (Optical Character Recognition) capabilities of Surya and Pytesseract. The aim is to determine which tool performs better under various conditions and to evaluate if either tool offers unique functionalities not covered by the other. A well-curated test set will be developed to facilitate this comparison.

Description

The objective is to systematically compare Surya and Pytesseract, two leading OCR tools, to understand their strengths and weaknesses in processing different types of text. The comparison should cover various aspects such as accuracy, speed, handling of different languages, and the ability to recognize text in complex backgrounds or with various fonts and sizes. The test set should include a diverse range of images that reflect real-world use cases where OCR might be applied.

Key comparison metrics include:

  • Text recognition accuracy
  • Processing speed
  • Robustness across different image qualities
  • Support for multiple languages, with a focus on English, Hindi, and Oriya
  • Ability to recognize text in complex layouts such as tables, footnotes, and charts
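As a starting point for the accuracy metric, here is a minimal sketch of character error rate (CER) and word error rate (WER) based on Levenshtein edit distance. The function names `edit_distance`, `normalize`, `cer`, and `wer` are illustrative, not part of either tool. The sketch assumes that comparing Unicode code points after NFC normalization is acceptable; for Hindi and Oriya, comparing grapheme clusters may be a better choice and would need a different tokenizer.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from reference
                            curr[j - 1] + 1,      # insert into reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace (an assumption; adjust as needed)."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length in code points."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-separated tokens."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Both rates are 0.0 for a perfect match and can exceed 1.0 when the OCR output is much longer than the ground truth.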

Implementation Details

To effectively compare Surya and Pytesseract, the following steps will be taken:

  • Developing a Test Set: Collect and/or create a diverse set of images that include plain text, text over images, handwritten notes, and texts in various fonts and sizes. Ensure the test set covers multiple languages and text orientations.
  • Benchmarking Criteria: Define clear metrics for comparison, including accuracy (measured by character and word recognition rates), speed (time taken to process images of varying sizes), and error rates across different languages and fonts.
  • Comparative Analysis: Run both Surya and Pytesseract on the test set, documenting their performance based on the predefined criteria.
  • Functionality Check: List and compare the features and functionalities offered by both tools, noting any unique capabilities or limitations.
  • Documentation and Reporting: Compile the results into a detailed report, highlighting which tool performs better under specific conditions and providing insights into the potential use cases for each tool.
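The comparative-analysis step could be sketched as a small harness that runs each engine over the same labelled samples and records a score and the wall-clock time. Everything named here (`OcrFn`, `pytesseract_engine`, `benchmark`) is hypothetical scaffolding. The pytesseract wrapper assumes that Tesseract and the `eng`, `hin`, and `ori` language data are installed. Surya is deliberately left as a plain callable to be supplied by the implementer, since its Python API should be taken from its own documentation rather than guessed here.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Tuple

# An OCR engine is modelled as: image path -> recognized text.
OcrFn = Callable[[str], str]


def pytesseract_engine(lang: str = "eng+hin+ori") -> OcrFn:
    """Build an OCR callable backed by pytesseract (imported lazily on first use)."""
    def run(image_path: str) -> str:
        import pytesseract
        from PIL import Image
        return pytesseract.image_to_string(Image.open(image_path), lang=lang)
    return run


def benchmark(
    engines: Dict[str, OcrFn],
    samples: List[Tuple[str, str]],      # (image path, ground-truth text)
    score: Callable[[str, str], float],  # e.g. a character error rate function
) -> Dict[str, Dict[str, float]]:
    """Run every engine on every sample; report mean score and mean seconds."""
    if not samples:
        raise ValueError("samples must not be empty")
    results: Dict[str, Dict[str, float]] = {}
    for name, ocr in engines.items():
        scores, timings = [], []
        for image_path, truth in samples:
            start = time.perf_counter()
            text = ocr(image_path)
            timings.append(time.perf_counter() - start)
            scores.append(score(truth, text))
        results[name] = {
            "mean_score": mean(scores),
            "mean_seconds": mean(timings),
        }
    return results
```

A run might pass `{"pytesseract": pytesseract_engine(), "surya": my_surya_wrapper}` as `engines`, where `my_surya_wrapper` is a user-written function. Keeping engines behind one callable signature means the same samples and scoring function are applied to both tools.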

Collaboration Opportunities: This project is open for anyone to contribute. Discussions, preliminary findings, and progress updates are encouraged in the comments section. The project may be assigned based on the contribution level and the quality of insights provided.

Product Name

pdfparsing

Organization Name

Samagra

Domain

OCR / Text Recognition

Tech Skills Needed

  • Python
  • OCR technologies
  • Image processing

Category

Research and Development

Feature

PDF parsing

Mentor(s)

@ChakshuGautam

Complexity

Medium

@kabirrajsingh

Hi @ChakshuGautam, I am planning to work on this issue. Could you please confirm that the dataset should mainly focus on text in documents?
(That is, not other cases such as text extraction from an arbitrary picture.)

@Gautam-Rajeev

Correct. The focus is on extracting text from documents.
