Goal
Conduct a comprehensive comparison between the OCR (Optical Character Recognition) capabilities of Surya and Pytesseract. The aim is to determine which tool performs better under various conditions and to evaluate if either tool offers unique functionalities not covered by the other. A well-curated test set will be developed to facilitate this comparison.
Description
The objective is to systematically compare Surya and Pytesseract, two leading OCR tools, to understand their strengths and weaknesses in processing different types of text. The comparison should cover various aspects such as accuracy, speed, handling of different languages, and the ability to recognize text in complex backgrounds or with various fonts and sizes. The test set should include a diverse range of images that reflect real-world use cases where OCR might be applied.
Key comparison metrics include:
Text recognition accuracy
Processing speed
Robustness across different image qualities
Support for multiple languages, with a focus on English, Hindi, and Oriya
Ability to recognize text in complex layouts such as tables, footnotes, and charts
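As an illustration of how text recognition accuracy could be quantified, here is a minimal sketch of character error rate (CER) and word error rate (WER) based on Levenshtein edit distance. The function names `edit_distance`, `cer`, and `wer` are hypothetical helpers, not part of Surya or Pytesseract, and normalizing to Unicode NFC is an assumption made so that Hindi and Oriya text is compared consistently.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text: str) -> str:
    # Assumption: NFC normalization so equivalent Devanagari/Odia
    # sequences compare equal regardless of how they were encoded.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-separated tokens."""
    ref = _normalize(reference).split()
    hyp = _normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Lower values are better, and both rates can exceed 1.0 when the OCR output is much longer than the ground truth. Note that this sketch counts Unicode code points, so for Indic scripts a single visible character may count as several units; whether to count grapheme clusters instead is a choice the benchmark would need to make.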
Implementation Details
To effectively compare Surya and Pytesseract, the following steps will be taken:
Developing a Test Set: Collect and/or create a diverse set of images that includes plain text, text over images, handwritten notes, and text in various fonts and sizes. Ensure the test set covers multiple languages and text orientations.
Benchmarking Criteria: Define clear metrics for comparison, including accuracy (measured by character and word recognition rates), speed (time taken to process images of varying sizes), and error rates across different languages and fonts.
Comparative Analysis: Run both Surya and Pytesseract on the test set, documenting their performance based on the predefined criteria.
Functionality Check: List and compare the features and functionalities offered by both tools, noting any unique capabilities or limitations.
Documentation and Reporting: Compile the results into a detailed report, highlighting which tool performs better under specific conditions and providing insights into the potential use cases for each tool.
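The steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: each OCR tool is wrapped in a plain function that takes an image path and returns text, `run_benchmark` and `similarity` are hypothetical names, and `difflib.SequenceMatcher` is used as a stand-in accuracy score rather than a proper character or word recognition rate.

```python
import time
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Assumption: every engine is wrapped as "image path in, recognized text out".
OcrFn = Callable[[str], str]


def similarity(truth: str, predicted: str) -> float:
    """Stand-in accuracy score in [0, 1]; swap in a CER/WER-based metric."""
    return SequenceMatcher(None, truth, predicted).ratio()


def run_benchmark(
    engines: Dict[str, OcrFn],
    samples: List[Tuple[str, str]],  # (image_path, ground_truth_text)
    score: Callable[[str, str], float] = similarity,
) -> Dict[str, Dict[str, float]]:
    """Run every engine on every sample and report mean score and mean time."""
    if not samples:
        raise ValueError("samples must not be empty")
    results: Dict[str, Dict[str, float]] = {}
    for name, ocr in engines.items():
        scores, seconds = [], []
        for image_path, truth in samples:
            start = time.perf_counter()
            text = ocr(image_path)
            seconds.append(time.perf_counter() - start)
            scores.append(score(truth, text))
        results[name] = {
            "mean_score": mean(scores),
            "mean_seconds": mean(seconds),
        }
    return results


# Example wrapper (not executed here; requires pytesseract and Pillow):
#   from PIL import Image
#   import pytesseract
#   tesseract_hi = lambda p: pytesseract.image_to_string(Image.open(p), lang="hin")
# A Surya wrapper would follow the same shape; its API differs by version,
# so it is deliberately left out of this sketch.
```

Keeping the engines behind a common callable means the same test set and scoring code are applied to both tools, and per-language or per-layout breakdowns can be produced by calling `run_benchmark` on subsets of the samples.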
Collaboration Opportunities: This project is open for anyone to contribute. Discussions, preliminary findings, and progress updates are encouraged in the comments section. The project may be assigned based on the contribution level and the quality of insights provided.
Hi @ChakshuGautam. I am planning to work on this issue. Could you please confirm that we are mainly aiming for a dataset of text in the form of documents?
(That is, not considering other cases such as text extraction from an arbitrary picture.)
Product Name
pdfparsing
Organization Name
Samagra
Domain
OCR / Text Recognition
Tech Skills Needed
Category
Research and Development
Feature
PDF parsing
Mentor(s)
@ChakshuGautam
Complexity
Medium