Handle vector-image-converted text in PDFs #261
Comments
Hello,
Hi @maxmnemonic, the code used for conversion is as follows:
Could you please explain what might be causing the inconsistent results for different file types?
@deborah-drongoai, I believe this could be a different issue, but we can look into it. Are you converting from PDF? Any chance you could share the file with us?
Quick update: the issue described here, where text converted to vector images is skipped, can also be handled with forced full-page OCR, which is being prepared in this PR: feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning #290
@maxmnemonic Thank you for your quick response. Sure, I will attach the PDF file that I used with the module.
Requested feature
Our users occasionally encounter documents in which the text is drawn as vector paths instead of being stored as actual text.
Because this content is vector graphics rather than a bitmap, we do not automatically OCR such pages, and because the PDF pages contain no actual text, the conversion output is usually empty.
Alternatives
As a solution, we need to reliably detect such cases, render the affected pages into raster images, and run them through the standard OCR pipeline. The layout model output can be used for detection: if the layout model predicts text blocks that have no programmatic text associated with them, that is a good indication that the page needs OCR.
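To make the detection idea concrete, here is a minimal sketch in plain Python. It is not the project's implementation: `Box`, `TextCell`, `block_has_text`, `page_needs_ocr`, and the `min_empty_fraction` threshold are all hypothetical names and values chosen for illustration. It assumes the layout model yields bounding boxes for predicted text blocks and the PDF backend yields text cells with their own boxes, all in the same coordinate system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical type for this sketch)."""
    left: float
    top: float
    right: float
    bottom: float

    def center(self) -> Tuple[float, float]:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)

    def contains_point(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class TextCell:
    """A piece of programmatic text extracted from the PDF (hypothetical)."""
    box: Box
    text: str


def block_has_text(block: Box, cells: List[TextCell]) -> bool:
    """True if any non-blank text cell has its center inside the block."""
    for cell in cells:
        if not cell.text.strip():
            continue  # whitespace-only cells do not count as text
        cx, cy = cell.box.center()
        if block.contains_point(cx, cy):
            return True
    return False


def page_needs_ocr(
    text_blocks: List[Box],
    cells: List[TextCell],
    min_empty_fraction: float = 0.5,  # assumed threshold, needs tuning
) -> bool:
    """Flag a page when enough predicted text blocks carry no programmatic text."""
    if not text_blocks:
        return False  # nothing predicted as text, so nothing to recover
    empty = sum(1 for block in text_blocks if not block_has_text(block, cells))
    return empty / len(text_blocks) >= min_empty_fraction
```

A page flagged this way would then be rendered to a raster image and passed to the OCR pipeline. Matching on cell centers keeps the sketch short; an overlap-area criterion might be more robust for cells that straddle block borders.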