Shuffled text in native PDF #221

Open
JbIPS opened this issue Oct 1, 2024 · 3 comments

Comments

@JbIPS

JbIPS commented Oct 1, 2024

Hi,

I'm extracting data from a PDF with native text, and some rows of the table have their content shuffled, as you can see in this live example or here:
image
vs
image

I'm using Tesseract as the OCR engine, but if I understand correctly, it should not be used since the text is native. I also saw this behavior with some bold text (but not all); I don't know if it's related.

Is there a workaround? Or have I perhaps misused some parameters in my configuration?

Thank you

@TianqiWang1

I encountered similar issues when extracting tables from PDFs: the word order is reversed in some cells. Have you figured this out?

@JbIPS
Author

JbIPS commented Oct 30, 2024

I didn't. I tried a workaround with pattern matching, because my use case only needs to know whether a certain kind of substring exists, but that is harder when the words are in reverse order.

Are you using Tesseract too? I don't think it's related, but maybe I'm wrong and it's the source of the issue.
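
For anyone who only needs the same kind of presence check, here is a minimal sketch of an order-insensitive match. It is not part of the library and does not fix the extraction itself; the helper names `word_counts` and `contains_words_any_order` are hypothetical, and it assumes the words themselves come out intact and only their order is shuffled.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Return a lowercased multiset of the words in `text`, ignoring order and punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))


def contains_words_any_order(cell_text: str, phrase: str) -> bool:
    """Hypothetical helper: True if every word of `phrase` occurs in `cell_text`,
    at least as many times as in `phrase`, regardless of word order."""
    needed = word_counts(phrase)
    available = word_counts(cell_text)
    # A Counter returns 0 for missing keys, so absent words fail the check.
    return all(available[word] >= count for word, count in needed.items())
```

With this, a cell extracted as "amount Total due" still matches the phrase "total amount". The trade-off is that adjacency is ignored, so it can also match when the words appear in unrelated parts of the cell, and it does not help if characters inside a word are shuffled.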

@TianqiWang1

Yes, I'm passing in Tesseract too. But as in your case, I assumed native text extraction was actually being used.


2 participants