Shuffled text in native PDF #221

JbIPS · 2024-10-01T09:48:09Z

Hi,

I'm extracting data from a PDF with native text, and some rows of the table have their content shuffled, as you can see in this live example or here:

vs

I'm using Tesseract as the OCR engine, but if I understood correctly, it should not be used since the text is native. I also saw this behavior with some bold text (but not all); I don't know if it's related.

Is there a workaround? Maybe some misused params in my configuration?

Thank you

TianqiWang1 · 2024-10-30T17:25:31Z

I encountered similar issues when extracting tables from a PDF: the word order is reversed in some places. Have you figured this out?

JbIPS · 2024-10-30T17:31:10Z

I didn't. I tried a workaround with pattern matching because my use case only needs to know whether a certain kind of substring exists, but it's harder when the words are reversed.

Are you using Tesseract too? I don't think it's related, but maybe I'm wrong and it's the source of the issue.
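
For anyone trying the same pattern-matching workaround, here is a minimal sketch of an order-insensitive check. It assumes the extracted cell still contains the right words and only their order is wrong. The function names are hypothetical and not part of any library. It does not fix the extraction itself, and because it ignores word order it can produce false positives.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def contains_words_any_order(cell_text: str, phrase: str) -> bool:
    """Return True if every word of `phrase` occurs in `cell_text`,
    regardless of order (hypothetical helper, for illustration only)."""
    needed = Counter(tokens(phrase))
    found = Counter(tokens(cell_text))
    # Each word must appear at least as many times as in the phrase.
    return all(found[word] >= count for word, count in needed.items())
```

For example, `contains_words_any_order("due amount Total", "total amount due")` matches even though the words in the cell are reversed.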

TianqiWang1 · 2024-10-30T17:58:02Z

Yes, I'm passing in Tesseract too. But as in your case, I assumed native text extraction was actually being used.
