Recode does not merge hocr into pdf #69

Open
jcuenod opened this issue Sep 22, 2023 · 6 comments

@jcuenod

jcuenod commented Sep 22, 2023

For the sake of testing, I'm just trying to get this working with one page:

recode_pdf --from-imagestack 'doc/page_5_fixed.png' \
    --hocr-file doc/page_5_fixed.hocr \
    -o output.pdf

The output PDF does not contain the text layer from the hOCR file. The hOCR was generated with the gcv -> hocr tools.

Any idea what I might be doing wrong?

@MerlijnWajer
Collaborator

Thanks for the report. Do you have an example hOCR file as generated by the gcv->hocr tools?

@jcuenod
Author

jcuenod commented Sep 23, 2023

Sure, here are the files I was working with above. GitHub doesn't accept the .hocr file extension, so it's a .txt now.

page_5_fixed
page_5_fixed.txt

@MerlijnWajer
Collaborator

@jcuenod - I just saw the reply in my inbox, really sorry I didn't see this earlier! I'll investigate tomorrow.

@MerlijnWajer
Collaborator

MerlijnWajer commented Nov 12, 2023

I figured out the problem: the line elements sit directly in an ocr_carea, without being wrapped in an additional ocrx_block or ocr_par, so the logical elements that the parser looks for are missing.
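The structure described above can be reproduced with a small sketch. The hOCR fragment below is hand-written for illustration (it is not the attached file, and it omits the XHTML namespace and most `title` properties), and `count` is a hypothetical helper. It only shows that the two lookups used in `hocr/parse.py` find nothing when lines hang directly off an `ocr_carea`:

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment: the ocr_line is a direct child of ocr_carea,
# with no ocr_par or ocrx_block in between.
HOCR = """
<div class="ocr_page" title="bbox 0 0 1000 1400">
  <div class="ocr_carea" title="bbox 50 50 950 200">
    <span class="ocr_line" title="bbox 50 50 950 100">
      <span class="ocrx_word" title="bbox 50 50 200 100">Hello</span>
      <span class="ocrx_word" title="bbox 220 50 400 100">world</span>
    </span>
  </div>
</div>
""".strip()

page = ET.fromstring(HOCR)


def count(cls):
    """Count elements whose class attribute is exactly `cls`."""
    return len(page.findall(f'.//*[@class="{cls}"]'))


# The words exist, but there is no paragraph-level element to start from.
print("ocr_par:", count("ocr_par"))
print("ocrx_block:", count("ocrx_block"))
print("ocr_carea:", count("ocr_carea"))
print("ocrx_word:", count("ocrx_word"))
```

Running the same counts against a real hOCR file is a quick way to check whether a missing text layer has this cause.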

With this simple change to archive-hocr-tools (https://github.com/internetarchive/archive-hocr-tools) the text finding works, and I think at that point the PDF generation will work too.


diff --git a/hocr/parse.py b/hocr/parse.py
index 0b45a2c..e0d6e6b 100644
--- a/hocr/parse.py
+++ b/hocr/parse.py
@@ -314,7 +314,8 @@ def hocr_page_to_word_data_fast(hocr_page):

     has_ocrx_cinfo = 0

-    for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]'):
+    for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]') + hocr_page.findall('.//*[@class="ocr_carea"]'):

I need to take a moment to figure out whether this is the right change and to make sure it doesn't break things elsewhere. There is also some discussion of this tag in kba/hocr-spec#28.

@MerlijnWajer
Collaborator

I have been kind of busy, but the patch above will cause problems for other documents (although it might work for you), so the fix is a little more complicated. I will try to get a proper fix in place for this. It seems like more users are hitting this.
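The thread does not say what exactly breaks. One plausible failure mode, offered here purely as an assumption, is double counting: in hOCR where an `ocr_carea` already wraps an `ocr_par`, treating both as paragraph containers visits the same words twice. The sketch below uses a hand-written fragment and hypothetical helpers (`containers`, `words_seen`), and it assumes words are collected with a descendant search, which may not match how `hocr/parse.py` actually walks the tree:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with the common nesting: carea > par > line > word.
NESTED = """
<div class="ocr_page">
  <div class="ocr_carea">
    <p class="ocr_par">
      <span class="ocr_line">
        <span class="ocrx_word">Hello</span>
      </span>
    </p>
  </div>
</div>
""".strip()

page = ET.fromstring(NESTED)


def containers(classes):
    """Concatenate the findall results for each class, as the patch does."""
    found = []
    for cls in classes:
        found += page.findall(f'.//*[@class="{cls}"]')
    return found


def words_seen(classes):
    """Assumption: words are gathered by descendant search under each container."""
    return [
        word.text
        for container in containers(classes)
        for word in container.findall('.//*[@class="ocrx_word"]')
    ]


before = words_seen(["ocr_par", "ocrx_block"])
after = words_seen(["ocr_par", "ocrx_block", "ocr_carea"])
print(before)  # the word appears once
print(after)   # the word appears twice: once via ocr_par, once via ocr_carea
```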

Since I switched away from lxml, I've been running into some limitations of the XPath support in the Python standard library, so this might take a bit more trickery to get right. The good news is that I've at least added some tests in the past months, so we could add the hOCR version of your document to the tests once this is fixed, assuming that's OK with you.
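For context, `xml.etree.ElementTree` documents only a subset of XPath, without functions such as `not()` or `contains()`, so a condition like "an `ocr_carea` that contains no `ocr_par` or `ocrx_block`" cannot be written as a single expression there. A minimal sketch of doing that filtering in plain Python follows. It is not the maintainer's fix, and `has_class` and `paragraph_like` are hypothetical names introduced for illustration:

```python
import xml.etree.ElementTree as ET

PARAGRAPH_CLASSES = ("ocr_par", "ocrx_block")


def has_class(elem, cls):
    """True if `cls` is one of the space-separated tokens in the class attribute."""
    return cls in elem.get("class", "").split()


def paragraph_like(page):
    """Return ocr_par/ocrx_block elements, plus any ocr_carea containing neither."""
    result = []
    for elem in page.iter():  # document order, includes `page` itself
        if any(has_class(elem, c) for c in PARAGRAPH_CLASSES):
            result.append(elem)
        elif has_class(elem, "ocr_carea"):
            nested = any(
                has_class(desc, c)
                for desc in elem.iter()
                for c in PARAGRAPH_CLASSES
            )
            if not nested:
                result.append(elem)
    return result


# Hypothetical fragments: lines directly in a carea vs. the usual nesting.
BARE = ET.fromstring(
    '<div class="ocr_page"><div class="ocr_carea">'
    '<span class="ocr_line"><span class="ocrx_word">Hello</span></span>'
    '</div></div>'
)
NESTED = ET.fromstring(
    '<div class="ocr_page"><div class="ocr_carea"><p class="ocr_par">'
    '<span class="ocr_line"><span class="ocrx_word">Hello</span></span>'
    '</p></div></div>'
)

print([e.get("class") for e in paragraph_like(BARE)])
print([e.get("class") for e in paragraph_like(NESTED)])
```

Unlike the concatenated `findall` calls in the diff, this returns elements in document order and only falls back to `ocr_carea` when no paragraph-level element is nested inside it. Whether that ordering and fallback suit the rest of the parser is not something the thread confirms.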

@jcuenod
Author

jcuenod commented Feb 8, 2024

That's fine with me. Thanks for your work on this!
