Recode does not merge hocr into pdf #69

jcuenod · 2023-09-22T14:50:44Z

For the sake of testing, I'm just trying to get this working with one page:

recode_pdf --from-imagestack 'doc/page_5_fixed.png' \
    --hocr-file doc/page_5_fixed.hocr \
    -o output.pdf

The output pdf does not have the text layer from the hocr file. The hocr is generated by gcv -> hocr tools.

Any idea what I might be doing wrong?

MerlijnWajer · 2023-09-22T17:52:20Z

Thanks for the report. Do you have an example hOCR file as generated by the gcv -> hocr tools?

jcuenod · 2023-09-23T19:47:32Z

Sure, here are the files I was working with above. Github doesn't like the hocr file extension, so it's a txt now.

page_5_fixed.txt

MerlijnWajer · 2023-11-11T23:09:19Z

@jcuenod - I just saw the reply in my inbox, really sorry I didn't see this earlier! I'll investigate tomorrow.

MerlijnWajer · 2023-11-12T09:53:58Z

I figured out the problem: the line elements sit directly inside an ocr_carea, without being wrapped in an additional ocrx_block or ocr_par, so the logical elements the parser looks for are missing.
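To make the diagnosis concrete, here is a minimal sketch using only the Python standard library. The hOCR snippet is a hypothetical, hand-written stand-in for the reporter's file (not the actual attachment), and the helper names are made up for illustration. It shows that a search for ocr_par and ocrx_block finds nothing when the lines hang directly off an ocr_carea, even though the words are present.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal hOCR page: the ocr_line sits directly inside the
# ocr_carea, with no ocr_par or ocrx_block in between.
HOCR = """<div class="ocr_page" title="bbox 0 0 100 100">
  <div class="ocr_carea" title="bbox 0 0 100 50">
    <span class="ocr_line" title="bbox 0 0 100 20">
      <span class="ocrx_word" title="bbox 0 0 40 20">Hello</span>
      <span class="ocrx_word" title="bbox 50 0 100 20">world</span>
    </span>
  </div>
</div>"""


def count_paragraph_containers(page):
    """Count the elements that the loop in the diff below iterates over."""
    pars = page.findall('.//*[@class="ocr_par"]')
    blocks = page.findall('.//*[@class="ocrx_block"]')
    return len(pars) + len(blocks)


def count_words(page):
    """Count word elements anywhere below the page element."""
    return len(page.findall('.//*[@class="ocrx_word"]'))


page = ET.fromstring(HOCR)
n_containers = count_paragraph_containers(page)
n_words = count_words(page)
print(f"paragraph containers: {n_containers}, words: {n_words}")
```

With no paragraph containers to iterate over, a loop of that shape never reaches the words, which would be consistent with the missing text layer.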

With this simple change to archive-hocr-tools (https://github.com/internetarchive/archive-hocr-tools) the text finding works, and I think at that point the PDF generation will work too.


diff --git a/hocr/parse.py b/hocr/parse.py
index 0b45a2c..e0d6e6b 100644
--- a/hocr/parse.py
+++ b/hocr/parse.py
@@ -314,7 +314,8 @@ def hocr_page_to_word_data_fast(hocr_page):

     has_ocrx_cinfo = 0

-    for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]'):
+    for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]') + hocr_page.findall('.//*[@class="ocr_carea"]'):

I need to take a moment to figure out whether this is the right change and to make sure it doesn't break things elsewhere. There is also some discussion of this tag in kba/hocr-spec#28.

MerlijnWajer · 2024-02-01T00:17:27Z

I have been kind of busy, but the patch above will cause problems for other documents (although it might work for you), so the fix is a little more complicated. I will try to get a proper fix in place for this. It seems like more users are hitting this.

Since I have switched away from lxml I've been running into some limitations of the xpath of the python standard library, so this might take a bit more trickery to get right. The good news is that I've at least added some tests in the past months, so we could add the hOCR version of your document to the tests when this is fixed, assuming that's OK with you.
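As an illustration of the kind of trickery this can involve: the XPath subset in `xml.etree.ElementTree` has no `or`, no union operator and no `contains()`, so combined conditions have to be written in plain Python. The sketch below is not the fix that ended up in archive-hocr-tools; the function names and the sample document are hypothetical. It also rests on an assumption the thread does not confirm, namely that one way the simple patch could misbehave is by visiting the same lines twice when an ocr_carea already contains an ocr_par.

```python
import xml.etree.ElementTree as ET

PARAGRAPH_CLASSES = {"ocr_par", "ocrx_block"}


def has_class(elem, wanted):
    """True if any of elem's space-separated classes is in `wanted`."""
    return bool(set(elem.get("class", "").split()) & set(wanted))


def find_by_classes(root, wanted):
    """Return elements (root included) carrying any wanted class, in document order."""
    return [elem for elem in root.iter() if has_class(elem, wanted)]


def careas_without_paragraphs(root):
    """Return ocr_carea elements with no ocr_par/ocrx_block descendant."""
    result = []
    for carea in find_by_classes(root, {"ocr_carea"}):
        descendants = (e for e in carea.iter() if e is not carea)
        if not any(has_class(e, PARAGRAPH_CLASSES) for e in descendants):
            result.append(carea)
    return result


# Hypothetical page: block_1 wraps its line in an ocr_par, block_2 does not.
SAMPLE = """<div class="ocr_page">
  <div class="ocr_carea" id="block_1">
    <p class="ocr_par" id="par_1"><span class="ocr_line" id="line_1"/></p>
  </div>
  <div class="ocr_carea" id="block_2">
    <span class="ocr_line" id="line_2"/>
  </div>
</div>"""

page = ET.fromstring(SAMPLE)
containers = find_by_classes(page, PARAGRAPH_CLASSES) + careas_without_paragraphs(page)
container_ids = [c.get("id") for c in containers]
print(container_ids)
```

Treating only the paragraph-less ocr_carea elements as extra containers keeps each line under exactly one container in this sample, at the cost of walking the tree in Python rather than in a single XPath expression.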

jcuenod · 2024-02-08T02:53:02Z

That's fine with me. Thanks for your work on this!
