Recode does not merge hocr into pdf #69
Comments
Thanks for the report. Do you have an example hOCR file as generated by the gcv->hocr tools?
Sure, here are the files I was working with above. GitHub doesn't like the hocr file extension, so it's a txt now.
@jcuenod - I just saw the reply in my inbox, really sorry I didn't see this earlier! I'll investigate tomorrow.
I figured out the problem: the line elements are in an [...]. With this simple change to archive-hocr-tools (https://github.com/internetarchive/archive-hocr-tools) the text finding works, and I think at that point the PDF generation will work too.
I need to take a moment to figure out if this is the right change and ensure things don't break elsewhere. There is some discussion of the tag here too: kba/hocr-spec#28
I have been kind of busy, but the patch above will cause problems for other documents (although it might work for you), so the fix is a little more complicated. I will try to get a proper fix in place for this. It seems like more users are hitting this. Since I switched away from lxml I've been running into some limitations of the XPath support in the Python standard library, so this might take a bit more trickery to get right. The good news is that I've at least added some tests in the past months, so we could add the hOCR version of your document to the tests when this is fixed, assuming that's OK with you.
That's fine with me. Thanks for your work on this!
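To illustrate the standard-library XPath limitation mentioned above: `xml.etree.ElementTree` supports exact attribute matches such as `.//*[@class='ocr_line']`, but not XPath functions like `contains()`, so matching hOCR elements by class regardless of tag name has to be done by hand. The following is only a minimal sketch under that assumption, not the actual fix in archive-hocr-tools; the helper names (`find_by_class`, `local_name`) and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two line elements with the same hOCR class but
# different tag names (one <span>, one <p>).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page">
      <span class="ocr_line"><span class="ocrx_word">Hello</span></span>
      <p class="ocr_line"><span class="ocrx_word">world</span></p>
    </div>
  </body>
</html>"""


def find_by_class(root, class_name):
    """Yield elements whose class attribute contains class_name as a
    whitespace-separated token, whatever the element's tag is."""
    for elem in root.iter():
        if class_name in elem.get("class", "").split():
            yield elem


def local_name(elem):
    """Return the tag name without its '{namespace}' prefix."""
    return elem.tag.rsplit("}", 1)[-1]


root = ET.fromstring(SAMPLE)
lines = list(find_by_class(root, "ocr_line"))

for line in lines:
    # itertext() concatenates the text of the nested word elements.
    print(local_name(line), "".join(line.itertext()).strip())
```

Walking the tree with `iter()` avoids depending on a particular tag name or namespace prefix, at the cost of visiting every element once.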
For the sake of testing, I'm just trying to get this working with one page:
The output PDF does not have the text layer from the hOCR file. The hOCR is generated by the gcv -> hocr tools.
Any idea what I might be doing wrong?