Just some other errors with the current version. I can't get the current version to work with a hocr-file coming from pdftotree to get out the current searchable text from a PDF #37

rmast · 2022-01-08T11:47:55Z

I now work with an hOCR file produced by pdftotree to extract the existing searchable text from a PDF, as suggested at the bottom of this issue:
ocropus/hocr-tools#117

recode_pdf --from-imagestack './2022-01-08*.tif' --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 2022-01-08a.pdf
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in <module>
__import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
outdoc.save(outfile, deflate=True, pretty=True)
File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 220108uitvoer.pdf
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in <module>
__import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
outdoc.save(outfile, deflate=True, pretty=True)
File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

Even if I leave out the hOCR file, hoping that the searchable text already inside the input PDF will be used, there is still an error:
recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf -o 220108uitvoer.pdf
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in <module>
__import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 628, in recode
create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
for idx, hocr_page in enumerate(hocr_iter):
File "/usr/local/lib/python3.8/dist-packages/archive_hocr_tools-1.1.13-py3.8.egg/hocr/parse.py", line 42, in hocr_page_iterator
fp.seek(0)
AttributeError: 'NoneType' object has no attribute 'seek'

I anonymized the hOCR with the vim substitution :%s/>.*</span>/>bla</span>
anonymized.zip

MerlijnWajer · 2022-01-08T12:29:27Z

Thanks for the report, I'll take a look as to why that hOCR is not accepted.

Somewhat related, I've been working on my own PDF -> hOCR based on PyMuPDF text extraction (so it can stay pure Python + mupdf): https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr - not production ready though.

MerlijnWajer · 2022-01-08T20:09:33Z

It looks like it doesn't want to parse the hOCR file you shared because it does not contain the required namespace. I will see what I can do to work around this, since if I don't prefix the namespace, it will not parse documents with the namespace.

The other problem seems to be that this uses ocrx_block as opposed to ocr_par. I suppose I'll want to support both in the xpath queries, then.
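
A minimal sketch of the idea of accepting both variants, not the actual change in archive-hocr-tools: it uses the standard library's xml.etree.ElementTree instead of the library's own parser, and selects elements by their class attribute so that the presence or absence of the XHTML namespace on the tag name does not matter. The names PARAGRAPH_CLASSES and find_paragraphs and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Paragraph-level hOCR classes to accept (assumption: both are treated alike).
PARAGRAPH_CLASSES = {"ocr_par", "ocrx_block"}


def find_paragraphs(root):
    """Return all elements whose class attribute names a paragraph-level type.

    Matching on the class attribute instead of the tag name means the result
    is the same whether or not the document declares the XHTML namespace.
    """
    found = []
    for elem in root.iter():
        classes = elem.get("class", "").split()
        if PARAGRAPH_CLASSES.intersection(classes):
            found.append(elem)
    return found


# Hypothetical sample inputs: one with the namespace and ocr_par,
# one without the namespace and with ocrx_block.
WITH_NS = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="ocr_page"><p class="ocr_par">a</p></div>'
    "</body></html>"
)
WITHOUT_NS = (
    "<html><body>"
    '<div class="ocr_page"><div class="ocrx_block">b</div></div>'
    "</body></html>"
)

for sample in (WITH_NS, WITHOUT_NS):
    paragraphs = find_paragraphs(ET.fromstring(sample))
    print(len(paragraphs), paragraphs[0].get("class"))
```

An XPath-based parser would need an equivalent tolerance in its queries; the sketch only shows the matching rule.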

MerlijnWajer · 2022-01-08T20:16:30Z

I think this is fixed in internetarchive/archive-hocr-tools@6cdb14d - I will do a bit more testing before I make a release, though.

MerlijnWajer · 2022-01-08T20:16:35Z

Thanks for the report!

rmast · 2022-01-08T22:22:08Z

I think this is fixed in internetarchive/archive-hocr-tools@6cdb14d

@MerlijnWajer You point to a commit in the archive-hocr-tools repo for the solution of this issue in the archive-pdf-tools repo. Will archive-pdf-tools support hocr-files coming from different sources, or will you just make it read the text from an existing searchable pdf?

rmast · 2022-01-08T22:29:35Z

@MerlijnWajer I also made a hocr file via another route: djvu2hocr ~/Afbeeldingen/2022-01-08.djvu >220108.hocr
220108.zip
This one gives
recode_pdf --from-imagestack ~/Afbeeldingen/211115-000ga.tif --hocr-file ~/jwilk/ocrodjvu/220108.hocr --dpi 300 --bg-downsample 3 --mask-compression jbig2 -o 2022-01-08a.pdf
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in <module>
__import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 628, in recode
create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 210, in create_tess_textonly_pdf
word_data = hocr_page_to_word_data(hocr_page, font_scaler)
File "/home/robert/.local/lib/python3.8/site-packages/hocr/parse.py", line 185, in hocr_page_to_word_data
conf = int(X_WCONF_REGEX.search(word.attrib['title']).group(1).split()[0])
AttributeError: 'NoneType' object has no attribute 'group'

MerlijnWajer · 2022-01-08T22:30:48Z

archive-pdf-tools relies on archive-hocr-tools to parse hOCR files, so what I will do is:

release new archive-hocr-tools
release new archive-pdf-tools that depends on the newer archive-hocr-tools

As for your other question: keeping the text layer of an existing PDF intact is another matter. There are a few things I ultimately want to do there:

I want to support just compressing the images in a PDF without touching anything else (preserving text layers). This is not what --from-pdf currently does: it just reads one image per page and recompresses it (and it assumes you have hOCR for it).
I want to be able to create hOCR from the text layers of existing PDFs and use that both to (re)generate the text layer and as input for the MRC algorithm.

For this request, I think #28 is a better issue to comment on and discuss.

MerlijnWajer · 2022-01-08T22:32:47Z

@MerlijnWajer I also made a hocr file via another route: djvu2hocr ~/Afbeeldingen/2022-01-08.djvu >220108.hocr
220108.zip
This one gives
recode_pdf --from-imagestack ~/Afbeeldingen/211115-000ga.tif --hocr-file ~/jwilk/ocrodjvu/220108.hocr --dpi 300 --bg-downsample 3 --mask-compression jbig2 -o 2022-01-08a.pdf
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in <module>
__import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 628, in recode
create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 210, in create_tess_textonly_pdf
word_data = hocr_page_to_word_data(hocr_page, font_scaler)
File "/home/robert/.local/lib/python3.8/site-packages/hocr/parse.py", line 185, in hocr_page_to_word_data
conf = int(X_WCONF_REGEX.search(word.attrib['title']).group(1).split()[0])
AttributeError: 'NoneType' object has no attribute 'group'

Ok, I'll fix that bug as well, thanks, reopening.

rmast · 2022-01-10T22:00:12Z

x_wconf is about the confidence. You can't get that back from a searchable PDF.

rmast · 2022-01-11T20:03:12Z

When I comment out the confidence lookup in archive-hocr-tools/hocr/parse.py (line 186) and hardcode a value instead:
conf = 100 # int(X_WCONF_REGEX.search(word.attrib['title']).group(1).split()[0])

the conversion completes.

The hOCR file coming from pdftotree uses a different scale than the hOCR file coming from djvu2hocr.
The route via djvu2hocr gives an MRC PDF with the characters in the right positions.
The route via pdftotree (whose dimensions are smaller than those of the other file by a factor of 400/72 ≈ 5.556) places the hidden text only in the upper-left corner of the PDF. So the mapping of the hOCR onto the images should take this 72 dpi coordinate system of PDFs into account.
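
The conversion described above can be sketched as follows, assuming the pdftotree coordinates are in PDF points (72 per inch) and the target is the pixel grid of the scan at the given --dpi. The function name scale_bbox and the sample box are hypothetical; this is not code from recode_pdf.

```python
def scale_bbox(bbox, dpi, source_units_per_inch=72.0):
    """Scale a (x0, y0, x1, y1) box from PDF points to image pixels.

    Assumption: the input coordinates use 72 units per inch, as PDF
    user space does by default.
    """
    factor = dpi / source_units_per_inch  # 400 / 72 is roughly 5.556
    return tuple(round(value * factor) for value in bbox)


# A box given in points, mapped onto a 400 dpi scan.
print(scale_bbox((72, 36, 144, 72), dpi=400))
```

Without such a scaling step, boxes given in points all land within the top-left 72/400 of each dimension of the page image, which matches the observation that the text ends up in the upper-left corner.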

Neither file has the scan_res marker. pdftotree also doesn't order the words into readable lines, so the djvu route seems more stable.

rmast · 2022-01-11T21:08:54Z

Your pdf-to-hocr does produce a result whose text is readable when copied from the PDF, but with too many line endings. While selecting, the selected text appears a bit slanted.

rmast · 2022-01-11T21:27:37Z

The PDF produced by recode_pdf with the options from the main README is more than 3 times as big as the PDF produced by DjVuSolo3.1/DjVuToy.

MerlijnWajer · 2022-01-13T12:48:50Z

(I'll just answer in English for the benefit of other people :-) )

@rmast - sorry for the delay. I did read your messages earlier, and yes, the problem is indeed that the parser expects x_wconf to be there even though it is optional. I will fix that.
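
A minimal sketch of treating x_wconf as optional, assuming a fallback confidence of 100 as in the workaround earlier in this thread. The regular expression and the function name word_confidence are illustrative and are not the ones in archive-hocr-tools.

```python
import re

# Illustrative pattern, not the library's X_WCONF_REGEX.
X_WCONF_PATTERN = re.compile(r"x_wconf\s+(\d+(?:\.\d+)?)")


def word_confidence(title, default=100):
    """Return the x_wconf value from an hOCR title attribute.

    Falls back to `default` when the property is absent, instead of
    calling .group() on a failed match.
    """
    match = X_WCONF_PATTERN.search(title)
    if match is None:
        return default
    return int(float(match.group(1)))


print(word_confidence("bbox 10 20 30 40; x_wconf 93"))
print(word_confidence("bbox 10 20 30 40"))
```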

Regarding the text selection, it is (more or less) the same code that Tesseract uses, but it is possible that too many line endings are added in the conversion. Ideally we would not have to re-create the text layer at all, as discussed earlier.

Regarding the compressed size, if you can share the files I can look to see if something can be improved.

rmast · 2022-01-13T19:58:12Z

The files were greylevel scans of a black and white book, meant to end up only in a thresholded jbig2-image. DjVuSolo does some heuristic optimizations.

MerlijnWajer · 2022-01-22T19:16:08Z

I've released archive-hocr-tools 1.1.15 that should fix the word confidence problems as well. Please let me know if it works.

rmast · 2022-01-26T23:28:21Z

I've pulled the newest version, built it, and installed it. It now runs to completion, as it did with my dirty fix. However, I get these warnings for every page:
Deprecation: 'getImageList' removed from class 'Page' after v1.19.0 - use 'get_images'.
Deprecation: 'extractImage' removed from class 'Document' after v1.19.0 - use 'extract_image'.

MerlijnWajer · 2022-01-26T23:39:44Z

Ah, that's likely in the --from-pdf code path? I still need to give that some more attention. I'll fix that here and make sure it's fixed in the next release.
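
One way to call whichever spelling is available is sketched below. The method names come from the deprecation warnings above; the helper names are hypothetical, and this is not necessarily how the fix was made.

```python
def list_page_images(page):
    """Call page.get_images() if present, else the older getImageList()."""
    getter = getattr(page, "get_images", None)
    if getter is None:
        getter = getattr(page, "getImageList")  # spelling from before the rename
    return getter()


def extract_document_image(doc, xref):
    """Call doc.extract_image(xref) if present, else the older extractImage()."""
    extractor = getattr(doc, "extract_image", None)
    if extractor is None:
        extractor = getattr(doc, "extractImage")
    return extractor(xref)
```

Trying the new name first avoids triggering the warning on versions that still ship the old alias.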

MerlijnWajer · 2022-01-26T23:42:23Z

See 07ff850

MerlijnWajer closed this as completed Jan 8, 2022

MerlijnWajer reopened this Jan 8, 2022

MerlijnWajer closed this as completed Jan 26, 2022
