Handle vector-image-converted text in PDFs #261

maxmnemonic · 2024-11-06T15:48:21Z

Requested feature

From time to time, our users encounter documents in which the text is drawn as vector paths rather than stored as actual text.
Because the content is vector-based, we do not automatically OCR such pages, and because these PDF pages contain no programmatic text, the conversion output is usually empty.

Alternatives

As a solution, we need to reliably detect such cases, render the affected pages into raster images, and run them through the standard OCR pipeline. The layout model output can help with detection: if the layout model predicts text blocks that have no programmatic text associated with them, that is a good indication.
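The detection heuristic described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the `Box` type, the function names, and both thresholds (`min_coverage`, `min_uncovered_ratio`) are hypothetical, and the layout predictions and programmatic text cells are assumed to be available as plain bounding boxes in the same page coordinates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (left, top, right, bottom) in page coordinates."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "Box") -> float:
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return max(0.0, w) * max(0.0, h)


def uncovered_text_blocks(
    layout_text_blocks: List[Box],
    programmatic_cells: List[Box],
    min_coverage: float = 0.1,  # hypothetical threshold
) -> List[Box]:
    """Return predicted text blocks with (almost) no programmatic text under them."""
    uncovered = []
    for block in layout_text_blocks:
        if block.area() == 0.0:
            continue  # degenerate prediction, nothing to check
        # Overlapping cells may be counted twice; that only raises the
        # coverage estimate, so it errs on the side of not flagging a block.
        covered = sum(block.intersection_area(c) for c in programmatic_cells)
        if covered / block.area() < min_coverage:
            uncovered.append(block)
    return uncovered


def page_needs_ocr(
    layout_text_blocks: List[Box],
    programmatic_cells: List[Box],
    min_uncovered_ratio: float = 0.5,  # hypothetical threshold
) -> bool:
    """Flag a page when enough predicted text blocks lack programmatic text."""
    if not layout_text_blocks:
        return False
    uncovered = uncovered_text_blocks(layout_text_blocks, programmatic_cells)
    return len(uncovered) / len(layout_text_blocks) >= min_uncovered_ratio


# Usage: two predicted text blocks, only the first has a text cell under it.
blocks = [Box(0, 0, 100, 20), Box(0, 30, 100, 50)]
cells = [Box(0, 0, 100, 20)]
print(page_needs_ocr(blocks, cells))
```

A page flagged this way would then be rasterized and sent through OCR; the thresholds would need tuning on real documents.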

jmmfcoutinho · 2024-11-07T05:09:04Z

Hello,
I am facing this exact challenge, I believe.
I have a PDF file in which some titles are in some sort of "image" format that does not seem to be captured by the OCR, so they are missing from the markdown text output.
I'll try to find the file and post it here with the extraction output.

deborah-drongoai · 2024-11-07T10:03:00Z

Hi @maxmnemonic,
I am facing a similar issue, but the behavior differs between the markdown export and the text export when converting the following document.

The markdown output repeats letters within words, which makes the text wrong; I am attaching a screenshot below.

The text file does not contain the text at all; I am attaching a screenshot below.

The code used for conversion is as follows:
```python
import json
import time
from pathlib import Path

# Import paths assume the docling v2 package layout.
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption


def pdf_converter(source):
    # PyPdfium with EasyOCR
    # -----------------
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_page_images = True
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
            )
        }
    )

    # Time the conversion itself.
    start_time = time.time()
    conv_result = doc_converter.convert(source)
    elapsed = time.time() - start_time

    # _log.info(f"Document converted in {elapsed:.2f} seconds.")

    # Export results
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)
    doc_filename = conv_result.input.file.stem

    # print("body", conv_result.document.body)

    # Export Deep Search document JSON format:
    with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
        fp.write(json.dumps(conv_result.document.export_to_dict()))

    # Export Text format:
    with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_text())

    # Export Markdown format:
    with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_markdown())

    # Export Document Tags format:
    with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_document_tokens())
```

Could you please explain what might be causing the inconsistent results across the different export formats?

maxmnemonic · 2024-11-11T07:52:55Z

> Hi @maxmnemonic I am facing a similar issue but the behavior is different for markdown export and text export, while converting the following document...
> Could you please explain what might be happening for the inconsistent results for different file types.

@deborah-drongoai, I believe this could be a different issue, but we can look into it. Are you converting from a PDF? Any chance you could share it with us?

maxmnemonic · 2024-11-11T07:55:08Z

Quick update: the issue described here, where text converted to vector images is skipped, can also be handled with forced full-page OCR, which is being prepared in this PR: feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning #290
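For readers who want to try this workaround, here is a minimal sketch of how the option might be enabled. It assumes the flag ends up exposed on the OCR options under the name given in the PR title (`force_full_page_ocr`) and that `EasyOcrOptions` accepts it; the helper name `build_full_page_ocr_converter` is hypothetical.

```python
def build_full_page_ocr_converter():
    """Build a DocumentConverter that OCRs every PDF page in full."""
    # Imported lazily so this sketch can be loaded without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Assumption: the flag is available on the OCR options as named in PR #290.
    pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

Forcing full-page OCR ignores any programmatic text on the page, so it trades speed for coverage and is probably best reserved for documents known to contain vector-drawn text.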

deborah-drongoai · 2024-11-11T12:18:43Z

> > Hi @maxmnemonic I am facing a similar issue but the behavior is different for markdown export and text export, while converting the following document...
> > Could you please explain what might be happening for the inconsistent results for different file types.
>
> @deborah-drongoai, I believe this could be a different issue, but we can check it up, are you converting it from PDF, any chance you could share it with us?

@maxmnemonic Thank you for your quick response. Sure, I am attaching the PDF file I used with the module:
rajesh_1026319_20240917055115_stationerypdf_oh_merged.pdf
It would be a great help if I could get some insight into this issue. Thanks a lot.

maxmnemonic added the enhancement New feature or request label Nov 6, 2024

maxmnemonic added the priority:high label Nov 7, 2024

maxmnemonic self-assigned this Nov 7, 2024
