Handle vector-image-converted text in PDFs #261

Open
maxmnemonic opened this issue Nov 6, 2024 · 5 comments
Labels: enhancement (New feature or request), priority:high

Comments

@maxmnemonic
Contributor

maxmnemonic commented Nov 6, 2024

Requested feature

From time to time, our users encounter documents in which the text is not stored as text but drawn as vector paths that merely look like text.
Because this content is vector-based, we do not automatically OCR such pages, and because these PDF pages contain no actual text, the conversion output is usually empty.

Alternatives

As a solution, we need to reliably detect such cases, render the affected pages into raster images, and then run them through the standard OCR pipeline. We can use the layout model output for detection: if the layout model predicts text blocks that have no programmatic text associated with them, this could be a good indication.
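The detection idea above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the `Box` type, the function names, and both thresholds (`min_overlap`, `threshold`) are hypothetical, and the sketch assumes that layout-predicted text blocks and programmatic text cells are available as rectangles in the same page coordinates with a top-left origin.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""

    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)

    def intersection_area(self, other: "Box") -> float:
        width = min(self.right, other.right) - max(self.left, other.left)
        height = min(self.bottom, other.bottom) - max(self.top, other.top)
        return max(0.0, width) * max(0.0, height)


def uncovered_text_ratio(
    layout_text_boxes: Sequence[Box],
    text_cells: Sequence[Box],
    min_overlap: float = 0.1,  # assumed value, would need tuning
) -> float:
    """Fraction of predicted text blocks with (almost) no programmatic text under them."""
    if not layout_text_boxes:
        return 0.0
    uncovered = 0
    for block in layout_text_boxes:
        block_area = block.area()
        covered = sum(block.intersection_area(cell) for cell in text_cells)
        # Degenerate (zero-area) blocks are counted as uncovered.
        if block_area == 0 or covered / block_area < min_overlap:
            uncovered += 1
    return uncovered / len(layout_text_boxes)


def page_needs_ocr(
    layout_text_boxes: Sequence[Box],
    text_cells: Sequence[Box],
    threshold: float = 0.5,  # assumed value, would need tuning
) -> bool:
    """Flag the page for rasterization and OCR when enough text blocks lack text."""
    return uncovered_text_ratio(layout_text_boxes, text_cells) >= threshold
```

A page flagged by `page_needs_ocr` would then be rendered to a raster image and passed to the regular OCR pipeline. Summing intersections can overcount when text cells overlap each other, which is acceptable for a coarse heuristic like this one.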

@maxmnemonic maxmnemonic added the enhancement New feature or request label Nov 6, 2024
@jmmfcoutinho

Hello,
I am facing this exact challenge, I believe.
I have a PDF file in which some titles are in some sort of "image" format that the OCR does not seem to capture, so they are missing from the markdown text output.
I'll try to find the file and post it here with the extraction output.

@deborah-drongoai

deborah-drongoai commented Nov 7, 2024

Hi @maxmnemonic
I am facing a similar issue, but the behavior differs between the markdown export and the text export when converting the following document:
[image: source document]
In the markdown output, letters within a word are repeated, which makes the text wrong (screenshot below):
[image: markdown output]
The text file does not contain the text at all (screenshot below):
[image: text output]

The code used for conversion is as follows:
```python
import json
import time
from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption


def pdf_converter(source):
    # PyPdfium with EasyOCR
    # -----------------
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_page_images = True
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
            )
        }
    )

    # Time the conversion (elapsed seconds, currently only used for logging).
    start_time = time.time()
    conv_result = doc_converter.convert(source)
    elapsed = time.time() - start_time

    # _log.info(f"Document converted in {elapsed:.2f} seconds.")

    # Export results
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)
    doc_filename = conv_result.input.file.stem

    # print("body", conv_result.document.body)

    # Export Deep Search document JSON format:
    with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
        fp.write(json.dumps(conv_result.document.export_to_dict()))

    # Export Text format:
    with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_text())

    # Export Markdown format:
    with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_markdown())

    # Export Document Tags format:
    with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_document_tokens())
```

Could you please explain what might be causing the inconsistent results between the different export formats?

@maxmnemonic
Contributor Author

> Hi @maxmnemonic I am facing a similar issue but the behavior is different for markdown export and text export, while converting the following document...
> Could you please explain what might be happening for the inconsistent results for different file types.

@deborah-drongoai, I believe this could be a different issue, but we can look into it. Are you converting from PDF? Any chance you could share the file with us?

@maxmnemonic
Contributor Author

maxmnemonic commented Nov 11, 2024

Quick update: the issue described here, where text converted to vector images is skipped, can also be handled with forced full-page OCR, which is being prepared in this PR: feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning #290
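Once that PR is available, enabling forced full-page OCR might look roughly like the sketch below. This is an assumption-laden illustration: the `force_full_page_ocr` name is taken from the PR title, and the sketch assumes it is accepted by `EasyOcrOptions` and that the options object is attached via `pipeline_options.ocr_options`. The function name `build_full_page_ocr_converter` is hypothetical.

```python
def build_full_page_ocr_converter():
    """Build a converter that OCRs every PDF page in full (sketch, see assumptions)."""
    # Imports are deferred so this sketch can be loaded without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Assumption: parameter name as given in the title of PR #290.
    pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

With this setting every page is rasterized and OCRed regardless of whether programmatic text exists, so it trades extra processing time for not having to detect vector-drawn text first.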

@deborah-drongoai

> > Hi @maxmnemonic I am facing a similar issue but the behavior is different for markdown export and text export, while converting the following document...
> > Could you please explain what might be happening for the inconsistent results for different file types.
>
> @deborah-drongoai, I believe this could be a different issue, but we can check it up, are you converting it from PDF, any chance you could share it with us?

@maxmnemonic Thank you for your quick response. Sure, I am attaching the PDF file I used with the module:
rajesh_1026319_20240917055115_stationerypdf_oh_merged.pdf
It would be a great help if I could get some insight into this issue. Thanks a lot.
