Improve pdfminer embedded image extraction in `pdf` partitioning #3456

christinestraub · 2024-07-31T23:18:31Z

### Summary

This PR addresses an issue in the `pdfminer` library's embedded image extraction process. Previously, some extracted "images" were incorrect: they included embedded text elements, which resulted in oversized bounding boxes. This update refines the extraction process to focus on actual images, producing smaller, more accurate bounding boxes.

### Testing

PDF: [test_pdfminer_text_extraction.pdf](https://github.com/user-attachments/files/16448213/test_pdfminer_text_extraction.pdf)

```
from unstructured.partition.pdf import partition_pdf

# `strategy` is the partitioning strategy under test; it must be defined before this call.
elements = partition_pdf(
    filename="test_pdfminer_text_extraction.pdf",
    strategy=strategy,
    languages=["chi_sim"],
    analysis=True,
)
```

**Results**

- this PR

  ![page1_layout_pdfminer](https://github.com/user-attachments/assets/098e0a1f-fdad-4627-a881-cbafd71ce5a0)
  ![page1_layout_final](https://github.com/user-attachments/assets/6dc89180-36ac-424a-99de-63810ebf8958)

- `main` branch

  ![page1_layout_pdfminer](https://github.com/user-attachments/assets/8228995a-2ef1-4b76-9758-b8015c224e6d)
  ![page1_layout_final](https://github.com/user-attachments/assets/68d43d7b-7270-4f58-8360-dc76bd0df78f)
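For readers unfamiliar with the problem, here is a minimal sketch of the idea described in the summary: walk every inner object of a layout container and keep only the real images, so that text nested in a figure no longer inflates the bounding box. This is not the code in this PR. `Node`, `iter_inner_objects`, and `image_bboxes` are hypothetical names, and `Node` is a stand-in for pdfminer layout objects (in `pdfminer.six`, an `LTFigure` is a container that can hold `LTImage` objects alongside text objects).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Node:
    """Hypothetical stand-in for a pdfminer layout object."""

    kind: str  # e.g. "figure", "image", "text"
    bbox: BBox
    children: List["Node"] = field(default_factory=list)


def iter_inner_objects(node: Node) -> Iterator[Node]:
    """Yield the node itself and every descendant, depth first."""
    yield node
    for child in node.children:
        yield from iter_inner_objects(child)


def image_bboxes(root: Node) -> List[BBox]:
    """Collect bounding boxes of actual images, skipping nested text."""
    return [n.bbox for n in iter_inner_objects(root) if n.kind == "image"]


# A figure spanning the whole page that wraps one text run and one real image.
figure = Node(
    "figure",
    (0, 0, 600, 800),
    [
        Node("text", (50, 700, 550, 750)),
        Node("image", (100, 100, 300, 250)),
    ],
)

print(image_bboxes(figure))  # [(100, 100, 300, 250)]
```

Treating the outer figure as the image would report the page-sized box `(0, 0, 600, 800)`; descending into its children yields only the smaller box of the actual image.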

christinestraub added 4 commits July 31, 2024 09:12

feat: add functionality to extract all inner objects

4f298c7

refactor: remove unused get_images_from_pdf_element

fa53e53

feat: update ImageTextRegion extraction

9cd9254

chore: bump version

be9141d

christinestraub changed the title ~~fix: pdfminer extraction in pdf partitioning~~ Improve pdfminer embedded image extractionin pdf partitioning Aug 1, 2024

christinestraub changed the title ~~Improve pdfminer embedded image extractionin pdf partitioning~~ Improve pdfminer embedded image extraction in pdf partitioning Aug 1, 2024

christinestraub marked this pull request as ready for review August 1, 2024 00:27

christinestraub requested review from cragwolfe and MthwRobinson August 1, 2024 00:27

chore: update changelog

109e1c1

cragwolfe approved these changes Aug 1, 2024

christinestraub mentioned this pull request Aug 1, 2024

bugfix: Recursively extracts text from a pdfminer layout object #3445

Closed