Improve pdfminer embedded image extraction in pdf partitioning #3456

Merged
merged 5 commits into from
Aug 1, 2024

christinestraub (Collaborator) commented Jul 31, 2024

Summary

This PR fixes an issue in how embedded images are extracted with the pdfminer library during PDF partitioning. Previously, some extracted "images" were not images at all: they included embedded text elements, which produced oversized bounding boxes. This update refines the extraction so that only actual images are kept, with smaller, more accurate bounding boxes.
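To illustrate the idea, here is a minimal sketch of collecting bounding boxes only from image leaves of a layout tree, instead of taking the box of the enclosing container. It is not the code from this PR. `Node` and `iter_image_bboxes` are hypothetical stand-ins, loosely modeled on pdfminer's `LTFigure`, `LTImage`, and `LTChar` layout objects, so the example runs without any PDF or third-party dependency.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Node:
    """Hypothetical stand-in for a pdfminer layout object (not the real API)."""

    kind: str  # "figure" (container), "image", or "char" (text)
    bbox: BBox
    children: List["Node"] = field(default_factory=list)


def iter_image_bboxes(node: Node) -> Iterator[BBox]:
    """Yield boxes of image leaves only, descending into figure containers."""
    if node.kind == "image":
        yield node.bbox
    elif node.kind == "figure":
        # Do not report the figure's own (possibly page-sized) box;
        # look inside it for the actual images instead.
        for child in node.children:
            yield from iter_image_bboxes(child)
    # Any other kind (e.g. "char") is text and is skipped.


# A page-sized figure wrapping one small image plus some embedded text.
figure = Node(
    "figure",
    (0, 0, 600, 800),
    [
        Node("image", (50, 600, 250, 750)),
        Node("char", (50, 100, 60, 112)),
        Node("figure", (300, 300, 500, 500), [Node("image", (310, 310, 400, 400))]),
    ],
)

image_boxes = list(iter_image_bboxes(figure))
```

Taking the outer figure's box here would cover the whole page and the text inside it, while walking down to the image leaves gives two small boxes that match the pictures themselves.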

Testing

PDF: test_pdfminer_text_extraction.pdf (https://github.com/user-attachments/files/16448213/test_pdfminer_text_extraction.pdf)

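from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="test_pdfminer_text_extraction.pdf",  # the PDF attached above
    strategy=strategy,  # value not stated in this PR
    languages=["chi_sim"],
    analysis=True,
)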

Results

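  • this PR
    page1_layout_pdfminer: https://github.com/user-attachments/assets/098e0a1f-fdad-4627-a881-cbafd71ce5a0
    page1_layout_final: https://github.com/user-attachments/assets/6dc89180-36ac-424a-99de-63810ebf8958
  • main branch
    page1_layout_pdfminer: https://github.com/user-attachments/assets/8228995a-2ef1-4b76-9758-b8015c224e6d
    page1_layout_final: https://github.com/user-attachments/assets/68d43d7b-7270-4f58-8360-dc76bd0df78f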

@christinestraub christinestraub changed the title fix: pdfminer extraction in pdf partitioning Improve pdfminer embedded image extractionin pdf partitioning Aug 1, 2024
@christinestraub christinestraub changed the title Improve pdfminer embedded image extractionin pdf partitioning Improve pdfminer embedded image extraction in pdf partitioning Aug 1, 2024
@christinestraub christinestraub marked this pull request as ready for review August 1, 2024 00:27
@christinestraub christinestraub added this pull request to the merge queue Aug 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 1, 2024
@christinestraub christinestraub added this pull request to the merge queue Aug 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 1, 2024
@christinestraub christinestraub added this pull request to the merge queue Aug 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 1, 2024
@christinestraub christinestraub added this pull request to the merge queue Aug 1, 2024
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 1, 2024
@christinestraub christinestraub added this pull request to the merge queue Aug 1, 2024
Merged via the queue into main with commit 0f05718 Aug 1, 2024
51 checks passed
@christinestraub christinestraub deleted the fix/pdf-pdfminer-extraction branch August 1, 2024 17:32