ENH: add decode_as_image() to ContentStreams #2615

Merged: 10 commits, Jun 9, 2024

Conversation

pubpub-zz
Collaborator

closes #2613

codecov bot commented May 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.14%. Comparing base (3c9f449) to head (604e2b8).
Report is 57 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2615   +/-   ##
=======================================
  Coverage   95.13%   95.14%           
=======================================
  Files          51       51           
  Lines        8538     8547    +9     
  Branches     1702     1703    +1     
=======================================
+ Hits         8123     8132    +9     
  Misses        261      261           
  Partials      154      154           


@stefan6419846
Collaborator

Should we really expect the users to basically call decode_image on every object with arbitrary nesting as there might be a "hidden" image somewhere? This feels rather strange.

Additionally, what happens when it is no image? We log a warning, but is there an exception as well due to invalid image data? If yes, why both?

@pubpub-zz
Collaborator Author

> Should we really expect the users to basically call decode_image on every object with arbitrary nesting as there might be a "hidden" image somewhere? This feels rather strange.

Why strange? This offers a way to get the image from a stream where images are present but not part of .images (such as images used in a pattern, as in B2.pdf, but also images in annotations).

> Additionally, what happens when it is no image? We log a warning, but is there an exception as well due to invalid image data? If yes, why both?

I thought about this, and my concern is that this may hide some actual issues. I've completed the annotation.

@stefan6419846
Collaborator

I am still not sure whether we can really expect the user to examine every content stream for a possible image. Personally, I would prefer a clean solution, thus I am going to leave this PR open for further discussion.

@pubpub-zz
Collaborator Author

I've quickly reviewed the PDF 1.7 spec, and there are many objects that are not part of the current .images[]. Within pages I've found thumbnails, alternate images, patterns (the current case), and possibly mask images (as independent images). There are also some images not stored in pages: thumbnails within linearized documents, and images within annotations (such as stamps, where images are stored within [/AP][/N][/Resources][/XObject]).
I may also have missed some elements.

At the very least, providing a function that eases image extraction for other developers should be an improvement.
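
To make the "arbitrary nesting" concern concrete, here is a minimal sketch of a walker that finds every dictionary whose /Subtype is /Image, wherever it sits. It is not part of this PR: `find_image_paths` is a hypothetical helper, and plain Python dicts and lists stand in for PDF objects. With real pypdf objects, indirect references would first have to be resolved (for example with `get_object()`).

```python
from typing import Any, Iterator, Optional, Set, Tuple

Path = Tuple[Any, ...]


def find_image_paths(
    obj: Any, path: Path = (), _seen: Optional[Set[int]] = None
) -> Iterator[Path]:
    """Yield the key path to every dictionary whose /Subtype is /Image.

    Hypothetical helper: works on plain dicts/lists standing in for PDF objects.
    """
    if _seen is None:
        _seen = set()
    if isinstance(obj, dict):
        if id(obj) in _seen:  # PDF object graphs can contain cycles (e.g. /P, /Parent)
            return
        _seen.add(id(obj))
        if obj.get("/Subtype") == "/Image":
            yield path
        for key, value in obj.items():
            yield from find_image_paths(value, path + (key,), _seen)
    elif isinstance(obj, (list, tuple)):
        if id(obj) in _seen:
            return
        _seen.add(id(obj))
        for index, item in enumerate(obj):
            yield from find_image_paths(item, path + (index,), _seen)


# Illustrative page: one regular image, one inside a pattern, one inside a stamp.
page = {
    "/Type": "/Page",
    "/Resources": {
        "/XObject": {"/Im0": {"/Subtype": "/Image"}},
        "/Pattern": {
            "/P0": {"/Resources": {"/XObject": {"/Im1": {"/Subtype": "/Image"}}}}
        },
    },
    "/Annots": [
        {
            "/Subtype": "/Stamp",
            "/AP": {"/N": {"/Resources": {"/XObject": {"/Im2": {"/Subtype": "/Image"}}}}},
        }
    ],
}
page["/Annots"][0]["/P"] = page  # back-reference to the page creates a cycle

paths = list(find_image_paths(page))
for p in paths:
    print(p)
```

Each path found this way could then be followed on the real objects and the resulting stream passed to `decode_as_image()`. Whether such a walk belongs in the library or in user code is exactly the open question in this thread.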

@stefan6419846
Collaborator

In this case, could you please fix the merge conflicts and add some basic example to the docs?

@pubpub-zz
Collaborator Author

Test document for the example in the documentation:
test_stamp.pdf
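
A hedged sketch of what such an example could look like for a stamp annotation. It assumes the images sit under [/AP][/N][/Resources][/XObject] as described earlier in this thread and that pypdf 4.3.0 or later is installed. The helper names `decode_annotation_images` and `extract_stamp_images` are hypothetical, and the actual example in the pypdf documentation may differ.

```python
from typing import Any, Dict


def decode_annotation_images(page: Any) -> Dict[str, Any]:
    """Decode image XObjects found in the normal appearance of each annotation.

    `page` is expected to behave like a pypdf page object (a mapping whose
    array items offer get_object()). Hypothetical helper, not pypdf API.
    """
    images: Dict[str, Any] = {}
    if "/Annots" not in page:
        return images
    for index, annot_ref in enumerate(page["/Annots"]):
        annot = annot_ref.get_object()  # array items may be indirect references
        try:
            xobjects = annot["/AP"]["/N"]["/Resources"]["/XObject"]
        except KeyError:
            continue  # no appearance stream, or no XObject resources
        for name in xobjects:
            xobj = xobjects[name]
            # Check the subtype first so decode_as_image() is only called on images.
            if xobj.get("/Subtype") == "/Image":
                images[f"annot{index}{name}"] = xobj.decode_as_image()
    return images


def extract_stamp_images(pdf_path: str) -> Dict[str, Any]:
    """Run the helper on the first page of a PDF, e.g. test_stamp.pdf."""
    from pypdf import PdfReader  # imported lazily; needs pypdf >= 4.3.0

    reader = PdfReader(pdf_path)
    return decode_annotation_images(reader.pages[0])
```

Calling `extract_stamp_images("test_stamp.pdf")` would return a dict mapping a label such as `annot0/Im4` (the resource name depends on the file) to whatever `decode_as_image()` returns for that stream. Filtering on /Subtype before decoding sidesteps the warning discussed above for non-image streams.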

stefan6419846 merged commit 26d1615 into py-pdf:main on Jun 9, 2024
16 checks passed
stefan6419846 added a commit that referenced this pull request Jun 23, 2024
## What's new

### New Features (ENH)
- Accept ETen-B5 and UniCNS-UTF16 encodings (#2721) by @pubpub-zz
- Add decode_as_image() to ContentStreams (#2615) by @pubpub-zz
- context manager for PdfReader (#2666) by @tibor-reiss
- Add capability to set font and size in fields (#2636) by @pubpub-zz
- Allow to pass input file without named argument (#2576) by @pubpub-zz

### Bug Fixes (BUG)
- Fix deprecation for Ressources when using old constants (#2705) by @stefan6419846
- Fix images issue 4 bits encoding and LUT starting with UTF16_BOM (#2675) by @pubpub-zz
- Reading large compressed images takes huge time to process (#2644) by @snanda85
- Highlighted Text Cannot Be Printed (#2604) by @Nifury
- Fix UnboundLocalError on malformed pdf (#2619) by @farjasju

### Documentation (DOC)
- Various improvements on docstrings and examples by @j-t-1

### Robustness (ROB)
- Cope with missing Standard 14 fonts in fields (#2677) by @pubpub-zz
- Improve inline image extraction (#2622) by @pubpub-zz
- Cope with loops in Fields tree (#2656) by @pubpub-zz
- Discard /I in choice fields for compatibility with Acrobat (#2614) by @pubpub-zz
- Cope with some issues in pillow (#2595) by @pubpub-zz
- Cope with some image extraction issues (#2591) by @pubpub-zz

### Maintenance (MAINT)
- Deprecate interiour_color with replacement interior_color (#2706) by @j-t-1
- Add deprecate_with_replacement to PdfWriter.find_bookmark (#2674) by @j-t-1

### Code Style (STY)
- Change Link to be a non-markup annotation (#2714) by @j-t-1

[Full Changelog](4.2.0...4.3.0)
Development

Successfully merging this pull request may close these issues.

Images contained in objects of type "/Pattern" are not retrieved
2 participants