Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pypdfium2 rendering backend #384

Closed
wants to merge 10 commits into from
Closed

Conversation

mara004
Copy link

@mara004 mara004 commented Jun 23, 2023

No description provided.

setup.py Outdated
"pypdfium2>=4,<5",
"pillow",
"ghostscript>=0.7", # deprecate?
"pdftopng>=0.2.3", # deprecate?
Copy link
Author

@mara004 mara004 Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I was not able to get pdftopng installed locally with Python 3.11

@KOLANICH
Copy link
Contributor

"pypdfium2>=4,<5",

< restrictions are considered harmful. Unfortunately pip devs have introduced them and are not going to drop their support or even to introduce a mode to ignore them, so I have even had to create a tool undoing such kind of sabotage from wheels.

@mara004
Copy link
Author

mara004 commented Jun 24, 2023

@KOLANICH Uh, you seem to share some unorthodox opinions on packaging (first lebedov/python-pdfbox#10 (comment), now this).

< bounds are not generally harmful; they are merely protective when the next major release brings API-breaking changes, to ensure the project remains installable/working by default. Bumping to a new major version should be a deliberate decision after review the code is still compatible. See also mindee/doctr#947.

I don't currently have a mind of changing the APIs used in this case, but won't promise anything. If the camelot projects wants unsafe bounds in the hope that it will continue to work without interaction, then be it. But I don't encourage this and am not responsible for any breakage that might arise.

@KOLANICH
Copy link
Contributor

remains installable by default.

It will cause version conflicts and maybe even downgrades.

I don't currently have a mind of changing the APIs used in this case, but won't promise anything.

Let's deal with it when the changes in API will actually break the package. By making this package compatible to both APIs of course. But intentionally breaking package (by specifying < specifier) prematurely is IMHO not a rational thing. I prefer to optimistically assumme that everything is compatible to any further version and treat the incompatibity to the latest available version as a grave bug if that version was released and as a serious bug if that version just landed into master.

am not responsible for any breakage that might arise.

camelot is not maintained by you, so it is not up to you to fix its bugs

@mara004
Copy link
Author

mara004 commented Jun 24, 2023

Hmm yes, these aspects need to be considered. Conflict handling is a problem indeed.
On one hand, upper bounds are good to install the most compatible version if there's no conflict.
On the other, it would be bad if pip aborted on conflict. I'm not 100% sure what it does - I think it downgrades, though.

Apart from that, one can still use virtual envs to tackle hard conflicts.

I think you may be right, in this particular case it could be better to remove the upper bounds, as camelot still has fallback logic here (however, the fallbacks aren't very reliable, since they depend on additional system programs). I'll await a maintainer's opinion before changing it, anyway.


remains installable by default.

It will cause version conflicts and maybe even downgrades.

To be more precise: "so the code remains working by default".

@mara004 mara004 changed the title Add pypdfium2 rendering backend (experimental patch) Add pypdfium2 rendering backend Jul 1, 2023
@mara004
Copy link
Author

mara004 commented Jul 15, 2023

CI currently fails at make install, which is unrelated to this PR's changes. Looks like #387 would fix this.

@foarsitter
Copy link
Contributor

@mara004 by any change, do you have time to rebase this PR on the master branch? Quite a lot has changed in #353.

@mara004
Copy link
Author

mara004 commented Sep 24, 2023

I may be able to check. Supposedly it won't be a complicated rebase, as the changes from this PR are relatively small.

Side-note: Checkout of this repository took unusually long for me, even with --depth 1. IIRC this was much less so when I last visited.

@mara004
Copy link
Author

mara004 commented Sep 24, 2023

I merged in master and fixed the conflicts.

pyproject.toml Show resolved Hide resolved
Fails with `PermissionError: [WinError 32] The process cannot access the
file because it is being used by another process`

That seems to be an issue with camelot rsp. its test suite, not
pypdfium2.
In case the `try/except` captures something other than a classical
ModuleNotFoundError, we want to know what happened (e.g. library load
error)
@mara004
Copy link
Author

mara004 commented Sep 29, 2023

@foarsitter Who is in charge of reviewing/merging PRs here? I'm not keen on having to do another rebase, frankly...

@foarsitter
Copy link
Contributor

@mara004 the project has no lead.

Can you write some brief documentation about why pdfium is the go-to backend?

@mara004
Copy link
Author

mara004 commented Oct 2, 2023

@mara004 the project has no lead.

I meant, who has commit access and would be able to merge this PR?

Can you write some brief documentation about why pdfium is the go-to backend?

Where should I add that? Is a code comment in pdfium_backend.py sufficient?

How about something like this:
pdfium appears liberal-licensed, stable and fast. pypdfium2 bridges pdfium to python using ctypes, regularly building standalone wheels for all major platforms. It does not require additional system packages on runtime.

Notes on possible alternatives:1
pdfbox and pdfjs are the only other liberal-licensed PDF rendering engines known to the author, but they are not properly bridged to python yet and would need additional runtime envs.
Technically, pymupdf seems better than pypdfium2, but has a restrictive license (AGPL).
poppler and ghostscript are slower than pdfium, under restrictive licenses and lack pre-built binaries, or need subprocess/pipe communication.

Footnotes

  1. Does not need to be added to the docs, but FYI

@foarsitter
Copy link
Contributor

Thanks for helping me placing this PR in a broader perspective. A comment would be sufficient.

I will open an issue to discuss the three backends camelot currently offers and maybe we should drop poppler in favor of pdfium.

@bosd
Copy link

bosd commented Aug 29, 2024

@mara004 Thanks for this very interesting pr.

As @foarsitter pointed out, this project has no lead. And this repo has been unmaintained #343 for some while now.
We try to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that repo so that we can merge your improvement?

@mara004
Copy link
Author

mara004 commented Aug 29, 2024

Thanks for the update @bosd.

I'd be happy with this change landing in pypdf_table_extraction, but am not interested in doing the transfer personally and going through another PR process.

So, if the maintainers of pypdf_table_extraction want this patch, they may transfer it on their own.
You can probably do something like

wget https://patch-diff.githubusercontent.com/raw/camelot-dev/camelot/pull/384.diff
git apply 384.diff

and then fix merge conflicts, if any.

@mara004
Copy link
Author

mara004 commented Aug 29, 2024

Given that camelot is unmaintained, I'll close this PR here, but you may still use the patch of course.

@mara004 mara004 closed this Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants