Drop Poppler and switch to pdfium2 Backend? #89
Comments
(Wisenheimer: There is no such thing as |
As long as the Poppler backend does not require much attention for maintenance, I would argue to keep it. |
Well, camelot's poppler wrapper |
Actually, I should have named the topic better. Ghostscript is the one that should be deprecated (less permissive license, and less performant).
Indeed, which is likely the reason it got removed in But according to some benchmarks I found on Stack Overflow, pdf2image is slower than Since pdfium2 performs better than Evaluating the options. |
Regarding licenses: I'm not sure whether copyleft would actually apply if the GPL-licensed program is invoked via subprocess. IANAL and it's probably a legal grey zone, but this might be considered mere "aggregation", where the aggregate may contain both GPL-licensed and GPL-incompatible components. What I don't know is if this also applies to the AGPL, or if it closes this loophole. |
Let's put the usual disclaimer first: IANAL. Usually, the copyleft effect does not affect subprocess calls (unless they are "exchanging complex internal data structures") and thus it should be safe. Speaking of the AGPL, the main difference is that it fixes the SaaS (Software as a Service) loophole by marking remote access as distribution, while the GPL does not require the source code to be provided as long as we are speaking of SaaS - the loophole you mention still exists. If not before, at least at runtime, Ghostscript will expose the copyleft effect, marking any direct C level bindings to Ghostscript unusable if we want to avoid the hard copyleft effect of both GPL-3.0-or-later (Python bindings) and AGPL-3.0-or-later (Ghostscript itself, linked through ctypes). Directly linking to poppler code through C wrappers can also become an issue if we want to avoid the strong copyleft effect, as it is subject to GPL-2.0-or-later as well. From my experience, each rendering tool has its own limitations - depending on the file itself, some tools might be suited better for a specific type of PDF files. And even poppler ships two different renderers: The old pdftoppm and the new pdftocairo. Both are supported by pdf2image through the |
Thanks for the explanation @stefan6419846! So basically, this project needs to decide whether to use subprocess for licensing advantages, or native bindings for performance advantages, and whether to add/remove any backend. I'd suggest to keep poppler and ghostscript backends for diversity, but to replace the |
About the licensing (IANAL): I started down this path after this comment. I doubt his conclusion of:
Are we distributing it? As stated over in the other repo:
If the AGPL license of a large company is a problem. So maybe we can keep Ghostscript in as a fallback. Backends stuff
I agree, but also a bit on the balance end. All the backends need to be maintained. A good middle road would be a solution where we have a simple installation process with good performance. How about this proposal:
The goal of having pdf_table_extraction work easily OOTB would be satisfied by a pypdfium2 backend. Interestingly, there are some PRs by @mara004 which, with a little bit of work, could be merged here to achieve points 1 and 2. Edit: This is basically the same as Mara's suggestion 😄 Footnotes |
camelot/pypdf-table-extraction could in theory have any license. But as soon as any C level bindings to Ghostscript are required for installing it, building commercial applications on top of it becomes hard. While we could argue that this is a runtime-only issue, having a package without being able to use it due to the risk of having to make all (internal) business logic available under the terms of the AGPL will limit the usage options of pypdf-table-extraction. In some cases, Ghostscript might already be pre-installed on the system, making the issue less obvious.
While this might be an option for larger companies, smaller companies might have or tend to have issues with commercial licenses as in this case.
As long as Ghostscript is optional and/or subprocess calls are used instead of native wrappers, this would probably be the preferred approach. |
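To make the subprocess approach discussed above concrete, here is a minimal sketch of rendering a single page through poppler's `pdftoppm` command-line tool rather than through native bindings. The function names `build_pdftoppm_command` and `render_page` are hypothetical and not part of this project; the flags used (`-png`, `-r`, `-f`, `-l`, `-singlefile`) are standard `pdftoppm` options. The external tool is treated as optional and checked for at call time.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, page=1, dpi=300):
    """Return the argv list that renders one page to PNG with pdftoppm.

    Hypothetical helper for illustration; not an API of this project.
    """
    return [
        "pdftoppm",
        "-png",              # write PNG output
        "-r", str(dpi),      # resolution in DPI
        "-f", str(page),     # first page to render
        "-l", str(page),     # last page to render (same page)
        "-singlefile",       # do not append a page number to the file name
        str(pdf_path),
        str(out_prefix),
    ]


def render_page(pdf_path, out_prefix, page=1, dpi=300):
    """Render one page via a subprocess call and return the PNG path."""
    # The tool is an optional external dependency, so check for it first.
    if shutil.which("pdftoppm") is None:
        raise RuntimeError(
            "pdftoppm not found on PATH; install poppler or choose another backend"
        )
    subprocess.run(
        build_pdftoppm_command(pdf_path, out_prefix, page, dpi),
        check=True,  # raise CalledProcessError if rendering fails
    )
    # With -singlefile and -png, the output is written to "<out_prefix>.png".
    return Path(f"{out_prefix}.png")
```

Because the renderer is only invoked as a separate process, nothing is linked into the Python package itself. Whether that is sufficient from a licensing standpoint is the open question debated in this thread.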
Shall I be the first to mention that dropping poppler for pypdfium2 means that the repo will have both pypdfium2 and pypdf as dependencies? Is the intention to drop pypdf as well? What will become of the name "pypdf_table_extraction"? 👀 |
The same goes when keeping poppler, you need multiple PDF libraries in either case, because pypdf/pdfminer.six don't do PDF rendering. |
I was mistaken, I missed the fact the library is built on pdfminer.six not pypdf. I see, the large amount of pdfminer.six specific functionality makes it difficult to switch backends. |
I've been using pyvips as the default backend for quite some time now. Two advantages that we get:
On top of that, it is quite easy to install and is actively and well maintained.
That is indeed a very interesting replacement for the current (broken/deprecated) poppler backend. (The PR for pdfium can be expected soon. First I was focusing on refactors to let flake8 pass.) |
IIRC, libvips can be built with either poppler or (preferably) pdfium. See their Readme:
I'm not sure if compile-time either-or pdf library wrappers are the way to go, if the ability to select different PDF libraries at runtime is desired? That said, vips of course looks like a great image processing library. |
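For contrast with a compile-time either-or choice, the following is a rough sketch of what runtime backend selection could look like: a small registry that maps a backend name to a rendering function. All names here (`register_backend`, `get_backend`, the placeholder renderers) are hypothetical and only illustrate the pattern; the renderers return strings instead of doing any real rendering.

```python
from typing import Callable, Dict

# A renderer takes (pdf_path, page_number) and returns a result.
# Here the result is a plain string, purely as a placeholder.
RenderFn = Callable[[str, int], str]

_BACKENDS: Dict[str, RenderFn] = {}


def register_backend(name: str):
    """Decorator that registers a rendering function under a name."""
    def decorator(fn: RenderFn) -> RenderFn:
        _BACKENDS[name] = fn
        return fn
    return decorator


def get_backend(name: str) -> RenderFn:
    """Look up a backend at runtime; fail with a clear message if unknown."""
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"Unknown backend {name!r}; available: {sorted(_BACKENDS)}"
        ) from None


@register_backend("pdfium")
def _render_pdfium(pdf_path: str, page: int) -> str:
    # Placeholder: a real implementation would call into a PDF library here.
    return f"pdfium:{pdf_path}:{page}"


@register_backend("poppler")
def _render_poppler(pdf_path: str, page: int) -> str:
    # Placeholder: a real implementation would invoke poppler here.
    return f"poppler:{pdf_path}:{page}"
```

A caller would then pick the backend by name, for example `get_backend("pdfium")("doc.pdf", 1)`, so adding or dropping a backend does not change the calling code.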
Should we drop Poppler and switch to pdfium2 Backend for the next major release?
camelot-dev#384