Drop Poppler and switch to pdfium2 Backend? #89

Open
bosd opened this issue Aug 28, 2024 · 15 comments
Labels
breaking Breaking Changes enhancement New feature or request help wanted Extra attention is needed question Further information is requested

Comments

@bosd
Collaborator

bosd commented Aug 28, 2024

Should we drop Poppler and switch to pdfium2 Backend for the next major release?

camelot-dev#384

@bosd bosd added enhancement New feature or request help wanted Extra attention is needed question Further information is requested breaking Breaking Changes labels Aug 28, 2024
@bosd bosd mentioned this issue Aug 28, 2024
25 tasks
@bosd bosd pinned this issue Aug 29, 2024
@mara004

mara004 commented Aug 29, 2024

(Wisenheimer: There is no such thing as pdfium2. The binding is called pypdfium2 (successor of pypdfium), but the C library is just pdfium. But otherwise, yes, I'd appreciate a pypdfium2 backend.)

@stefan6419846

As long as the Poppler backend does not require much attention for maintenance, I would argue to keep it.

@mara004

mara004 commented Aug 30, 2024

Well, camelot's poppler wrapper pdftopng is unmaintained and fraught with installation issues.
Also, pdftopng is called via subprocess, which adds overhead. If poppler, then a native binding would be better.
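To make the subprocess point concrete, here is a minimal sketch of what a subprocess-based poppler call looks like. It assumes poppler's `pdftoppm` command-line tool is on the PATH; it is not camelot's actual pdftopng wrapper, and the function names are hypothetical. Every call spawns a fresh process, which is where the overhead comes from.

```python
import subprocess


def build_pdftoppm_cmd(pdf_path, out_prefix, page=1, dpi=300):
    """Build a pdftoppm command that renders one page to <out_prefix>.png."""
    return [
        "pdftoppm",
        "-png",           # write PNG instead of the default PPM
        "-r", str(dpi),   # resolution in DPI
        "-f", str(page),  # first page to render (1-based)
        "-l", str(page),  # last page to render (1-based)
        "-singlefile",    # do not append a page number to the output name
        pdf_path,
        out_prefix,
    ]


def rasterize_with_pdftoppm(pdf_path, out_prefix, page=1, dpi=300):
    """Spawn pdftoppm in a child process; raises CalledProcessError on failure."""
    cmd = build_pdftoppm_cmd(pdf_path, out_prefix, page, dpi)
    subprocess.run(cmd, check=True, capture_output=True)
    return out_prefix + ".png"
```

A native binding would avoid the process spawn and the round trip through the file system, at the cost of linking against the library directly.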

@bosd
Collaborator Author

bosd commented Sep 2, 2024

Actually, I should have named the topic better. Ghostscript is the one that should be deprecated (less permissive license, and less performant).
The poppler wrapper pdftopng is still present in the code, but it is not actively tested or declared as a dependency.

camelot's poppler wrapper pdftopng is unmaintained and fraught with installation issues.

Indeed, which is likely the reason it got removed in camelot-fork (which this repo is based upon).
Theoretically, we could re-add it as a dependency and take over maintenance of that package.
But it may be better to use an alternative, which could be pdf2image.

But according to some benchmarks I found on Stack Overflow, pdf2image is slower than pypdfium2.
pyvips seems to be a good alternative as well (it can work with both poppler and pdfium backends).

Since pypdfium2 performs better than pdf2image, it would not make sense to add pdf2image as a backend.

Evaluating the options:
There is (or was) a PR for pypdfium2; it is the easiest way to implement a new backend.
It satisfies the need for easy installation (just a pip install) and has a more liberal license.
WDYT, is it good to use that as the single backend to rasterize PDFs?
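For reference, a pypdfium2-based rasterizer could look roughly like the sketch below. It assumes a pypdfium2 v4-style API (`PdfDocument`, `page.render(scale=...)`, `bitmap.to_pil()`) plus Pillow; the function names are hypothetical and this is not the code from the PR mentioned above.

```python
def dpi_to_scale(dpi):
    """PDF user space is 72 units per inch, so scale 1.0 corresponds to 72 DPI."""
    return dpi / 72.0


def rasterize_with_pypdfium2(pdf_path, png_path, page_index=0, dpi=300):
    """Render one page (0-based index) to a PNG file using pypdfium2."""
    # Imported lazily: third-party package, installed with `pip install pypdfium2`.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        page = pdf[page_index]
        bitmap = page.render(scale=dpi_to_scale(dpi))
        bitmap.to_pil().save(png_path)  # to_pil() requires Pillow
    finally:
        pdf.close()
    return png_path
```

Usage would be along the lines of `rasterize_with_pypdfium2("input.pdf", "page-1.png")`, with no system-level dependency to install.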

@mara004

mara004 commented Sep 2, 2024

Regarding licenses: I'm not sure whether copyleft would actually apply if the GPL-licensed program is invoked via subprocess. IANAL and it's probably a legal grey zone, but this might be considered mere "aggregation", where the aggregate may contain both GPL-licensed and GPL-incompatible components.
IMO this kind of defeats the point of the license, but it might be allowed.

What I don't know is if this also applies to the AGPL, or if it closes this loophole.

@stefan6419846

Let's put the usual disclaimer first: IANAL.

Usually, the copyleft effect does not affect subprocess calls (unless they are "exchanging complex internal data structures") and thus it should be safe. Speaking of the AGPL, the main difference is that it fixes the SaaS (Software as a Service) loophole by marking remote access as distribution, while the GPL does not require the source code to be provided as long as we are speaking of SaaS - the loophole you mention still exists.

If not before, at least at runtime, Ghostscript will expose the copyleft effect, making any direct C-level bindings to Ghostscript unusable if we want to avoid the strong copyleft effect of both GPL-3.0-or-later (Python bindings) and AGPL-3.0-or-later (Ghostscript itself, linked through ctypes).

Directly linking to poppler code through C wrappers can also become an issue if we want to avoid the strong copyleft effect, as it is subject to GPL-2.0-or-later as well.

From my experience, each rendering tool has its own limitations - depending on the file itself, some tools might be suited better for a specific type of PDF files. And even poppler ships two different renderers: The old pdftoppm and the new pdftocairo. Both are supported by pdf2image through the use_pdftocairo parameter, defaulting to pdftoppm at the moment.
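As an illustration of the two poppler renderers, the sketch below switches between them through pdf2image's `use_pdftocairo` parameter. The helper names are hypothetical; `convert_from_path` and its `dpi`, `first_page`, `last_page` and `use_pdftocairo` keywords are part of pdf2image, which itself shells out to the poppler tools.

```python
def pdf2image_kwargs(page, dpi=300, use_cairo=False):
    """Keyword arguments for pdf2image.convert_from_path for a single 1-based page."""
    return {
        "dpi": dpi,
        "first_page": page,
        "last_page": page,
        # False selects pdftoppm (the current default), True selects pdftocairo.
        "use_pdftocairo": use_cairo,
    }


def rasterize_with_pdf2image(pdf_path, png_path, page=1, dpi=300, use_cairo=False):
    """Render one page to PNG through pdf2image (requires poppler utilities on PATH)."""
    # Imported lazily: third-party package, installed with `pip install pdf2image`.
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, **pdf2image_kwargs(page, dpi, use_cairo))
    images[0].save(png_path)  # PIL infers the PNG format from the file extension
    return png_path
```

Rendering the same problematic page once with each renderer is a cheap way to check whether a rendering issue is specific to one of them.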

@mara004

mara004 commented Sep 2, 2024

Thanks for the explanation @stefan6419846!
Sorry, I forgot camelot is calling ghostscript through ctypes (I presumed it would use subprocess, like ocrmypdf does).
pdftopng is currently integrated via subprocess, though.

So basically, this project needs to decide whether to use subprocess for licensing advantages, or native bindings for performance advantages, and whether to add/remove any backend.

I'd suggest to keep poppler and ghostscript backends for diversity, but to replace the pdftopng wrapper with some other poppler interface, and add a liberal-licensed alternative like pdfium or maybe pdfbox so you can use native bindings for the (A)GPL-licensed backends.

@bosd
Collaborator Author

bosd commented Sep 2, 2024

About the licensing (IANAL):

I started down this road after this comment,
and after doing some more reading of understanding-the-agpl-the-most-misunderstood-license and what-are-apache-gpl-and-agpl-licenses,
and of course the Ghostscript license.

I'm in doubt of his conclusion of:

I mean if you use AGPL package, by default means that you need to distribute it under AGPL license only.

Are we distributing it? As stated over in the other repo:

he/she needs to download and install Ghostscript separately.

If the AGPL license is a problem for a large company
(large-company-x-is-telling-me-that-they-cant-use-agpl-what-should-i-do),
they could opt for the commercially licensed version of Ghostscript.

So maybe we can keep Ghostscript in as a fallback.
But since there is a lot of confusion about its licensing (and installation issues), it may be better to remove it.

Backends stuff

From my experience, each rendering tool has its own limitations - depending on the file itself, some tools might be suited better for a specific type of PDF files.

I agree, but there is also a balance to strike. All the backends need to be maintained,
and it is a lot of work to support many of them and keep them running. We are still a tiny group of contributors here.

A good middle road would be a solution with a simple installation process and good performance,
with the possibility for power users to use more backends.

How about this proposal:

  1. Add pypdfium2 as the default backend.
  2. Replace the broken implementation of pdftopng [1] with pdf2image [2].
  3. Keep ghostscript (for now)

The goal of having pdf_table_extraction easily work OOTB would be satisfied by a pypdfium2 backend.
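A rough sketch of how such a default-plus-optional-backends setup could be resolved at runtime is shown below. The registry, its preference order and the function names are hypothetical and only mirror the proposal above; the module names are the import names of the respective PyPI packages.

```python
import importlib.util

# Hypothetical preference order: backend name -> importable module name.
BACKEND_MODULES = {
    "pypdfium2": "pypdfium2",      # proposed default, plain pip install
    "pdf2image": "pdf2image",      # poppler via subprocess
    "ghostscript": "ghostscript",  # kept for now as an optional fallback
}


def available_backends(candidates=BACKEND_MODULES):
    """Return the backend names whose module can be imported, in preference order."""
    return [
        name
        for name, module in candidates.items()
        if importlib.util.find_spec(module) is not None
    ]


def pick_backend(requested=None, candidates=BACKEND_MODULES):
    """Return the requested backend if installed, else the first installed one."""
    installed = available_backends(candidates)
    if requested is not None:
        if requested not in installed:
            raise ValueError(f"backend {requested!r} is not installed")
        return requested
    if not installed:
        raise RuntimeError("no rasterization backend is installed")
    return installed[0]
```

With this shape, a plain install picks the default automatically, while power users can pass an explicit backend name.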

The interesting thing is that there are some PRs by @mara004 which, with a little bit of work, could be merged here to achieve points 1 and 2.

Edit: This is basically the same as Mara's suggestion 😄

Footnotes

  1. Unless someone wants to take over the maintenance of pdftopng within 2 months.

  2. Or another native binding to poppler, if someone is willing to put in a PR.

@stefan6419846

I'm in doubt of his conclusion of:

I mean if you use AGPL package, by default means that you need to distribute it under AGPL license only.

Are we distributing it? As stated over in the other repo:

he/she needs to download and install Ghostscript separately.

camelot/pypdf-table-extraction could in theory have any license. But as soon as any C level bindings to Ghostscript are required for installing it, building commercial applications on top of it becomes hard. While we could argue that this is a runtime-only issue, having a package without being able to use it due to the risk of having to make all (internal) business logic available under the terms of the AGPL will limit the usage options of pypdf-table-extraction.

In some cases, Ghostscript might already be pre-installed on the system, making the issue less obvious.

If the AGPL license is a problem for a large company
(large-company-x-is-telling-me-that-they-cant-use-agpl-what-should-i-do),
they could opt for the commercially licensed version of Ghostscript.

While this might be an option for larger companies, smaller companies might have or tend to have issues with commercial licenses as in this case.

So maybe we can keep Ghostscript in as a fallback.
But since there is a lot of confusion about its licensing (and installation issues), it may be better to remove it.

As long as Ghostscript is optional and/or subprocess calls are used instead of native wrappers, this would probably be the preferred approach.
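A minimal sketch of the "optional, via subprocess" variant follows. It assumes a separately installed Ghostscript whose executable is called `gs` (on Windows the name differs, e.g. `gswin64c`); the function names are hypothetical, and whether this arrangement is sufficient from a licensing perspective is exactly the open question discussed above.

```python
import shutil
import subprocess


def build_gs_cmd(pdf_path, png_path, page=1, dpi=300, executable="gs"):
    """Build a Ghostscript command that renders one page to a 24-bit PNG."""
    return [
        executable,
        "-q",                   # suppress startup messages
        "-dSAFER",              # restrict file system access
        "-sDEVICE=png16m",      # 24-bit RGB PNG output device
        f"-r{dpi}",             # resolution in DPI
        f"-dFirstPage={page}",  # 1-based page range
        f"-dLastPage={page}",
        "-o", png_path,         # output file; implies batch mode without pausing
        pdf_path,
    ]


def rasterize_with_gs(pdf_path, png_path, page=1, dpi=300):
    """Run Ghostscript as a child process if it is installed, else fail clearly."""
    executable = shutil.which("gs")
    if executable is None:
        raise RuntimeError(
            "Ghostscript not found on PATH; install it separately "
            "or choose another backend"
        )
    subprocess.run(build_gs_cmd(pdf_path, png_path, page, dpi, executable), check=True)
    return png_path
```

Because the executable is looked up only when this backend is used, the package itself can be installed and imported without Ghostscript present.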

@conjuncts

Shall I be the first to mention that dropping poppler for pypdfium2 means that the repo will have both pypdfium2 and pypdf as dependencies? Is the intention to drop pypdf as well? What will become of the name "pypdf_table_extraction"? 👀

@mara004

mara004 commented Oct 12, 2024

Shall I be the first to mention that dropping poppler for pypdfium2 means that the repo will have both pypdfium2 and pypdf as dependencies? Is the intention to drop pypdf as well? What will become of the name "pypdf_table_extraction"? 👀

The same goes when keeping poppler, you need multiple PDF libraries in either case, because pypdf/pdfminer.six don't do PDF rendering.
The name's still OK, as this project is part of the py-pdf organization, and also partly uses pypdf.

@conjuncts

I was mistaken; I missed the fact that the library is built on pdfminer.six, not pypdf. I see, the large amount of pdfminer.six-specific functionality makes it difficult to switch backends.

@snanda85

I've been using pyvips as the default backend for quite some time now.

Two advantages that we get:

  1. It has a cffi-API mode (and a direct libpoppler integration). This combination gives a visible performance boost.
  2. It gives me the ability to preprocess the loaded file before the actual table extraction (use-case-specific preprocessing improves line detection and table extraction for us).

On top of that, it is quite easy to install and is actively maintained.
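The setup described above could be sketched as follows. This is an assumption about how such a backend might look, not the commenter's actual code: it uses pyvips' `Image.pdfload` with its `page`, `n` and `dpi` options, the helper names are hypothetical, and it requires a libvips build with PDF support.

```python
def pdfload_options(page_number, dpi=300):
    """Translate a 1-based page number into pyvips pdfload keyword arguments."""
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be >= 1")
    # pdfload counts pages from 0; n=1 loads exactly one page.
    return {"page": page_number - 1, "n": 1, "dpi": dpi}


def rasterize_with_pyvips(pdf_path, png_path, page_number=1, dpi=300, preprocess=None):
    """Render one page with pyvips, optionally preprocessing it before saving."""
    # Imported lazily: third-party package, needs the libvips shared library.
    import pyvips

    image = pyvips.Image.pdfload(pdf_path, **pdfload_options(page_number, dpi))
    if preprocess is not None:
        # Hook for use-case-specific cleanup before table detection.
        image = preprocess(image)
    image.write_to_file(png_path)
    return png_path
```

A caller could, for example, pass `preprocess=lambda im: im.colourspace("b-w")` to convert the page to grayscale before line detection.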

@bosd
Collaborator Author

bosd commented Oct 23, 2024

I've been using pyvips as the default backend for quite some time now.

That indeed is a very interesting replacement for the current (broken/deprecated) poppler backend.

(The PR for pdfium can be expected soon. First I was focusing on refactors to let flake8 pass.)

@mara004

mara004 commented Oct 23, 2024

IIRC, libvips can be built with either poppler or (preferably) pdfium. See their Readme:

If present, libvips will attempt to load PDFs with PDFium. [...]
If PDFium is not detected, libvips will look for poppler-glib instead.

I'm not sure if compile-time either-or pdf library wrappers are the way to go, if the ability to select different PDF libraries at runtime is desired? That said, vips of course looks like a great image processing library.


5 participants