Future Road Map #154

tataganesh · 2018-06-16T15:00:55Z

Hi everyone!

Since the inception of pdfminer.six, a lot of improvements have been made, and several issues have been fixed. I would first like to thank every contributor for having kept this project alive! We are all well aware of the difficulties of parsing PDF documents, and I am sure pdfminer.six has made it easier for developers to extract information from PDFs.

But, there are more issues cropping up, and a lot of PRs are pending as well. Documentation, too, is pending. With the increase in these incomplete tasks, it is time that we decide on how to take this project forward. There needs to be a road map created for the future development of this project. It is not necessary to have people completely dedicated to it, but we at least need to create some specific targets / goals, so that the project becomes more concrete and we can ensure its stability.

For starters, I have reached out to an audience through the dev.to platform so that more people can become aware of this wonderful project and start contributing.

I myself am new to open source, and I have been the admin of this project for some time. Sadly, I haven't been able to give it much time, but I am sure that if we can make a good plan for the future of this project, the quality of the project can be improved. I would LOVE to hear your thoughts on this!
Update - I have created a Gitter chatroom for having discussions regarding the project.

pietermarsman · 2019-10-13T18:30:21Z

Things we could focus on

This is just a list of major things we can do.

Fix bugs and add tests

Most open issues are about bugs in pdfminer (59). Most of them are small, specific and disjoint. Fixing them is also a great opportunity to extend the test suite, which in turn makes it easier to add new features. The goal is zero known bugs!

Improve documentation for users

Some of the issues are about adding documentation (5) or are actually questions on how to use pdfminer (17). StackOverflow also has many questions on how to use pdfminer. I know from my own experience that using pdfminer is confusing at first. Creating a place with up-to-date documentation for the most basic usage scenarios will help enormously. The goal is to make it pleasant for developers that are new to pdfminer to do basic things with it!

Define stable high-level entry points (i.e. an API)

Currently, pdfminer does not have a stable and easy-to-use API. Most code examples use five or more classes from pdfminer to do something as simple as extracting text from a PDF. I think that having a set of well-defined functions/classes that can handle most basic tasks makes using pdfminer easier and also more consistent (through time, and by different people). The goal is to create an API that is stable so that we can change the internals with no (or fewer) people noticing it.

Add new features

Some of the issues are about "enhancing" pdfminer (20), e.g. adding new features or changing default values. I think pdfminer already has many features and that most feature requests can wait a while.

Improve code documentation

Almost all code is undocumented and that makes it harder to contribute, especially for newcomers. I think the best way to improve this is to consistently add/improve code documentation for all code that is touched by any PR. The goal is to easily understand the responsibility of each part of pdfminer.

Drop Python2 support

Python 2 is no longer supported by the Python development team as of January 2020. We should drop Python 2 support at that same moment (Drop python 2 support #194).

Introduce best open-source practices

For example: using semver (one day, user semver? #255), keeping the changelog up to date, using Git Large File Storage (Start using GIT LFS for binaries #114), automating the release process using Travis, and adding code-style enforcement (PEP-8 #92).
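To illustrate the point above about needing five or more classes for plain text extraction, here is a minimal sketch of the composable approach as it typically appears in tutorials. It assumes a reasonably recent pdfminer.six release; the function name `extract_text_composable` is made up for this illustration and is not part of the library.

```python
from io import StringIO


def extract_text_composable(pdf_path: str) -> str:
    """Extract all text from a PDF by wiring the low-level classes together."""
    # pdfminer.six is imported lazily so that this module can still be
    # imported on a machine where the library is not installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output = StringIO()
    with open(pdf_path, "rb") as in_file:
        parser = PDFParser(in_file)               # tokenizes the raw file
        document = PDFDocument(parser)            # resolves the object structure
        resource_manager = PDFResourceManager()   # caches fonts and other shared resources
        # The converter is the "device" that receives rendered text and
        # writes it to the output buffer, using default layout parameters.
        device = TextConverter(resource_manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(resource_manager, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
    return output.getvalue()
```

Seven classes are involved here for a task most users would describe as "give me the text", which is the usability gap a stable high-level API is meant to close.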

Where to start

I think we should focus first on fixing bugs and adding tests. The number of reported bugs is too large, some of them are really old, and the test coverage is too low.

I think we should focus less on major changes until we have fixed most of the bugs. This includes adding new features, dropping python 2 support, improving documentation, creating a stable API, etc.

Since Python2 is no longer supported as of January 2020, I think we should postpone the major changes until then. In 2019 we can do one or more releases that don't contain any breaking changes and use the old versioning system. These releases will be useful to everybody. With most of the bugs fixed, we can start making major breaking changes from the beginning of 2020, e.g. drop Python2 support, change the versioning system and deprecate/remove some of the old APIs.

pietermarsman · 2019-10-27T13:42:18Z

This week I will work on README.md and read-the-docs documentation.

Recursing · 2020-01-06T00:59:51Z

Since python 2 is dead, is there a way to merge this project with pdfminer and pdfminer3, to prevent duplicate work?

pietermarsman · 2020-01-06T21:46:20Z

A quote from @igormp:

> we've discussed it before, and sadly it doesn't seem like it's possible. pdfminer3 seems to be abandoned, and euske seems to have no interest in merging both projects.

pietermarsman · 2020-01-21T22:19:28Z

Status update:

Fix bugs and add tests

A lot of bugs were fixed. The CHANGELOG.md lists 15 fixes since 2019-10-20. For most of those bugs, tests were added. There are still 27 issues that are labeled as bugs (compared to 59 earlier).

Improve documentation for users

We have a readthedocs now, but it does not contain a lot of examples. We should probably add more.

Define stable high-level entry points (i.e. an API)

The pdf2txt.py and dumppdf.py command-line tools have always been stable. Two functions were added to the high-level API: extract_text() and extract_pages(). These high-level API functions serve many needs but could be more widely used (instead of the composable API).
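As a rough sketch of how the two high-level functions mentioned above can be used, assuming a pdfminer.six release that ships `pdfminer.high_level`. The wrapper names `text_of` and `count_layout_elements` are hypothetical and exist only for this illustration.

```python
from typing import List, Optional


def text_of(pdf_path: str, pages: Optional[List[int]] = None) -> str:
    """Return the text of a PDF, optionally limited to zero-indexed pages."""
    # Lazy import so this module loads even without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=pages)


def count_layout_elements(pdf_path: str) -> List[int]:
    """Return, for each page, the number of top-level layout elements."""
    from pdfminer.high_level import extract_pages

    # extract_pages() yields one layout object per page; iterating a page
    # yields its top-level layout elements (text boxes, figures, ...).
    return [sum(1 for _ in page_layout) for page_layout in extract_pages(pdf_path)]
```

For example, `text_of("report.pdf", pages=[0])` would return only the first page's text, where `report.pdf` stands for any local file.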

Add new features

Since October 2019, 3 new features were added according to the CHANGELOG.md. There are 11 issues that are labelled as enhancement. Still some work to do here...

Improve code documentation

The command-line utilities are better documented, the high-level functions got better documentation, but there is still a lot of work to do on all the classes and functions.

Drop Python2 support

Done!

Introduce best open-source practices

We are not going to use semver (because PyPI would get untenably confused). There is also no progress on Git LFS or on automatic releasing using Travis. On the bright side, the CHANGELOG.md is always up to date and code-style enforcement is in use.
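The comment does not spell out why PyPI would get confused. One plausible reading, stated here as an assumption, is that the existing date-based version numbers sort numerically above any low semver number, so a new `1.0.0` would look older than every release already published. The sketch below uses a simplified stand-in for version ordering (numeric, dot-separated, zero-padded release segments only); `release_key` and `newest` are illustrative helpers, not part of any packaging tool.

```python
from typing import Iterable, Tuple


def release_key(version: str, width: int = 3) -> Tuple[int, ...]:
    """Turn '1.2' into (1, 2, 0) so versions compare numerically."""
    parts = tuple(int(part) for part in version.split("."))
    # Pad with zeros so '1.0' and '1.0.0' compare as equal.
    return parts + (0,) * (width - len(parts))


def newest(versions: Iterable[str]) -> str:
    """Return the version an installer would treat as the latest."""
    return max(versions, key=release_key)


# Two date-based versions that appear later in this thread, plus a
# hypothetical first semver release.
print(newest(["20200517", "20211012", "1.0.0"]))
```

Under this ordering the hypothetical `1.0.0` never wins against a date-based release, so installers asking for the latest version would keep picking the old numbering scheme.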

What to do next

I still think we should focus first on fixing bugs and adding tests. The number of reported bugs is too large, and some of them are really old.

But I also noticed in the last months that some reported bugs are not actually bugs but rather questions. These issues, and also e.g. questions on stackoverflow, indicate that it is difficult to use pdfminer(.six). So improving the documentation is key.

I think we should focus less on adding new features until we have fixed most of the bugs.

KunalGehlot · 2022-08-08T04:46:38Z

A small addition to what @pietermarsman already mentioned

I think documenting the code and improving read-the-docs is one of the most important things.
After using this library for more than a year, I noticed most people are using the library (including myself) by Frankenstein-ing the code from read-the-docs tutorials and StackOverflow answers without completely understanding what each class/method is doing.

We should start by explaining all the methods and classes used in the Tutorial demos. That will solve many problems people face, because they will be able to relate the errors they see to their understanding of the library.

Being new to open source and this code base, I'll start contributing by helping with the issues and trying to improve the documentation.

julie777 · 2022-09-16T17:46:24Z

> I myself am new to open source, and I have been the admin of this project for some time. Sadly, I haven't been able to give it much time, but I am sure that if we can make a good plan for the future of this project, the quality of the project can be improved. I would LOVE to hear your thoughts on this! Update - I have created a Gitter chatroom for having discussions regarding the project.

I would much prefer using github discussions instead of a chatroom. That way the discussions are part of the project. The wiki could also be used to capture the results of the discussion and the resulting roadmap.

igormp · 2022-09-19T01:24:13Z

> I would much prefer using github discussions instead of a chatroom. That way the discussions are part of the project. The wiki could also be used to capture the results of the discussion and the resulting roadmap.

IIRC, the discussions feature wasn't a thing back then. Seeing how the gitter isn't that active, I guess that would be a good idea in order to properly organize any discussion into threads without needing to search through all of the chatroom history.

vilabho · 2023-02-28T18:57:53Z

I would like to update the current documentation of pdfminer, but whom should I tag for PR approval? It seems this repo has been dormant for months... If anybody is maintaining it, please mention them.

pietermarsman · 2023-03-03T11:25:53Z

Hi @vilabho,

I'm the dormant maintainer with merge permissions. I've been meaning to do some work over the last months / year but haven't gotten to it. Help on proper documentation is very much appreciated.

dhdaines · 2023-10-13T23:19:51Z

> IIRC, the discussions feature wasn't a thing back then. Seeing how the gitter isn't that active, I guess that would be a good idea in order to properly organize any discussion into threads without needing to search through all of the chatroom history.

The chatroom doesn't actually seem to exist anymore! So searching its history is no longer an option :(

But more on topic ... I have been submitting PRs to pdfplumber to properly support tagged PDFs that should really be features in pdfminer.six, e.g. jsvine/pdfplumber#961 and jsvine/pdfplumber#963. The reason I haven't done this is that it doesn't appear that pdfminer.six will be maintained at this point, so it doesn't seem worthwhile to put in the extra effort to create a fork/PR that can't actually be depended upon in the foreseeable future.

Is there any possibility that bugfixes, optimizations, and documentation enhancements will be merged at any point soon, let alone new features?

pietermarsman · 2023-10-14T10:40:46Z

Long story short: we are looking for new maintainers

@dhdaines I am sorry that I was not more active in the last years. Unfortunately, I cannot be as active as I was when I started as a maintainer of pdfminer.six. The current situation is much like when I took over from @goulu.

For the future of pdfminer.six it would be very beneficial if we had a maintainer again with time and energy to guide this project. I'm tagging all potential candidates below. But feel free to respond here as well if you are not in the list.

Right now there are 4 owners:

@euske, the original creator of https://github.com/euske/pdfminer
@pudo
@goulu
and me

There are also 5 other members of the pdfminer.six organization:

There are also some people that contributed more than once (all 3 commits or more):

(Have not thought of a procedure for picking a new maintainer yet).

sergei-maertens · 2023-10-14T10:46:29Z

I'm sorry, but I don't use the project anymore nor do I have time to step in. Good luck finding a candidate though!

dhdaines · 2023-10-15T00:04:00Z

> Long story short: we are looking for new maintainers
>
> @dhdaines I am sorry that I was not more active in the last years. Unfortunately, I cannot be as active as I was when I started as a maintainer of pdfminer.six. The current situation is much like when I took over from @goulu.

Thank you for the quick reply... as the maintainer of a rather old project I totally understand!

I think the underlying question is whether the project is still relevant enough and used enough to be maintained - I have to admit that I only actually use it via pdfplumber which has a somewhat more Pythonic API while still giving low-level access to the PDF structure.

Since there are a variety of other options for high-level manipulation and text extraction (if you only want text...) from PDFs, I wonder if it would make sense to simply merge the two projects.

NickFabry · 2023-10-20T23:41:21Z

FWIW, I use pdfminer every day; it's still been the only PDF library I've encountered which attempts to account accurately for whitespace and actual page position of text elements in a consistent way. Sometimes the structured data you need is buried in a PDF, and you don't have an alternative source...

I'd love to keep it going, but I don't know if I have the skills to maintain it. A long time ago (15+ years?) I worked a little with @euske on improving PDFMiner, so I'm quite fond of it. I'd put up my hand if nobody else would.

pettzilla1 · 2023-10-21T12:19:45Z

Hi, happy to say it's still incredibly relevant. pdfminer.six is incredibly useful for PDF parsing, and it has a permissive license, which most other libraries don't have. We still use it daily.

dhdaines · 2023-10-22T18:12:13Z

> FWIW, I use pdfminer every day; it's still been the only PDF library I've encountered which attempts to account accurately for whitespace and actual page position of text elements in a consistent way. Sometimes the structured data you need is buried in a PDF, and you don't have an alternative source...

Yes, exactly - from what I've seen most PDF libraries work hard to hide the hideous, horrible complexity of the PDF format from you, which is fine if you just want to dump a load of text into a large language model, not so great if you want to use layout information. This, plus pure-Python and permissive license, make pdfminer (and by extension pdfplumber) relevant in my opinion.
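A small sketch of what "using layout information" can look like in practice, assuming a pdfminer.six release that provides `extract_pages` and `LTTextContainer`. The `TextBox` type and both helper functions are invented for this example; the sort relies on the fact that PDF coordinates have their origin at the bottom-left of the page.

```python
from typing import Iterable, Iterator, List, NamedTuple


class TextBox(NamedTuple):
    """A piece of text together with its bounding box in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def iter_text_boxes(pdf_path: str) -> Iterator[TextBox]:
    """Yield every top-level text container of every page with its position."""
    # Lazy import so the pure-Python helper below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # element.bbox is the tuple (x0, y0, x1, y1).
                yield TextBox(*element.bbox, element.get_text())


def reading_order(boxes: Iterable[TextBox]) -> List[TextBox]:
    """Sort boxes top-to-bottom, then left-to-right.

    A larger y value means higher up on the page, so the top edge (y1)
    is negated to bring the topmost boxes first.
    """
    return sorted(boxes, key=lambda box: (-box.y1, box.x0))
```

This naive ordering is only meant to show that positions are available to work with; real documents with multiple columns or rotated text need more careful handling, and the boxes of different pages would have to be sorted separately.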

I would be willing to help out with maintenance as well. I could also definitely contribute some improvements to documentation and performance.

WolfgangFahl · 2023-11-15T10:52:25Z

The CI is currently broken; see https://github.com/pdfminer/pdfminer.six/actions/runs/6793184760. Fixing it might be the first area of improvement, to make sure commits can be made without breaking the current state of affairs. If you invite me, I might try some fixes to get the CI working again, e.g. simple things such as code formatting.

FriedrichFroebel · 2023-11-15T15:52:12Z

Getting the CI working again should be something which can be ensured on a fork and then submitted as a PR. If there really is some ongoing activity on this repository, merging the CI fixes first from the maintainer side is still possible - no need to directly grant you write permissions.

dhdaines · 2023-11-15T21:54:39Z

> Getting the CI working again should be something which can be ensured on a fork and then submitted as a PR. If there really is some ongoing activity on this repository, merging the CI fixes first from the maintainer side is still possible - no need to directly grant you write permissions.

Looks like it's mainly a case of looking for obsolete Python versions on Ubuntu latest. I'll take a look right now on my fork.

dhdaines · 2023-11-15T22:05:53Z

Well, it's a bit more than just Python versions, because there's an unversioned dependency on black among other things. Working on this now here: #921

dhdaines · 2023-11-15T22:47:39Z

And now CI passes: https://github.com/pdfminer/pdfminer.six/actions/runs/6883805593?pr=921

I did this with the minimal amount of code changes, but there are things that will need to be fixed so we can actually use the latest pip and setuptools for instance. They should go in a separate PR.

Now the $921 question! Can someone merge this? @pietermarsman ?

dhdaines · 2023-11-16T14:26:03Z

A secondary PR to also fix building with current pip/setuptools (in Python 3.12): #923

pietermarsman · 2023-11-24T19:17:26Z

Thanks @dhdaines for the work! I merged #921 and will try to look at #923 in the coming days.

pietermarsman · 2023-11-24T19:30:22Z

@WolfgangFahl I'm positive about new contributors, but hesitant to hand out permissions quickly. I see you have not contributed to issues or PRs yet; that would be a great start for any contributor. From your profile it looks like you are an active coder, and we could definitely benefit from your knowledge when triaging issues and PRs.

For PRs I prefer to have at least one review, and not to commit directly to master.

WolfgangFahl · 2023-11-25T05:52:02Z

@pietermarsman thanks for looking into my offer again. It was an if sentence and the decision was not to invite me. I accepted that decision.

suryavaddiraju · 2023-12-18T01:31:43Z

> A secondary PR to also fix building with current pip/setuptools (in Python 3.12): #923

Yes, I can build the setuptools integration. Also, with the new PyPI trusted publishers it is now very easy to build Python packages with GitHub workflow automation. But for a change, we could eliminate setuptools and adopt Python's newer standard packaging procedures using hatchling. From now on I will lend a hand and support this package, and make sure it sets a new standard for Python PDF users.

tataganesh added help wanted labels Jun 16, 2018

pietermarsman added the type: discussion label Oct 13, 2019

pietermarsman pinned this issue Oct 13, 2019

pietermarsman removed the announcement label Oct 28, 2019

pietermarsman removed the help wanted label Jan 14, 2020

jsvine mentioned this issue Nov 1, 2021

pdfplumber 0.5.28 requires pdfminer.six==20200517, but you have pdfminer-six 20211012 which is incompatible jsvine/pdfplumber#531

Closed

datatalking mentioned this issue Jul 20, 2022

Type Error during extracting pages in some pdfs #720

Closed

WolfgangFahl mentioned this issue Nov 15, 2023

Next release #915

Closed
