Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfixes in utils.split_textline, core.Table.to_excel + various linting changes #72

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

matt-dies-tenet3
Copy link

@matt-dies-tenet3 matt-dies-tenet3 commented Jul 12, 2023

Hi there @foarsitter!

First off, thanks for all the work you've done on camelot-fork, as well as helping to revive camelot. I read through this issue where I saw that you were added a maintainer to the project and are working to merge your changes in from this project to enable Python3.9+ compatibility. With that being said, I've used camelot-fork a bit and have noticed some bugs alongside @josht3.

Below is documentation on each of the changes:

  1. core.py:
    • Table.to_excel method bug-fix: With the installed version of pandas in camelot-fork, the pd.ExcelWriter object no longer has a write method. I've updated this method accordingly to write the file in the recommended manner.
    • Linting changes: In my personal usage, I found the type hinting of the original camelot package left a lot to be desired. My personal use case made the Table and TableList objects most important, so I've added type hinting to these methods that enable type checkers and linters to provide useful hints when using these objects or functions that return them. For example, I used to get linting errors about iterating through a TableList, whereas now I don't! The linter is also able to tell that the objects within the iterable are Tables. Just a small QOL change that went a long way in my use-case, and likely will for others.
  2. lattice.py:
    • Ensuring indices has elements: This is a small but mighty change! We found errors consistently at this line. After returning from lattice.get_table_index, there is no check for whether the indices list has any elements. This slight modification does the indices reduction if the indices list has no elements.
    • Import organization: See below.
  3. utils.py:
    • Ensuring cut_text elements remain "in bounds" of table in split_textline: The two changes here are for the horizontal and vertical axes within the table. The horizontal change ensures that the maximum cut point is the furthest right column, so it can no longer expand beyond the bounds of the table to the right. The vertical change ensures that the minimum cut point is 0, so it can no longer expand beyond the bounds of the table above it.
    • Import organization: See below.
  4. handlers.py, io.py: Just organization of imports. These files were modified automatically by my import sorter. It doesn't particularly hurt or help at all, except maybe improving readability.

@matt-dies-tenet3
Copy link
Author

matt-dies-tenet3 commented Jul 12, 2023

Looks like these tests are failing because there is no sudo apt update preceding the sudo apt install ghostscript command in tests.yml. I'm going to add it and re-submit, one sec.

This change should fix it according to this link.

@foarsitter
Copy link
Owner

Thanks for your pull-request @matt-dies-tenet3

As long the tests are succeeding, the coverage is increasing and the readability is improving I see no reason to decline your pull-request. We are not dealing with an ideal situation so we can bent the rules.

Keep sending them in and if you want I can even make you maintainer.

@foarsitter foarsitter self-requested a review July 13, 2023 08:20
@matt-dies-tenet3 matt-dies-tenet3 marked this pull request as ready for review August 23, 2023 13:35
@matt-dies-tenet3
Copy link
Author

Hey again @foarsitter!

Sorry for the delay on marking the PR as ready. Had to jump through a few hoops. Everything should be good for review now.

@bosd
Copy link

bosd commented Aug 11, 2024

@foarsitter Is this PR a candidate to get opened and merged in the new repo?
cc: @matt-dies-tenet3
https://github.com/py-pdf/pypdf_table_extraction

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants