Index error on Hybrid Parser #252

Open
bosd opened this issue Nov 1, 2024 · 0 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

bosd (Collaborator) commented Nov 1, 2024

Describe the bug

In some cases, the Hybrid parser raises an index error on a multipage PDF.
The problem is described and tested in #251.
What was merged there is a workaround rather than a fix:
the parser now fails gracefully.
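
To narrow down which pages of a multipage PDF trigger the error, one option is to parse the pages one at a time and collect those that raise `IndexError`. The sketch below is only an illustration: `find_failing_pages`, `parse_page` and `fake_parse` are hypothetical names, and the stand-in parser fails on an arbitrarily chosen page so that the snippet runs without camelot or a PDF.

```python
from typing import Callable, Iterable, List, Tuple


def find_failing_pages(
    parse_page: Callable[[int], object],
    pages: Iterable[int],
) -> Tuple[List[int], List[int]]:
    """Parse pages one at a time and separate those that raise IndexError.

    `parse_page` is a hypothetical callable supplied by the caller; it is
    expected to parse a single page and to raise IndexError on failure.
    """
    ok, failed = [], []
    for page in pages:
        try:
            parse_page(page)
        except IndexError:
            failed.append(page)
        else:
            ok.append(page)
    return ok, failed


# Stand-in parser for demonstration only: it pretends that page 2 fails.
def fake_parse(page: int) -> list:
    if page == 2:
        raise IndexError("list index out of range")
    return ["table"]


ok_pages, failed_pages = find_failing_pages(fake_parse, range(1, 6))
print(ok_pages, failed_pages)
```

With camelot, `parse_page` could be something like `lambda p: camelot.read_pdf(filename, flavor="hybrid", pages=str(p))`. This only helps on versions where the error still propagates to the caller; with the workaround from #251 the parser may return without raising.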

Steps to reproduce the bug

See the following test:

import os

import camelot


def test_network_no_infinite_execution(testdir):
    """Test for not infinite execution.

    This test used to fail, because the network parser wasn't able to process the tables on this page.
    After a refactor it stops infinite execution. But the parsing result could be improved.
    Hence this is not a qualitative test.
    """
    filename = os.path.join(testdir, "tabula/schools.pdf")
    tables = camelot.read_pdf(
        filename, flavor="network", backend="ghostscript", pages="4"
    )
    assert len(tables) >= 1

Expected behavior

A potentially better fix would be to re-assemble the parts of the table detected by the network parser inside the hybrid parser.
That part of the code also contains a TODO note from the original author.

def _generate_columns_and_rows(self, bbox, user_cols):
    # select elements which lie within table_bbox
    self.t_bbox = text_in_bbox_per_axis(
        bbox, self.horizontal_text, self.vertical_text
    )
    all_tls = list(
        sorted(
            filter(
                lambda textline: len(textline.get_text().strip()) > 0,
                self.t_bbox["horizontal"] + self.t_bbox["vertical"],
            ),
            key=lambda textline: (-textline.y0, textline.x0),
        )
    )
    text_x_min, text_y_min, text_x_max, text_y_max = bbox_from_textlines(all_tls)
    # FRHTODO:
    # This algorithm takes the horizontal textlines in the bbox, and groups
    # them into rows based on their bottom y0.
    # That's wrong: it misses the vertical items, and misses out on all
    # the alignment identification work we've done earlier.
    rows_grouped = self._group_rows(all_tls, row_tol=self.row_tol)
    rows = self._join_rows(rows_grouped, text_y_max, text_y_min)
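
To illustrate the kind of grouping the TODO note describes, here is a minimal, self-contained sketch of grouping text lines into rows by their bottom `y0` within a tolerance. It is not camelot's `_group_rows` implementation: `TextLine` and `group_rows_by_y0` are hypothetical stand-ins, and the coordinates are made up. The example is meant to show one way a vertical item could confuse a purely `y0`-based grouping.

```python
from collections import namedtuple

# Minimal stand-in for a text line: only the fields used here (assumption).
TextLine = namedtuple("TextLine", ["text", "x0", "y0"])


def group_rows_by_y0(textlines, row_tol=2):
    """Group text lines into rows when their bottom edges (y0) lie within row_tol."""
    rows = []
    # Sort top-to-bottom, then left-to-right, like the sort key in the excerpt.
    for tl in sorted(textlines, key=lambda t: (-t.y0, t.x0)):
        # Compare against the first line of the current row.
        if rows and abs(rows[-1][0].y0 - tl.y0) <= row_tol:
            rows[-1].append(tl)
        else:
            rows.append([tl])
    # Order each row left-to-right for readability.
    return [sorted(row, key=lambda t: t.x0) for row in rows]


lines = [
    TextLine("Name", 10, 700),
    TextLine("City", 120, 701),
    TextLine("Alice", 10, 680),
    TextLine("Paris", 120, 680.5),
    # A rotated (vertical) label whose bottom edge falls between the two rows.
    TextLine("Notes", 300, 690),
]

rows = group_rows_by_y0(lines)
row_texts = [[tl.text for tl in row] for row in rows]
print(row_texts)
```

In this made-up layout the vertical label ends up in a row of its own between the two real rows, because only its `y0` is considered and none of the alignment information is used.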

PDF

tabula/schools.pdf

bosd added the bug and help wanted labels Nov 1, 2024