TestTableDetection.[35]: Expected one table and detected two #14

Open
melisabok opened this issue Mar 3, 2017 · 2 comments

Comments


melisabok commented Mar 3, 2017

File us-009.pdf

The new code detects more rulings than the old code.

New rulings:
new_with_text

Old rulings:
old_with_text

That's why it detects 2 tables instead of 1; see the images:

New:
us-009-1

Old:
us-009-1

I think it is OK to detect 2 tables. What should we do in this case?

melisabok changed the title from "TestTableDetection.[35] Expected one table and detected two" to "TestTableDetection.[35]: Expected one table and detected two" on Mar 3, 2017

jazzido commented Mar 3, 2017

That's an interesting side effect of the improvements in PDFBox 2.0: the old version missed some lines.

Also, we've run into this case before. Sometimes the table detection algorithm picks up two "tables", one contained inside the other. Unfortunately, we haven't arrived at a decision on what to do. My guess is that we should build a tree of rectangles (using containedIn as the linkage criterion) and keep the outermost element. @jeremybmerrill any ideas?
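
A minimal sketch of the "keep the outermost element" idea, under some assumptions: it uses plain `java.awt.geom.Rectangle2D` rather than the project's own rectangle class, and the class name `OutermostTables` and method `keepOutermost` are hypothetical. Instead of building the containment tree explicitly, it keeps only the rectangles that are not contained in any other one, which are the roots such a tree would have.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class OutermostTables {
    // Hypothetical helper: returns the rectangles that are not contained
    // in any other rectangle of the input list (the containment "roots").
    static List<Rectangle2D> keepOutermost(List<Rectangle2D> areas) {
        List<Rectangle2D> result = new ArrayList<>();
        for (int i = 0; i < areas.size(); i++) {
            Rectangle2D candidate = areas.get(i);
            boolean nested = false;
            for (int j = 0; j < areas.size() && !nested; j++) {
                if (i == j) continue;
                Rectangle2D other = areas.get(j);
                if (other.contains(candidate)) {
                    // Identical rectangles contain each other;
                    // in that case keep only the first occurrence.
                    nested = !candidate.contains(other) || j < i;
                }
            }
            if (!nested) result.add(candidate);
        }
        return result;
    }
}
```

This is quadratic in the number of detected areas, which is probably acceptable for the handful of candidates on a page, and the result does not depend on the order of the input list.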


melisabok commented Mar 5, 2017

I found the comparator in the NurminenDetectionAlgorithm and I made a fix to make the tests pass.

I'm not sure if this is the right solution, because this comparator doesn't ensure that the TreeSet keeps the outermost table; which one is kept depends on the order of the tables passed to addAll:

tableSet.addAll(tableAreas);

With this fix all the TestTableDetection tests are passing.
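
To illustrate the order dependence described above, here is a small sketch. The comparator is hypothetical (it is not the one in NurminenDetectionAlgorithm) and uses plain `java.awt.geom.Rectangle2D`: it treats two rectangles as equal whenever one contains the other, so the TreeSet silently keeps whichever of the two was inserted first.

```java
import java.awt.geom.Rectangle2D;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TreeSetOrderDemo {
    // Hypothetical comparator: nested rectangles compare as "equal",
    // everything else is ordered by top-left corner.
    static final Comparator<Rectangle2D> NESTED_AS_EQUAL = (a, b) -> {
        if (a.contains(b) || b.contains(a)) return 0;
        int byY = Double.compare(a.getY(), b.getY());
        return byY != 0 ? byY : Double.compare(a.getX(), b.getX());
    };

    // Mirrors the tableSet.addAll(tableAreas) call: elements are added
    // one by one, and an element that compares equal to an existing
    // one is dropped.
    static Set<Rectangle2D> dedupe(List<Rectangle2D> tableAreas) {
        Set<Rectangle2D> tableSet = new TreeSet<>(NESTED_AS_EQUAL);
        tableSet.addAll(tableAreas);
        return tableSet;
    }
}
```

Calling `dedupe` with the outer rectangle first keeps the outer one; calling it with the inner rectangle first keeps the inner one. A comparator like this is also not a consistent total order, so with more than two overlapping areas the outcome can depend on which elements happen to be compared inside the tree.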
