Watch out for regressions in the Table Detector #11

Open
jazzido opened this issue Feb 25, 2017 · 3 comments

Comments

@jazzido

jazzido commented Feb 25, 2017

Hey @melisabok,

I just ran the tests in your branch. There's a special test suite for running the table detection algorithms.

Cumulative results for your branch are:

48 out of 67 currently passing
137 out of 156 expected tables detected
34 tables incorrectly detected

master, on the other hand, gives these results:

51 out of 67 currently passing
140 out of 156 expected tables detected
30 tables incorrectly detected

This difference might go away when you make the other tests pass, but it's something to keep an eye out for anyway.
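To put the two sets of counts on a common scale, here is a minimal sketch that turns them into recall and precision. It assumes that "expected tables detected" counts true positives and "tables incorrectly detected" counts false positives; the `DetectionStats` class is purely illustrative and is not part of the test suite.

```java
// Sketch only: assumes "expected tables detected" are true positives and
// "tables incorrectly detected" are false positives. DetectionStats is a
// hypothetical helper, not part of tabula-java.
final class DetectionStats {
    final int detected;   // expected tables that were found
    final int expected;   // total expected tables in the suite
    final int incorrect;  // detections that match no expected table

    DetectionStats(int detected, int expected, int incorrect) {
        this.detected = detected;
        this.expected = expected;
        this.incorrect = incorrect;
    }

    // Share of expected tables that were found.
    double recall() {
        return (double) detected / expected;
    }

    // Share of all reported detections that were correct.
    double precision() {
        return (double) detected / (detected + incorrect);
    }

    public static void main(String[] args) {
        DetectionStats branch = new DetectionStats(137, 156, 34);
        DetectionStats master = new DetectionStats(140, 156, 30);
        System.out.printf("branch: recall=%.3f precision=%.3f%n",
                branch.recall(), branch.precision());
        System.out.printf("master: recall=%.3f precision=%.3f%n",
                master.recall(), master.precision());
    }
}
```

Under those assumptions the branch is slightly behind master on both measures, which matches the raw counts above.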

Thanks!

@melisabok
Owner

Yes, I'm working on those tests right now. 4 of the 8 tests that are currently failing come from TestTableDetection: 10, 26, 35 and 46.

It seems that the horizontalRulings and verticalRulings differ, and I guess the cause is the way we generate the images with PDFBox 2.0:

image = Utils.pageConvertToImage(pdfPage, 144, ImageType.GRAY);

But I'm not sure yet. I'll keep you posted.
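One way to narrow this down could be to dump the rulings from both builds and diff them with a small tolerance, so that sub-pixel rendering noise is ignored and only real differences remain. A minimal sketch, assuming rulings can be treated as plain `Line2D.Float` segments; `RulingDiff`, `closeTo` and `missing` are hypothetical names, not existing tabula-java code, and the coordinates are made up for illustration.

```java
import java.awt.geom.Line2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: RulingDiff is a hypothetical helper, not part of tabula-java.
// Rulings are modeled here as plain Line2D.Float segments.
final class RulingDiff {

    // True when all four endpoint coordinates of a and b differ by at most tol.
    static boolean closeTo(Line2D.Float a, Line2D.Float b, float tol) {
        return Math.abs(a.x1 - b.x1) <= tol && Math.abs(a.y1 - b.y1) <= tol
            && Math.abs(a.x2 - b.x2) <= tol && Math.abs(a.y2 - b.y2) <= tol;
    }

    // Returns the rulings in `expected` that have no close match in `actual`.
    static List<Line2D.Float> missing(List<Line2D.Float> expected,
                                      List<Line2D.Float> actual, float tol) {
        List<Line2D.Float> result = new ArrayList<>();
        for (Line2D.Float e : expected) {
            boolean found = false;
            for (Line2D.Float a : actual) {
                if (closeTo(e, a, tol)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates: two rulings from one build, one from the other.
        List<Line2D.Float> before = Arrays.asList(
                new Line2D.Float(10f, 20f, 200f, 20f),
                new Line2D.Float(10f, 40f, 200f, 40f));
        List<Line2D.Float> after = Arrays.asList(
                new Line2D.Float(10.4f, 20f, 200f, 20f));
        System.out.println(missing(before, after, 1.0f).size() + " ruling(s) missing");
    }
}
```

Running `missing` in both directions (old against new, then new against old) would show rulings that disappeared as well as ones that were newly introduced.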

@jazzido
Author

jazzido commented Feb 25, 2017

Thanks!

You're aware of the Debug tool, right? It might be useful for, well, debugging.

  • Build a mega-jar with mvn clean compile assembly:single
  • Example: java -cp target/tabula-0.9.2-jar-with-dependencies.jar technology.tabula.debug.Debug -p 1 -r src/test/resources/technology/tabula/argentina_diputados_voting_record.pdf
  • That will generate a jpg in the same folder as the input PDF

[Image: argentina_diputados_voting_record-1]

There's a bunch of options that you can use in the debugger:

java -cp target/tabula-0.9.2-jar-with-dependencies.jar technology.tabula.debug.Debug -h
usage: tabula-debug [-a <AREA>] [-c] [-d] [-e] [-f] [-g] [-h] [-i] [-l]
       [-n] [-p <PAGES>] [-r] [-s] [-t] [-u]
Generate debugging images
 -a,--area <AREA>           Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire
                            page
 -c,--columns               Show columns as detected by
                            BasicExtractionAlgorithm
 -d,--detected-tables       Show detected tables
 -e,--characters            Show detected characters
 -f,--profile               Show projection profile
 -g,--region                Show provided region (-a parameter)
 -h,--help                  Print this help text.
 -i,--intersections         Show intersections between rulings.
 -l,--cells                 Show detected cells
 -n,--clipping-paths        Show clipping paths
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--rulings               Show detected rulings.
 -s,--spreadsheets          Show detected spreadsheets.
 -t,--textchunks            Show detected text chunks (merged characters)
 -u,--unprocessed-rulings   Show non-cleaned rulings

@melisabok
Owner

This is great! I didn't know about this tool.
Thanks!
