PDF table.box is inaccurate? #218
Comments
Hello. As mentioned in the documentation, when processing PDFs, all pages are converted to images at a DPI of 200. PyMuPDF, on the other hand, returns coordinates relative to the PDF page's mediabox. Here is an example of how I handle the conversion between those two sets of coordinates. Hope it helps.
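A minimal sketch of this kind of conversion, assuming the page is rendered at 200 DPI, that PDF coordinates are in points (72 per inch), and that the page is unrotated with its mediabox origin at (0, 0). The function names `pixels_to_points` and `points_to_pixels` are hypothetical helpers for illustration, not part of img2table or PyMuPDF.

```python
PDF_POINTS_PER_INCH = 72   # standard PDF unit: 1 point = 1/72 inch
RENDER_DPI = 200           # DPI mentioned in this thread for PDF page rendering


def pixels_to_points(bbox, dpi=RENDER_DPI):
    """Convert an (x0, y0, x1, y1) box from image pixels to PDF points."""
    scale = PDF_POINTS_PER_INCH / dpi
    return tuple(round(v * scale, 4) for v in bbox)


def points_to_pixels(bbox, dpi=RENDER_DPI):
    """Convert an (x0, y0, x1, y1) box from PDF points to image pixels."""
    scale = dpi / PDF_POINTS_PER_INCH
    return tuple(round(v * scale) for v in bbox)


# Box reported for the PDF module in this thread, in pixels at 200 DPI.
pixel_bbox = (201, 201, 1503, 1328)
print(pixels_to_points(pixel_bbox))
```

With the values reported in this thread, the pixel box (201, 201, 1503, 1328) maps to roughly (72.36, 72.36, 541.08, 478.08) in points. That is close to the PyMuPDF box on the left, top, and right edges; the bottom edge (478.08 versus 561.0) still differs, so scaling alone may not account for the whole discrepancy.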
It does. Thank you :)
Result (Accurate):
Hi. I'm trying to align the bounding boxes returned by the PDF (text extraction) method below with PyMuPDF's bounding boxes.
The bounding box from img2table's Image module is reasonably accurate and can be correlated with PyMuPDF's bounding box.
The bounding box from the PDF module, however, is off.
Is this a known issue, or is there a work-around?
PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
img2table bounding box (PDF module): (201, 201, 1503, 1328)
Many thanks in advance.
Extra for debugging:
img2table using the PDF (text extraction) module.
The table extracted with img2table is:
bbox = (201, 201, 1503, 1328)
PyMuPDF:
The table extracted with PyMuPDF is:
bbox = (72.0375, 72.0625, 540.4875, 561.0)