PDF table.box is inaccurate? #218

Open
grahama1970 opened this issue Sep 23, 2024 · 2 comments

grahama1970 commented Sep 23, 2024

Hi. I'm trying to align the bounding boxes returned by img2table's PDF (text extraction) method, shown below, with PyMuPDF's bounding boxes.
The bounding box from img2table's Image module is reasonably accurate and can be correlated with PyMuPDF's bounding box.
The bounding box from the PDF module is off.
Is this a known issue, or is there a workaround?

PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
img2table bounding box (PDF module): (201, 201, 1503, 1328)

Many thanks in advance.

Extra for debugging:

img2table using the PDF (text extraction) module:

# Extract tables
extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

extracted_tables

The table extracted with img2table is:
bbox = (201, 201, 1503, 1328)

PyMuPDF:

import fitz  # PyMuPDF

doc = fitz.open(pdf_path)
# Note: the range starts at 1, so the first page (index 0) is skipped.
for page_num in range(1, len(doc)):
    tabs = doc[page_num].find_tables()  # detect the tables
    
    # print(page_num, tabs)
    print(doc[page_num].rect.height)
    for i, tab in enumerate(tabs):  # iterate over all tables
        for cell in tab.header.cells:
            doc[page_num].draw_rect(cell,color=fitz.pdfcolor["red"],width=0.3)
        print(f"  Table bbox: {tab.bbox}")
        doc[page_num].draw_rect(tab.bbox,color=fitz.pdfcolor["green"])
        print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")

The table extracted with PyMuPDF is:
bbox = (72.0375, 72.0625, 540.4875, 561.0)

xavctn (Owner) commented Sep 23, 2024

Hello,

As mentioned in the documentation, when processing PDFs, all pages are converted to images using a DPI of 200.
The table coordinates returned by the library are pixel coordinates in that image.

When using PyMuPDF, the coordinates returned are those of the PDF page mediabox, expressed in points (72 per inch).

Here is an example of how I am handling the relationship/conversion between those 2 sets of coordinates.

Hope it helps.
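
The scaling described above can be sketched as a small standalone helper. This is only an illustration, not code from the library: the `BBox` tuple and the `pixels_to_points` name are hypothetical, and it assumes an unrotated page whose mediabox origin is (0, 0), so that a plain 72/200 scale factor is enough.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical container for (x1, y1, x2, y2); not an img2table type."""
    x1: float
    y1: float
    x2: float
    y2: float


RENDER_DPI = 200  # DPI at which PDF pages are rasterised, per the comment above
PDF_DPI = 72      # PDF user space uses 72 points per inch


def pixels_to_points(bbox: BBox, render_dpi: float = RENDER_DPI) -> BBox:
    """Scale a pixel-space bbox from the rendered image into PDF points.

    Assumes no page rotation and a mediabox that starts at (0, 0).
    """
    scale = PDF_DPI / render_dpi
    return BBox(*(round(v * scale, 4) for v in bbox))


# The pixel bbox reported in this issue
print(pixels_to_points(BBox(201, 201, 1503, 1328)))
```

If a page is rotated or its mediabox does not start at the origin, an additional offset or transform would likely be needed on top of this scale factor.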

grahama1970 (Author) commented Sep 23, 2024

It does. Thank you :)

from img2table.document import PDF
from img2table.ocr import TesseractOCR
tesseract_ocr = TesseractOCR(n_threads=1, lang="eng")
pdf_path = '/path/to/pdf'

pdf = PDF(src=pdf_path)

extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

target_dpi = 72
original_dpi = 200
for page, tables in extracted_tables.items():
    for idx, table in enumerate(tables):
        print(page, idx)
        
        original_bbox_dict = {"x1": table.bbox.x1, "y1": table.bbox.y1, "x2": table.bbox.x2, "y2": table.bbox.y2}
        
        pymupdf_bbox_dict = {
            "x1": (table.bbox.x1 * target_dpi) / original_dpi,
            "y1": (table.bbox.y1 * target_dpi) / original_dpi,
            "x2": (table.bbox.x2 * target_dpi) / original_dpi,
            "y2": (table.bbox.y2 * target_dpi) / original_dpi
        }

        print(f'original_bbox_dict: {original_bbox_dict}')
        print(f'pymupdf_bbox_dict: {pymupdf_bbox_dict}')

Result (Accurate):

pymupdf_bbox_dict: {'x1': 72.36, 'y1': 72.36, 'x2': 541.08, 'y2': 478.08}
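
To check the agreement edge by edge rather than by eye, a sketch along these lines may help. Both function names are hypothetical, the 200 DPI and 72 points-per-inch values are taken from this thread, and the inverse conversion again assumes an unrotated page with a mediabox origin at (0, 0).

```python
def points_to_pixels(bbox, render_dpi=200, pdf_dpi=72):
    """Inverse conversion: PDF points to pixel coordinates in the rendered image."""
    scale = render_dpi / pdf_dpi
    return tuple(round(v * scale) for v in bbox)


def edge_deltas(a, b):
    """Absolute difference per edge (x1, y1, x2, y2) between two bboxes."""
    return tuple(round(abs(p - q), 4) for p, q in zip(a, b))


pymupdf_bbox = (72.0375, 72.0625, 540.4875, 561.0)  # from PyMuPDF, in points
converted_bbox = (72.36, 72.36, 541.08, 478.08)     # img2table bbox scaled to points

print(points_to_pixels(pymupdf_bbox))
print(edge_deltas(converted_bbox, pymupdf_bbox))
```

With the values quoted in this thread, x1, y1 and x2 agree to within one point, while y2 does not (478.08 vs 561.0), so a per-edge check like this is worth running on each table.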
