PDF table.box is inaccurate? #218

grahama1970 · 2024-09-23T13:06:08Z

Hi. I'm trying to align the bounding boxes returned by img2table's PDF (text extraction) method below with PyMuPDF's bounding boxes.
The bounding box from img2table's Image module is reasonably accurate and can be correlated with PyMuPDF's bounding box.
The bounding box from the PDF module is off.
Is this a known issue, or is there a workaround?

PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
img2table bounding box (PDF module): (201, 201, 1503, 1328)

Much appreciation in advance

Extra for debugging:

img2table using the PDF (text extraction) module:

# Extract tables
extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

extracted_tables

The table extracted by img2table is:
bbox = (201, 201, 1503, 1328)

PyMuPDF:

import fitz  # PyMuPDF

doc = fitz.open(pdf_path)
# Note: starting at index 1 skips the first page (index 0)
for page_num in range(1, len(doc)):
    page = doc[page_num]
    tabs = page.find_tables()  # detect the tables

    print(page.rect.height)  # page height in PDF points
    for i, tab in enumerate(tabs):  # iterate over all tables
        # outline each header cell in red
        for cell in tab.header.cells:
            page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)
        print(f"  Table bbox: {tab.bbox}")
        # outline the whole table in green
        page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])
        print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")

The table extracted with PyMuPDF is:
bbox = (72.0375, 72.0625, 540.4875, 561.0)

xavctn · 2024-09-23T13:14:19Z

Hello,

As mentioned in the documentation, when processing PDFs, all pages are converted to images at a DPI of 200.
The table coordinates returned by the library are pixel coordinates in that image.

PyMuPDF, by contrast, returns coordinates relative to the PDF page mediabox, expressed in PDF points (72 per inch).

Here is an example of how I am handling the relationship/conversion between those 2 sets of coordinates.

Hope it helps.
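
As a minimal sketch of the scaling described above: a pixel coordinate in the 200 DPI image maps to PDF points by multiplying by 72/200. The names `BBox` and `pixels_to_points` are hypothetical helpers for illustration, not part of img2table or PyMuPDF, and the sketch assumes an unrotated page whose visible area starts at the mediabox origin.

```python
from typing import NamedTuple

PDF_POINTS_PER_INCH = 72   # default PDF user-space unit
IMG2TABLE_PDF_DPI = 200    # rendering DPI stated in this thread


class BBox(NamedTuple):
    """Hypothetical container for (x1, y1, x2, y2) coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def pixels_to_points(bbox: BBox, dpi: float = IMG2TABLE_PDF_DPI) -> BBox:
    """Scale pixel coordinates from a page image rendered at `dpi` to PDF points.

    Assumption: no page rotation and no offset between the rendered
    image and the page origin.
    """
    scale = PDF_POINTS_PER_INCH / dpi
    return BBox(*(round(value * scale, 4) for value in bbox))


# The bbox reported by the PDF module in the original question
converted = pixels_to_points(BBox(201, 201, 1503, 1328))
print(converted)
```

The same factor applies to every coordinate, so cell-level boxes can be converted with the same helper.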

grahama1970 · 2024-09-23T13:45:59Z

It does. Thank you :)

from img2table.document import PDF
from img2table.ocr import TesseractOCR
tesseract_ocr = TesseractOCR(n_threads=1, lang="eng")
pdf_path = '/path/to/pdf'

pdf = PDF(src=pdf_path)

extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

target_dpi = 72
original_dpi = 200
for page, tables in extracted_tables.items():
    for idx, table in enumerate(tables):
        print(page, idx)
        
        original_bbox_dict = {"x1": table.bbox.x1, "y1": table.bbox.y1, "x2": table.bbox.x2, "y2": table.bbox.y2}
        
        pymupdf_bbox_dict = {
            "x1": (table.bbox.x1 * target_dpi) / original_dpi,
            "y1": (table.bbox.y1 * target_dpi) / original_dpi,
            "x2": (table.bbox.x2 * target_dpi) / original_dpi,
            "y2": (table.bbox.y2 * target_dpi) / original_dpi
        }

        print(f'original_bbox_dict: {original_bbox_dict}')
        print(f'pymupdf_bbox_dict: {pymupdf_bbox_dict}')

Result (Accurate):

pymupdf_bbox_dict: {'x1': 72.36, 'y1': 72.36, 'x2': 541.08, 'y2': 478.08}
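
The conversion can also be run in the other direction, for example to locate a PyMuPDF bounding box on the 200 DPI page image. The following is a sketch only: `points_to_pixels` is a hypothetical helper, and it assumes the same unrotated page and shared origin as the loop above.

```python
from typing import Tuple

PDF_POINTS_PER_INCH = 72  # default PDF user-space unit


def points_to_pixels(
    bbox: Tuple[float, float, float, float], dpi: float = 200
) -> Tuple[int, ...]:
    """Scale a bbox in PDF points to integer pixel coordinates at `dpi`.

    Assumption: no page rotation and no offset between the page
    origin and the rendered image.
    """
    return tuple(round(value * dpi / PDF_POINTS_PER_INCH) for value in bbox)


# The PyMuPDF bbox reported in the original question
pymupdf_bbox = (72.0375, 72.0625, 540.4875, 561.0)
pixel_bbox = points_to_pixels(pymupdf_bbox)
print(pixel_bbox)
```

Rounding to whole pixels is a design choice for indexing into an image array; drop the `round` call if sub-pixel values are needed.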
