Get redundant rows, columns in merged cell with color background #4011

dviettu134 · 2024-11-01T08:41:10Z

Description of the bug

When using find_tables, I get redundant rows and columns in a merged cell with a colored background, apparently because redundant lines are detected.

How can I remove the redundant lines? Please refer to the attached PDF file.
Thank you.

sample.pdf

How to reproduce the bug

import fitz  # import PyMuPDF
from pymupdf import Page
if not hasattr(fitz.Page, "find_tables"):
    raise RuntimeError("This PyMuPDF version does not support the table feature")

def show_image(item, title=""):
    """Display a pixmap.

    Just to display Pixmap image of "item" - ignore the man behind the curtain.

    Args:
        item: any PyMuPDF object having a "get_pixmap" method.
        title: a string to be used as image title

    Generates an RGB Pixmap from item using a constant DPI and using matplotlib
    to show it inline of the notebook.
    """
    DPI = 150  # use this resolution
    import numpy as np
    import matplotlib.pyplot as plt

    # %matplotlib inline
    pix = item.get_pixmap(dpi=DPI)
    img = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)
    plt.figure(dpi=DPI)  # set the figure's DPI
    plt.title(title)  # set title of image
    _ = plt.imshow(img, extent=(0, pix.w * 72 / DPI, pix.h * 72 / DPI, 0))

doc = fitz.open("sample.pdf")
page: Page = doc[0]

tabs = page.find_tables(edge_min_length=50)  # detect the tables
for i,tab in enumerate(tabs):  # iterate over all tables
    for cell in tab.header.cells:
        page.draw_rect(cell,color=fitz.pdfcolor["red"],width=0.3)
    page.draw_rect(tab.bbox,color=fitz.pdfcolor["green"])
    print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")
    
show_image(page, "Table & Header BBoxes")  # plain string: no f-string placeholders needed

PyMuPDF version

1.24.12

Operating system

Windows

Python version

3.9


JorjMcKie · 2024-11-01T12:06:10Z

In addition to the blue background in the header, the words inside have their own separate / additional background.
As the default detection strategy is to desperately look for any usable information, these word-based backgrounds are also taken into account (there is no color-based check to see if the word background color is any different from other background color).
To change this, specify strategy="lines_strict". This will ignore fill colors and only look at lines.
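For anyone who wants to see the effect on their own file, here is a minimal sketch that runs the detection once per strategy and reports the table shapes. The helper names `compare_strategies` and `report` are made up for illustration; `find_tables(strategy=...)`, `TableFinder.tables`, `Table.row_count` and `Table.col_count` are the PyMuPDF names as I understand them. The actual counts depend on the PDF, so none are shown here.

```python
def compare_strategies(page, strategies=("lines", "lines_strict"), **kwargs):
    """Run page.find_tables() once per strategy.

    Returns a dict mapping each strategy name to a list of
    (row_count, col_count) tuples, one tuple per detected table.
    Extra keyword arguments (e.g. edge_min_length) are passed through.
    """
    summary = {}
    for strategy in strategies:
        finder = page.find_tables(strategy=strategy, **kwargs)
        summary[strategy] = [(t.row_count, t.col_count) for t in finder.tables]
    return summary


def report(path="sample.pdf"):
    """Hypothetical usage: print the table shapes found on the first page."""
    import pymupdf  # imported here so the helper above works without it

    doc = pymupdf.open(path)
    for strategy, shapes in compare_strategies(doc[0]).items():
        print(f"{strategy}: {shapes}")
```

Calling `report("sample.pdf")` prints one line per strategy. If the extra rows and columns come from the word backgrounds, the "lines_strict" shapes should be the smaller ones.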

JorjMcKie added the "not a bug" (not a bug / user error / unable to reproduce) label Nov 1, 2024

JorjMcKie closed this as completed Nov 1, 2024
