Get redundant rows, columns in merged cell with color background #4011

Closed
dviettu134 opened this issue Nov 1, 2024 · 1 comment
Labels
not a bug not a bug / user error / unable to reproduce

Comments

@dviettu134

Description of the bug

When using find_tables, I get redundant rows and columns in merged cells that have a colored background. They appear to be caused by redundant detected lines.

image

image

How can I remove the redundant lines? Please refer to the attached PDF file.
Thank you.

sample.pdf

How to reproduce the bug

import fitz  # import PyMuPDF
from pymupdf import Page
if not hasattr(fitz.Page, "find_tables"):
    raise RuntimeError("This PyMuPDF version does not support the table feature")

def show_image(item, title=""):
    """Display a pixmap.

    Just to display Pixmap image of "item" - ignore the man behind the curtain.

    Args:
        item: any PyMuPDF object having a "get_pixmap" method.
        title: a string to be used as image title

    Generates an RGB Pixmap from item using a constant DPI and using matplotlib
    to show it inline of the notebook.
    """
    DPI = 150  # use this resolution
    import numpy as np
    import matplotlib.pyplot as plt

    # %matplotlib inline
    pix = item.get_pixmap(dpi=DPI)
    img = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)
    plt.figure(dpi=DPI)  # set the figure's DPI
    plt.title(title)  # set title of image
    _ = plt.imshow(img, extent=(0, pix.w * 72 / DPI, pix.h * 72 / DPI, 0))

doc = fitz.open("sample.pdf")
page: Page = doc[0]

tabs = page.find_tables(edge_min_length=50)  # detect the tables
for i, tab in enumerate(tabs):  # iterate over all tables
    for cell in tab.header.cells:
        if cell is not None:  # skip empty header cells
            page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)
    page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])
    print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")

show_image(page, "Table & Header BBoxes")

PyMuPDF version

1.24.12

Operating system

Windows

Python version

3.9

@JorjMcKie added the "not a bug" label Nov 1, 2024
@JorjMcKie
Collaborator

In addition to the blue background of the header, the words inside it have their own separate background fills.
The default detection strategy uses every piece of information it can find, so these word-level backgrounds are also taken into account (there is no color-based check of whether a word's background color differs from the surrounding background color).
To change this, specify strategy="lines_strict". This ignores fill colors and only looks at lines.
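
To see the effect on your own file, a minimal sketch along these lines may help. It runs find_tables once per strategy and reports the grid size of each detected table, so the extra rows and columns from the default strategy should show up as a larger grid. The helper names compare_strategies and demo are made up for this illustration. The edge_min_length=50 value and the sample.pdf file name are taken from the report above.

```python
def compare_strategies(page, strategies=("lines", "lines_strict"), **kwargs):
    """Run page.find_tables() once per strategy.

    Returns a dict mapping each strategy name to a list of
    (row_count, col_count) tuples, one per detected table.
    Extra keyword arguments are passed through to find_tables().
    """
    report = {}
    for strategy in strategies:
        finder = page.find_tables(strategy=strategy, **kwargs)
        report[strategy] = [(tab.row_count, tab.col_count) for tab in finder.tables]
    return report


def demo(path="sample.pdf"):
    """Print the comparison for the first page of `path` (hypothetical helper)."""
    import pymupdf  # imported lazily; only this function needs PyMuPDF

    doc = pymupdf.open(path)
    page = doc[0]
    for strategy, shapes in compare_strategies(page, edge_min_length=50).items():
        print(f"{strategy}: {shapes}")
```

Calling demo("sample.pdf") prints one line per strategy. If the "lines_strict" grid matches the visible table layout, pass strategy="lines_strict" in the original script as well.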
