import fitz  # import PyMuPDF
from pymupdf import Page

if not hasattr(fitz.Page, "find_tables"):
    raise RuntimeError("This PyMuPDF version does not support the table feature")


def show_image(item, title=""):
    """Display a pixmap.

    Just to display Pixmap image of "item" - ignore the man behind the curtain.

    Args:
        item: any PyMuPDF object having a "get_pixmap" method.
        title: a string to be used as image title

    Generates an RGB Pixmap from item using a constant DPI and using matplotlib
    to show it inline of the notebook.
    """
    DPI = 150  # use this resolution
    import numpy as np
    import matplotlib.pyplot as plt

    # %matplotlib inline
    pix = item.get_pixmap(dpi=DPI)
    img = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)
    plt.figure(dpi=DPI)  # set the figure's DPI
    plt.title(title)  # set title of image
    _ = plt.imshow(img, extent=(0, pix.w * 72 / DPI, pix.h * 72 / DPI, 0))


doc = fitz.open("sample.pdf")
page: Page = doc[0]
tabs = page.find_tables(edge_min_length=50)  # detect the tables
for i, tab in enumerate(tabs):  # iterate over all tables
    for cell in tab.header.cells:
        page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)
    page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])
    print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")
show_image(page, f"Table & Header BBoxes")
PyMuPDF version
1.24.12
Operating system
Windows
Python version
3.9
In addition to the blue background in the header, the words inside have their own separate / additional background.
As the default detection strategy is to desperately look for any usable information, these word-based backgrounds are also taken into account (there is no color-based check to see whether a word's background color differs from the surrounding background color).
To change this, specify strategy="lines_strict" when calling find_tables. This ignores fill colors and only looks at lines.
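As a rough sketch of that suggestion, the helper below runs find_tables once per strategy and reports the row and column counts of each detected table, so the effect of "lines_strict" can be compared with the default "lines" strategy. The helper name table_shapes is hypothetical and not part of PyMuPDF. It only assumes that the page object offers find_tables(strategy=...) and that the result exposes a tables list whose items have row_count and col_count.

```python
def table_shapes(page, strategies=("lines", "lines_strict")):
    """Return {strategy: [(row_count, col_count), ...]} for one page.

    `page` is expected to be a PyMuPDF Page; any object with a compatible
    `find_tables` method works. This helper is illustrative only.
    """
    shapes = {}
    for strategy in strategies:
        # "lines" is the default strategy; "lines_strict" is the one
        # suggested in the answer above.
        finder = page.find_tables(strategy=strategy)
        shapes[strategy] = [(t.row_count, t.col_count) for t in finder.tables]
    return shapes


# Usage against the attached file (requires PyMuPDF to be installed):
#
#   import fitz
#   doc = fitz.open("sample.pdf")
#   print(table_shapes(doc[0]))
```

If the redundant rows and columns come from the word backgrounds, the counts reported for "lines_strict" should be smaller than those for "lines". This has not been checked against the attached sample.pdf.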
Description of the bug
When using find_tables, I get redundant rows and columns in a merged cell with a colored background, caused by redundantly detected lines.
How can I remove the redundant lines? Please refer to the attached PDF file.
Thank you.
sample.pdf
How to reproduce the bug
Run the script at the top of this issue against the attached sample.pdf.