Improving Table Performance #26

brianjking · 2024-04-15T23:48:09Z

Initial Checks

I confirm that I'm on the latest version

Description

I'm trying to use the UniTable support (https://filimoa.github.io/open-parse/processing/parsing-tables/unitable/) to extract content from a UB-04 document. I added !pip install "openparse[ml]" and !openparse-download to the notebook, but I'm not sure what else is required.

Thanks!

Example Code

No response


Filimoa · 2024-04-15T23:54:57Z

Could you provide the code you're running?

brianjking · 2024-04-16T00:01:19Z

@Filimoa Sorry, it's your notebook file which I added to.

I'm simply trying to identify the best way to extract text using OpenParse from a document like this:

UB04_empty.pdf

# -*- coding: utf-8 -*-
"""unitable.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1Sl-RQTv7Kw_2c2ymr8W_Fpz_HWKdK3v4

## Unitable

When table detection works, this yields perfect results for incredibly challenging tables. Unfortunately, we still need to use table-transformers for bounding box detection, and its performance leaves something to be desired.

I spoke with the UniTable team and they might implement this. Their challenge lies in PubTables-1M having poor groundtruth annotations.

If you're aware of a better table detection model, please let us know - theoretically this task should be significantly easier than content-extraction.

## Notebook

This notebook demonstrates using unitable to extract some challenging tables we've seen. This is meant to really push the limits of the model.

You will need to use git LFS to download the sample data.
"""

!pip install "openparse[ml]"

!openparse-download

import sys
from pathlib import Path

sys.path.append("..")

import openparse

pdfs_with_tables_dir = Path("/content/pdf")

for pdf_path in pdfs_with_tables_dir.glob("*"):
    parser = openparse.DocumentParser(
        table_args={
            "parsing_algorithm": "unitable",
            "min_table_confidence": 0.8,
        },
        processing_pipeline=None,
    )
    parsed_nodes = parser.parse(pdf_path)
    table_nodes = [node for node in parsed_nodes.nodes if "table" in node.variant]

    if not table_nodes:
        print(f"Could not find tables on {pdf_path}")
        continue

    doc = openparse.Pdf(file=pdf_path)
    doc.display_with_bboxes(table_nodes)

Filimoa · 2024-04-16T00:19:48Z

Thanks, I'll try running this myself soon and add it to the eval suite. If I had to guess, the table-transformers detection is performing poorly (unitable does content extraction; table-transformers still does table bbox detection). This is especially a problem on massive full-page tables like this one.
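One way to check whether detection is the bottleneck is to rerun the same file at several detection thresholds and count the table nodes that come back. The sketch below is a rough diagnostic, not a fix. `sweep_table_args`, `count_table_nodes`, and `run_sweep` are hypothetical helper names, the threshold values are arbitrary, and the only openparse calls used are the ones already shown in the notebook above (`DocumentParser`, `parse`, `nodes`, `variant`).

```python
from pathlib import Path


def sweep_table_args(thresholds, algorithm="unitable"):
    """Build one table_args dict per detection threshold (hypothetical helper)."""
    return [
        {"parsing_algorithm": algorithm, "min_table_confidence": t}
        for t in thresholds
    ]


def count_table_nodes(pdf_path, table_args):
    """Parse one PDF and count the nodes tagged as tables (hypothetical helper)."""
    import openparse  # imported lazily so sweep_table_args works without it

    parser = openparse.DocumentParser(
        table_args=table_args,
        processing_pipeline=None,
    )
    parsed = parser.parse(pdf_path)
    return sum(1 for node in parsed.nodes if "table" in node.variant)


def run_sweep(pdf_path, thresholds=(0.8, 0.5, 0.2)):
    """Print the table-node count at each threshold (values chosen arbitrarily)."""
    for args in sweep_table_args(thresholds):
        n = count_table_nodes(pdf_path, args)
        print(f"min_table_confidence={args['min_table_confidence']}: {n} table node(s)")
```

Calling `run_sweep(Path("/content/pdf/UB04_empty.pdf"))` would run it on the sample file, assuming the file was uploaded under that name. If the count stays at zero even at a low threshold, that would point at the detection stage rather than at unitable.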

brianjking · 2024-04-16T02:23:08Z

@Filimoa So just to be clear, the notebook code I posted should have done the extraction, similar to the quickstart, and it's the SOTA method OpenParse supports, right?

brianjking · 2024-04-16T02:43:18Z

I added it to the quickstart, and it tells me there are no tables in the file, which is clearly wrong.

import sys
from pathlib import Path

sys.path.append("..")

import openparse

pdfs_with_tables_dir = Path("/content/pdf")

for pdf_path in pdfs_with_tables_dir.glob("*"):
    parser = openparse.DocumentParser(
        table_args={
            "parsing_algorithm": "unitable",
            "min_table_confidence": 0.8,
        },
        processing_pipeline=None,
    )
    parsed_nodes = parser.parse(pdf_path)
    table_nodes = [node for node in parsed_nodes.nodes if "table" in node.variant]

    if not table_nodes:
        print(f"Could not find tables on {pdf_path}")
        continue

    doc = openparse.Pdf(file=pdf_path)
    doc.display_with_bboxes(table_nodes)

Filimoa · 2024-04-16T03:59:57Z

I haven't had a chance to test this myself, but to clarify: yes, unitable achieves SOTA performance on converting table images to HTML. Unfortunately, it's trained on perfectly cropped tables, so we're still forced to rely on table-transformers to detect bounding boxes. If table-transformers doesn't find a table, unitable is never invoked.

I'm actively looking for something with better performance. I haven't had a chance to look into this very deeply, but intuitively, detection seems like a much simpler task than the second stage.
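The two-stage gating described in this comment can be sketched in plain Python. This is an illustration of the idea only, not openparse's internal code: `Detection`, `extract_tables`, `fake_detector`, and `fake_extractor` are hypothetical names, and the `>=` comparison against the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Detection:
    bbox: BBox
    confidence: float


def extract_tables(
    page,
    detect: Callable[[object], List[Detection]],
    extract: Callable[[object, BBox], str],
    min_table_confidence: float = 0.8,
) -> List[str]:
    """Stage 1 proposes table boxes; stage 2 runs only on boxes that pass the threshold."""
    kept = [d for d in detect(page) if d.confidence >= min_table_confidence]
    # With no surviving detections, the content extractor is never called.
    return [extract(page, d.bbox) for d in kept]


# Stand-in models for illustration only.
def fake_detector(page):
    # One full-page box with a confidence below the 0.8 default.
    return [Detection(bbox=(0.0, 0.0, 612.0, 792.0), confidence=0.55)]


def fake_extractor(page, bbox):
    return "<table>...</table>"
```

With these stand-ins, `extract_tables("page", fake_detector, fake_extractor)` returns an empty list because the only detection scores 0.55, while passing `min_table_confidence=0.5` lets the box through to the extractor. However good the second stage is, it cannot recover a table the first stage drops.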

brianjking · 2024-04-16T12:27:34Z

@Filimoa table-transformers sees the table before this.

Filimoa · 2024-04-16T22:42:05Z

I'm working on shipping a llama-index integration, and then I will spend some time improving table performance. There's a lot of low-hanging fruit here.

TKaluza · 2024-07-28T10:36:22Z

@Filimoa do you know this: https://github.com/huridocs/pdf-document-layout-analysis?tab=readme-ov-file

Filimoa · 2024-07-28T21:10:55Z

@TKaluza No, I'm not familiar with it. Thanks for dropping the link, I'll check it out.

brianjking added the bug Something isn't working label Apr 15, 2024

Filimoa changed the title ~~How to use Unitable Notebook?~~ Improving Table Performance Apr 17, 2024
