No nodes are extracted from some PDFs #85

faileon · 2024-11-16T21:53:30Z

Initial Checks

I confirm that I'm on the latest version

Description

I've noticed that when I split my PDF via Firefox to get a smaller PDF (e.g. the first 10 pages), openparse won't extract any nodes. The original PDF is extracted fine.

When I specify table_args, the parser does return some nodes, but all of them are identified as tables.

I am attaching the PDF; perhaps someone could have a look at what's wrong.
concept-vp4360-cz.pdf
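A minimal reproduction sketch for the two cases described above, which may help anyone comparing results. It assumes the usual `openparse.DocumentParser(...).parse(...)` entry point and that each returned node exposes a `variant` attribute (a set such as `{"text"}` or `{"table"}`). The helper names and the `parsing_algorithm` value are illustrative assumptions, not the exact settings used in this report.

```python
from collections import Counter


def summarize_variants(nodes):
    """Count how many nodes carry each variant label (e.g. 'text', 'table').

    Assumes each node has a `variant` attribute that is a set of strings or a
    single string; nodes without one are counted under 'unknown'.
    """
    counts = Counter()
    for node in nodes:
        variants = getattr(node, "variant", None) or {"unknown"}
        if isinstance(variants, str):
            variants = {variants}
        for variant in variants:
            counts[str(variant)] += 1
    return dict(counts)


def parse_and_summarize(pdf_path, table_args=None):
    """Parse a PDF with open-parse and return the per-variant node counts."""
    # Imported lazily so the helper above can be used without open-parse installed.
    import openparse

    parser = openparse.DocumentParser(table_args=table_args)
    parsed = parser.parse(pdf_path)
    return summarize_variants(parsed.nodes)


# Hypothetical usage against the attached file:
# parse_and_summarize("concept-vp4360-cz.pdf")
# parse_and_summarize("concept-vp4360-cz.pdf",
#                     table_args={"parsing_algorithm": "pymupdf"})  # assumed value
```

Running both calls on the original and on the Firefox-split PDF should show whether the split file yields an empty result by default and table-only nodes once `table_args` is set.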

Example Code

No response

Python, open-parse & OS Version

python_version: 3.12.7
operating_system: Linux
os_version: 6.11.8-arch1-2
open-parse version: 0.7.0
python version: 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910]
platform: Linux-6.11.8-arch1-2-x86_64-with-glibc2.40
related packages: torchvision-0.20.1 tokenizers-0.20.3 torch-2.5.1 pydantic-2.9.2 PyMuPDF-1.24.13 transformers-4.46.2


faileon added the bug label (Something isn't working) on Nov 16, 2024
