JPGs and PNGs images in the PDF #84

Open
1 task done
tulas75 opened this issue Nov 14, 2024 · 0 comments
Labels
bug Something isn't working

tulas75 commented Nov 14, 2024

Initial Checks

  • I confirm that I'm on the latest version

Description

There seems to be a problem when a PDF file contains both PNG and JPG images: in that case the PNGs do not appear to be detected. An example PDF file is attached below.
op.pdf
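
To check which images the PDF actually embeds, independently of open-parse, something like the following sketch could help. It uses PyMuPDF's `page.get_images` and `Document.extract_image`; the helper names `list_embedded_images` and `tally_extensions` are made up for illustration. Note that the extension is whatever PyMuPDF reports for the stored image stream, which may not match the format of the file that was originally placed in the document.

```python
from collections import Counter


def list_embedded_images(pdf_path):
    """Return one dict per embedded image: page number, xref, and extension."""
    import fitz  # PyMuPDF; imported lazily so tally_extensions works without it

    found = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple entry is the image's xref number
                info = doc.extract_image(xref)
                found.append(
                    {"page": page.number + 1, "xref": xref, "ext": info["ext"]}
                )
    return found


def tally_extensions(images):
    """Count images per reported extension."""
    return dict(Counter(img["ext"] for img in images))


# Usage, assuming op.pdf sits in the working directory:
# print(tally_extensions(list_embedded_images("op.pdf")))
```

If this reports more images than open-parse returns as `image` nodes, that would point at the parser rather than at the file.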

Here is the output of the sample code:

a8813dd5-1d88-4c42-9a0f-9a8d7149d3b6
['text']
List of metropolitan areas in Europe
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

6f2a69c7-33a3-414e-aeba-20a03b78fc68
['text']
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

e43c1deb-d81e-4685-a773-89ee5368f65e
['image', 'text']
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Number of chunks: 3

Example Code

import json

import openparse

basic_doc_path = "op.pdf"

# Use PyMuPDF for table detection and emit tables as markdown.
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",
    }
)
parsed_basic_doc = parser.parse(basic_doc_path)

# Round-trip through JSON to get each node as a plain dict.
chunks = json.loads(parsed_basic_doc.model_dump_json())
for node in chunks["nodes"]:
    print(node["node_id"])
    print(node["variant"])
    print(node["text"])

print("Number of chunks:", len(parsed_basic_doc.nodes))
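
A compact way to compare the parser's result with what the PDF is expected to contain is to tally the variants across all nodes. The sketch below assumes each node dict has a `variant` list, as in the output above; `count_variants` is a hypothetical helper, not part of open-parse.

```python
from collections import Counter


def count_variants(nodes):
    """Tally how many nodes carry each variant (a node may carry several)."""
    counts = Counter()
    for node in nodes:
        counts.update(node["variant"])
    return dict(counts)


# Variants copied from the three nodes in the output above.
sample_nodes = [
    {"variant": ["text"]},
    {"variant": ["text"]},
    {"variant": ["image", "text"]},
]
print(count_variants(sample_nodes))  # {'text': 3, 'image': 1}
```

With the real document, the same call would be `count_variants(chunks["nodes"])`.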

Python, open-parse & OS Version

python_version: 3.11.2
             operating_system: Linux
                   os_version: 6.1.0-27-amd64
           open-parse version: 0.7.0
                 install path: /home/tulas/Projects/tmpop/env/lib/python3.11/site-packages/openparse
               python version: 3.11.2 (main, Sep 14 2024, 03:00:30) [GCC 12.2.0]
                     platform: Linux-6.1.0-27-amd64-x86_64-with-glibc2.36
             related packages: pydantic-2.9.2 PyMuPDF-1.24.13 tokenizers-0.19.1
tulas75 added the bug label Nov 14, 2024