adding logic to trim pages that are too large to process #211

Merged 1 commit on Nov 22, 2024


Contributor

@jordan-homan jordan-homan commented Nov 21, 2024

Notes

Improves the client's handling of very long PDF pages: the page's x/y coordinates are trimmed down to a reasonable size (hi-res only). Note that this does not affect the text output: the reader is still able to process the entire page for text.
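For illustration only, the trimming idea can be sketched as clamping a page rectangle to a maximum side length. This is not the code from this PR: `Box`, `trim_box`, and the `MAX_PAGE_LENGTH` value are hypothetical names and numbers chosen for the example, and keeping the lower-left corner fixed is an arbitrary choice.

```python
from typing import NamedTuple

# Hypothetical limit in PDF points; the PR does not state the actual threshold.
MAX_PAGE_LENGTH = 4000.0


class Box(NamedTuple):
    """A page rectangle: lower-left (x0, y0) and upper-right (x1, y1)."""

    x0: float
    y0: float
    x1: float
    y1: float


def trim_box(box: Box, max_length: float = MAX_PAGE_LENGTH) -> Box:
    """Return a copy of `box` whose width and height are at most `max_length`.

    The lower-left corner stays fixed and the upper-right corner is pulled in.
    That anchoring is an assumption of this sketch, not something the PR states.
    """
    width = min(box.x1 - box.x0, max_length)
    height = min(box.y1 - box.y0, max_length)
    return Box(box.x0, box.y0, box.x0 + width, box.y0 + height)


# A letter-width page that is far taller than the limit.
long_page = Box(0.0, 0.0, 612.0, 20000.0)
print(trim_box(long_page))
```

A page that already fits within the limit is returned unchanged, so a helper like this could be applied to every page without a separate size check.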

Testing

Manually tested the changes on a large file. Added an integration test verifying that large pages now process successfully.

@jordan-homan jordan-homan force-pushed the add_page_split_logic_pdf branch 3 times, most recently from 3608682 to 3e15249 Compare November 21, 2024 16:20
@jordan-homan jordan-homan marked this pull request as ready for review November 21, 2024 17:04
@jordan-homan jordan-homan changed the title adding logic to split pages that are too large to process adding logic to trim pages that are too large to process Nov 22, 2024

@Klaijan Klaijan left a comment

Ran the test after installing the checked-out PR locally with `pip install -v -e .`:

INFO: HTTP Request: GET https://api.unstructuredapp.io/general/docs "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
Runtime type is 'ModelMetaclass'
{'type': 'UncategorizedText', 'element_id': '0607d9a606c4a0d5355c730cea79e38a', 'text': '🔥', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'super_long_pages.pdf'}}

@jordan-homan jordan-homan merged commit 2082d4f into main Nov 22, 2024
13 checks passed
@jordan-homan jordan-homan deleted the add_page_split_logic_pdf branch November 22, 2024 19:28