Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new recap PDF extraction endpoint #190

Merged
merged 22 commits into from
May 30, 2024
Merged

Add new recap PDF extraction endpoint #190

merged 22 commits into from
May 30, 2024

Conversation

flooie
Copy link
Collaborator

@flooie flooie commented May 28, 2024

@mlissner

This PR adds a new api endpoint specifically for RECAP Documents.

Instead of versioning the document extraction we streamlined a new API endpoint for RECAP documents.
They don't come as RTF or Txt or mp3 so it just handles PDFs.
This endpoint also drops ocr_available flag as it will no longer be useful

@flooie flooie changed the title Add recap extraction Add new recap PDF extraction endpoint May 28, 2024
@flooie flooie requested a review from mlissner May 28, 2024 21:42
Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I went back and looked at the comments in #187, and tried to move them here. There's a handful that still aren't resolved.

I would have preferred to have that PR be continued instead of opening a new one because this way I have to do an entirely new review and move things over. It takes a lot more time.

Anyhow, that's done and we've got some comments in the attached. Mostly we still need explanations of heuristic things you're doing and some tests would be helpful too.

The only other thing is that with the new endpoint, we need a bit of documentation:

  • Can you please add something to the readme?

doctor/tasks.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Show resolved Hide resolved
doctor/lib/text_extraction.py Show resolved Hide resolved
@flooie
Copy link
Collaborator Author

flooie commented May 29, 2024

@mlissner a second pass would be appreciated

@flooie flooie requested a review from mlissner May 29, 2024 19:55
Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now we're on the homestretch. I found a few little tweaks, and we still need an update to the readme, but nothing substantive.

The tests are a huge help to both understanding the code and making sure it doesn't regress. Thanks!

doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
doctor/lib/text_extraction.py Outdated Show resolved Hide resolved
flooie and others added 3 commits May 30, 2024 11:03
Small tweaks to doc strings
Update func names
slight refactor of adjust caption lines
@flooie
Copy link
Collaborator Author

flooie commented May 30, 2024

Thanks for the comments @mlissner - back to you

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the follow through as always!

:shipit:

@mlissner mlissner merged commit 5f30530 into main May 30, 2024
10 checks passed
@mlissner mlissner deleted the add-recap-extraction branch May 30, 2024 16:20
Copy link

sentry-io bot commented Jun 5, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ ValueError: Bounding box (0, 144.0, 1224, 1440.0) is not fully within parent page bounding box (0, 0, 1224, 792) /extract/recap/text/ View Issue
  • ‼️ PSEOF: Unexpected EOF /extract/recap/text/ View Issue
  • ‼️ PDFSyntaxError: No /Root object! - Is this really a PDF? /extract/recap/text/ View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants