Add new recap PDF extraction endpoint #190

Merged 22 commits on May 30, 2024
Commits
35c25d2
feat(text_extraction): Add text extraction api
flooie May 28, 2024
bd2c7cd
tests(recap) Add recap tests for new endpoint
flooie May 28, 2024
c4514eb
docs(DEVELOPMENT) Fix doc docker call
flooie May 28, 2024
5cf0869
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 28, 2024
b02b7aa
fix(text): rename deskew to is_skewed
flooie May 29, 2024
0078817
Merge branch 'add-recap-extraction' of https://github.com/freelawproj…
flooie May 29, 2024
bd3aace
fix(text): Update docstrings ocr image to data
flooie May 29, 2024
c246bef
fix(text_extraction): Explain get_word
flooie May 29, 2024
f9c0b3d
fix(text_extraction): Update formatting and docstrings
flooie May 29, 2024
8d2dcbf
feat(tasks): Drop mojibake fix as unlikely to be needed
flooie May 29, 2024
6d7fe01
fix(adjust_caption): Update adjust caption
flooie May 29, 2024
0b30bb9
tests(text_extraction): Add unit tests for new methods
flooie May 29, 2024
1078bb9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 29, 2024
0a2cd5e
test(caption Adjustment): Add new test class
flooie May 29, 2024
098ef99
Merge branch 'add-recap-extraction' of https://github.com/freelawproj…
flooie May 29, 2024
9c1c44c
test(workflows) Add v3.11 and v3.12 to tests
flooie May 29, 2024
7a29189
test(adjustment) Add fix for test
flooie May 29, 2024
0d12f95
docs(readme) Add endpoint updates
flooie May 30, 2024
5d19f57
fix(text_extract) Updates from PR
flooie May 30, 2024
5801d2d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 30, 2024
979adf3
fix(text): Fix variable value
flooie May 30, 2024
47d4a04
fix(tests): Remove print in tests
flooie May 30, 2024
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -13,7 +13,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10"]
+        python-version: ["3.10", "3.11", "3.12"]
     steps:
       - uses: actions/checkout@v2
       - name: Set up Python ${{ matrix.python-version }}
2 changes: 1 addition & 1 deletion DEVELOPING.md
@@ -19,7 +19,7 @@ If you want to see debug logs, set `DEBUG` to `True` in `settings.py`.
 Once the above compose file is running, you can use the `mock_web_app`
 container to run the tests against the `doctor` container:

-    docker exec -it mock_web_app_doctor python3 -m unittest doctor.tests
+    docker exec -it mock_web_app python3 -m unittest doctor.tests


 ## Building Images
19 changes: 19 additions & 0 deletions README.md
@@ -100,6 +100,25 @@ Valid requests will receive a JSON response with the following keys:
- `extracted_by_ocr`: Whether OCR was needed and used during processing.
- `page_count`: The number of pages, if it applies.

### Endpoint: /extract/recap/text/

Given a RECAP PDF, extract the text using pdfplumber, OCR, or a combination of the two.

Parameters:

- `strip_margin`: Whether doctor should crop the edges of the RECAP document during processing. With pdfplumber, text in the traditional 1-inch margin is ignored. With OCR, it lowers the threshold for hiding OCR gibberish. To enable it, set `strip_margin` to `True`:

```bash
curl 'http://localhost:5050/extract/recap/text/?strip_margin=True' \
-X 'POST' \
-F "file=@doctor/recap_extract/gov.uscourts.cacd.652774.40.0.pdf"
```

Valid requests will receive a JSON response with the following keys:

- `content`: The UTF-8 encoded text of the file.
- `extracted_by_ocr`: Whether OCR was needed and used during processing.
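
The same request can also be made from Python. The following is a minimal sketch using only the standard library; it is not code from this PR. The helper names `build_recap_request`, `parse_recap_response`, and `extract_recap_text` are hypothetical, the base URL is taken from the curl example, and the sketch assumes the service accepts the PDF as a multipart form field named `file`, as the curl call suggests.

```python
import json
import os
import urllib.request
import uuid


def build_recap_request(base_url, pdf_bytes, filename, strip_margin=False):
    """Build (but do not send) a multipart POST for /extract/recap/text/.

    Hypothetical helper: the field name "file" and the query parameter
    mirror the curl example and are otherwise assumptions.
    """
    # Only add the flag when enabling it, as in the curl example.
    query = "?strip_margin=True" if strip_margin else ""
    url = f"{base_url.rstrip('/')}/extract/recap/text/{query}"

    # Hand-rolled multipart/form-data body with a single file part.
    boundary = uuid.uuid4().hex
    body = b"".join(
        [
            f"--{boundary}\r\n".encode(),
            (
                'Content-Disposition: form-data; name="file"; '
                f'filename="{filename}"\r\n'
            ).encode(),
            b"Content-Type: application/pdf\r\n\r\n",
            pdf_bytes,
            f"\r\n--{boundary}--\r\n".encode(),
        ]
    )
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_recap_response(raw):
    """Return the two documented keys from the JSON response body."""
    payload = json.loads(raw)
    return payload["content"], payload["extracted_by_ocr"]


def extract_recap_text(base_url, pdf_path, strip_margin=False):
    """Send the PDF at pdf_path to a running doctor instance."""
    with open(pdf_path, "rb") as fh:
        request = build_recap_request(
            base_url, fh.read(), os.path.basename(pdf_path), strip_margin
        )
    with urllib.request.urlopen(request) as response:
        return parse_recap_response(response.read())
```

Building the request separately from sending it keeps the URL and multipart body inspectable without a running doctor instance; `extract_recap_text("http://localhost:5050", "path/to/file.pdf", strip_margin=True)` would perform the actual call.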


## Utilities

1 change: 1 addition & 0 deletions doctor/forms.py
@@ -95,6 +95,7 @@ def clean(self):
 class DocumentForm(BaseFileForm):
     ocr_available = forms.BooleanField(label="ocr-available", required=False)
     mime = forms.BooleanField(label="mime", required=False)
+    strip_margin = forms.BooleanField(label="strip-margin", required=False)

     def clean(self):
         self.clean_file()