freelawproject · mlissner · May 30, 2024 · May 28, 2024
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -13,7 +13,7 @@ jobs:
  runs-on: ubuntu-latest
  strategy:
  matrix:
- python-version: ["3.10"]
+ python-version: ["3.10", "3.11", "3.12"]
  steps:
  - uses: actions/checkout@v2
  - name: Set up Python ${{ matrix.python-version }}

diff --git a/DEVELOPING.md b/DEVELOPING.md
@@ -19,7 +19,7 @@ If you want to see debug logs, set `DEBUG` to `True` in `settings.py`.
 Once the above compose file is running, you can use the `mock_web_app`
 container to run the tests against the `doctor` container:
 
- docker exec -it mock_web_app_doctor python3 -m unittest doctor.tests
+ docker exec -it mock_web_app python3 -m unittest doctor.tests
 
 
 ## Building Images

diff --git a/README.md b/README.md
@@ -100,6 +100,25 @@ Valid requests will receive a JSON response with the following keys:
  - `extracted_by_ocr`: Whether OCR was needed and used during processing.
  - `page_count`: The number of pages, if it applies.
 
+### Endpoint: /extract/recap/text/
+
+Given a RECAP PDF, extract the text using PDF Plumber, OCR, or a combination of the two.
+
+Parameters:
+
+ - `strip_margin`: Whether doctor should crop the edges of the RECAP document during processing. With PDF Plumber, it ignores the traditional 1-inch margin. With OCR, it lowers the threshold for hiding OCR gibberish. To enable it, set `strip_margin` to `True`:
+
+```bash
+curl 'http://localhost:5050/extract/recap/text/?strip_margin=True' \
+ -X 'POST' \
+ -F "file=@doctor/recap_extract/gov.uscourts.cacd.652774.40.0.pdf"
+```
+
+Valid requests will receive a JSON response with the following keys:
+
+ - `content`: The UTF-8 encoded text of the file.
+ - `extracted_by_ocr`: Whether OCR was needed and used during processing.
+
 
 ## Utilities
 

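The curl call documented in the README hunk above could also be issued from Python. The following is a minimal sketch, not part of this PR: the helper names `build_extract_url` and `post_recap_pdf` are hypothetical, the `requests` library is assumed to be installed, and the base URL is taken from the README example. Only the endpoint path, the `strip_margin` query parameter, the `file` form field, and the `content` and `extracted_by_ocr` response keys come from the diff.

```python
from urllib.parse import urlencode, urljoin

# Endpoint path as documented in the README hunk above.
ENDPOINT_PATH = "/extract/recap/text/"


def build_extract_url(base_url: str, strip_margin: bool = False) -> str:
    """Build the request URL, adding ?strip_margin=True only when requested.

    Hypothetical helper for illustration; it is not part of doctor.
    """
    url = urljoin(base_url, ENDPOINT_PATH)
    if strip_margin:
        url += "?" + urlencode({"strip_margin": "True"})
    return url


def post_recap_pdf(base_url: str, pdf_path: str, strip_margin: bool = False) -> dict:
    """POST a PDF as the multipart `file` field and return the JSON body.

    Assumes the third-party `requests` package is available; it is imported
    lazily so the URL helper above works without it.
    """
    import requests

    with open(pdf_path, "rb") as fh:
        response = requests.post(
            build_extract_url(base_url, strip_margin),
            files={"file": fh},
            timeout=120,  # arbitrary value; OCR may be slow
        )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Shows the URL only; call post_recap_pdf() against a running doctor
    # container to get the `content` and `extracted_by_ocr` keys.
    print(build_extract_url("http://localhost:5050", strip_margin=True))
```

With a doctor container listening on port 5050, `post_recap_pdf("http://localhost:5050", "doctor/recap_extract/gov.uscourts.cacd.652774.40.0.pdf", strip_margin=True)` would mirror the curl example, assuming that file exists relative to the working directory.
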
diff --git a/doctor/forms.py b/doctor/forms.py
@@ -95,6 +95,7 @@ def clean(self):
 class DocumentForm(BaseFileForm):
  ocr_available = forms.BooleanField(label="ocr-available", required=False)
  mime = forms.BooleanField(label="mime", required=False)
+ strip_margin = forms.BooleanField(label="strip-margin", required=False)
 
  def clean(self):
  self.clean_file()