This Python module allows scraping of IMR (Independent Monitoring Report) PDFs to extract CASA (Court Approved Settlement Agreement) paragraph compliance and page information into a tabular format.
imrscrape is available as an importable Python module and as a CLI tool.
Clone this repo:
git clone https://github.com/apd-forward/imr-scrape
Run setup.py with the install command (running it with no command exits with an error):
python setup.py install
imrscrape -i ./imr-8-final.pdf -o ./imr-8-data.csv
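Once the command above has written a CSV, one quick way to check the result is to read it back with the standard library. This is only a sketch: the column names are not documented here, so the helper reports whatever header the file contains. The function name `summarize_csv` is hypothetical, and UTF-8 encoding is an assumption.

```python
import csv


def summarize_csv(csv_path):
    """Return (column_names, row_count) for a CSV file.

    Hypothetical helper, not part of imrscrape. It assumes the file is
    UTF-8 encoded and that its first row is a header.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)  # read all data rows after the header
        return reader.fieldnames, len(rows)


if __name__ == "__main__":
    # Path taken from the usage example; the file must already exist.
    columns, count = summarize_csv("./imr-8-data.csv")
    print("columns:", columns)
    print("rows:", count)
```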
- -i --input [filepath] (required)
  Takes the filepath to the PDF of the IMR to be scraped.
- -o --output [filepath] (required)
  Takes the filepath to a CSV file for the results.
- -qa
  Returns a QA/QC report of possible missing paragraphs to stdout.
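The options above can also be driven from a script. The sketch below is not the module's import API, which is not described here. It calls the installed `imrscrape` command through `subprocess`, using only the flags listed above. The helper names `build_command` and `run_imrscrape` are hypothetical.

```python
import shutil
import subprocess


def build_command(pdf_path, csv_path, qa=False):
    """Assemble the imrscrape argument list from the documented flags."""
    cmd = ["imrscrape", "-i", str(pdf_path), "-o", str(csv_path)]
    if qa:
        cmd.append("-qa")  # request the QA/QC report on stdout
    return cmd


def run_imrscrape(pdf_path, csv_path, qa=False):
    """Run the CLI and return whatever it printed to stdout.

    Assumes imrscrape has been installed and is on PATH.
    """
    if shutil.which("imrscrape") is None:
        raise FileNotFoundError("imrscrape was not found on PATH")
    result = subprocess.run(
        build_command(pdf_path, csv_path, qa=qa),
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    report = run_imrscrape("./imr-8-final.pdf", "./imr-8-data.csv", qa=True)
    print(report)
```

Keeping the argument list in its own function makes it easy to check the command without running the scraper.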
This module is written using Python >3.7.0 syntax. Dependencies for development are managed with pipenv. Code is formatted with black.