Consider incorporating new web-portal metadata fields #61

Open
jsvine opened this issue Jun 10, 2023 · 0 comments

The APHIS portal recently added four new fields to the metadata it returns: zip, state, city, and certType. Result entries now look like this:

{
  "certNumber": "83-R-0001",
  "certType": "Class R - Research Facility",
  "city": "LARAMIE",
  "critical": 0,
  "customerNumber": "16",
  "direct": 0,
  "inspectionDate": "2023-04-17",
  "inspectionDateString": "4/17/2023",
  "legalName": "University of Wyoming",
  "nonCritical": 0,
  "reportLink": "https://aphis--c.na107.content.force.com/[...]",
  "siteName": "UNIVERSITY OF WYOMING",
  "state": "Wyoming",
  "teachableMoments": 0,
  "zip": "82071"
}

This causes csv.DictWriter to raise a ValueError when writing inspections.csv, because the field names for that CSV are based on our cached results, which did not have those fields. Commit 92179d9 prevents the error in the simplest way: it adds the extrasaction="ignore" parameter to the csv.DictWriter instantiation, so the unrecognized fields are silently dropped.
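For anyone who wants to reproduce the behavior, here is a minimal sketch. The `fieldnames` list is a shortened, illustrative subset and `write_rows` is a hypothetical helper, not the pipeline's actual code. Only `csv.DictWriter` and its `extrasaction` parameter come from the standard library.

```python
import csv
import io

# Illustrative subset of field names, standing in for those derived from cached results.
fieldnames = ["certNumber", "customerNumber", "inspectionDate", "legalName"]

# A newer portal entry that also carries the four added fields.
row = {
    "certNumber": "83-R-0001",
    "customerNumber": "16",
    "inspectionDate": "2023-04-17",
    "legalName": "University of Wyoming",
    "certType": "Class R - Research Facility",
    "city": "LARAMIE",
    "state": "Wyoming",
    "zip": "82071",
}


def write_rows(rows, fieldnames, extrasaction="raise"):
    """Hypothetical helper: write rows to an in-memory CSV and return the text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        extrasaction=extrasaction,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Default behavior ("raise"): extra keys trigger a ValueError.
try:
    write_rows([row], fieldnames)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

# With extrasaction="ignore", the extra keys are dropped instead.
lenient_csv = write_rows([row], fieldnames, extrasaction="ignore")
```

The trade-off is that "ignore" also hides any future field additions, so the CSV will not show that the portal schema changed.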

And although we can get the same data via the records we already have (and in fact are already pulling out certificate type and state), we might still want to add these four columns.

Benefits of doing this:

  • A more complete reflection of the data available through the web portal.
  • The data could provide a useful cross-check on the same information we're extracting from the PDFs.
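
A rough sketch of what such a cross-check might look like. The function `find_mismatches`, the PDF-side column names (`pdf_state`, `pdf_cert_type`), and the sample PDF values are all hypothetical, made up for illustration. The comparison ignores case and surrounding whitespace but does no other normalization (for example, it does not map state abbreviations to full names).

```python
def find_mismatches(portal_row, pdf_row, field_pairs):
    """Return (portal_field, portal_value, pdf_value) for each pair that disagrees.

    field_pairs is a list of (portal_field, pdf_field) names to compare.
    """
    mismatches = []
    for portal_field, pdf_field in field_pairs:
        portal_value = (portal_row.get(portal_field) or "").strip()
        pdf_value = (pdf_row.get(pdf_field) or "").strip()
        # Case-insensitive comparison; no abbreviation handling.
        if portal_value.casefold() != pdf_value.casefold():
            mismatches.append((portal_field, portal_value, pdf_value))
    return mismatches


# Portal values come from the example entry; PDF values are invented.
portal = {"state": "Wyoming", "certType": "Class R - Research Facility"}
pdf = {"pdf_state": "WYOMING", "pdf_cert_type": "Some other type"}

result = find_mismatches(
    portal,
    pdf,
    [("state", "pdf_state"), ("certType", "pdf_cert_type")],
)
```

Here `result` would flag only the certificate type, since the two state values differ only in case.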

Costs / limitations:

  • It'll take some work to get these new columns backfilled for the historical data while not losing key info such as the pipeline discovery date for each inspection.
  • Adding these fields will make the file sizes larger, bringing us to GitHub's individual-file size limits faster. (data/combined/inspections.csv is currently ~48MB, halfway toward the 100MB limit.)
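
One possible shape for the backfill, as a sketch only. It assumes existing rows can be matched to freshly fetched portal entries on some stable key; the key used here (`certNumber` plus `inspectionDate`) and the `discovered` column name are assumptions for illustration and may not match the real data. Existing columns are copied through untouched, and only the four new fields are filled in.

```python
NEW_FIELDS = ("certType", "city", "state", "zip")


def backfill(existing_rows, fresh_rows, key_fields, new_fields=NEW_FIELDS):
    """Add new_fields to existing rows from matching fresh rows.

    All existing columns (e.g. a discovery date) are preserved as-is.
    Rows with no match get empty strings for the new fields.
    """
    fresh_by_key = {
        tuple(r.get(k) for k in key_fields): r for r in fresh_rows
    }
    merged = []
    for row in existing_rows:
        fresh = fresh_by_key.get(tuple(row.get(k) for k in key_fields), {})
        updated = dict(row)  # copy, so the original row is not mutated
        for field in new_fields:
            updated[field] = fresh.get(field, row.get(field, ""))
        merged.append(updated)
    return merged


# "discovered" is a hypothetical name for the pipeline discovery date column.
existing = [
    {"certNumber": "83-R-0001", "inspectionDate": "2023-04-17", "discovered": "2023-05-01"},
    {"certNumber": "00-X-0000", "inspectionDate": "2020-01-01", "discovered": "2020-02-01"},
]
fresh = [
    {
        "certNumber": "83-R-0001",
        "inspectionDate": "2023-04-17",
        "certType": "Class R - Research Facility",
        "city": "LARAMIE",
        "state": "Wyoming",
        "zip": "82071",
    },
]

merged = backfill(existing, fresh, key_fields=("certNumber", "inspectionDate"))
```

If the chosen key is not unique per inspection, later fresh rows silently overwrite earlier ones in `fresh_by_key`, so the key choice would need checking against the real data first.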