January 5, 2023

Updates from PDAP

The new https://pdap.io front page points people to the new "request data" workflow.
- We've been handling data requests informally already. We'd like to receive more of them, so we can better understand the kinds of things people are looking for and make more of an impact.
We're making good progress on turning our Airtable tools into a home-grown app. There's now a public repo which mirrors our Data Sources database as CSV and JSON.
We're working on ways to identify potentially useful URLs in bulk.
- GitHub issue: https://github.com/Police-Data-Accessibility-Project/planning/issues/196
- A volunteer wrote a sitemap scraper, which locates potentially useful URLs given a list. It's an open PR still under review. We want to get it merged soon: https://github.com/Police-Data-Accessibility-Project/scrapers/pull/195/files
- We did a Doccano labeling exercise in an attempt to train a machine learning model to identify content based on its URL. We still need to run experiments to close the loop on this.
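
To make the sitemap idea above concrete, here is a minimal Python sketch of pulling URLs out of a sitemap and keeping the ones that look relevant. It is not the code in the open PR. The function names, the sample sitemap, and the keyword list are assumptions made up for illustration; only the sitemaps.org XML namespace is standard.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def filter_urls(urls, keywords):
    """Keep URLs that contain at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [u for u in urls if any(k in u.lower() for k in lowered)]


# Hypothetical sitemap content, used only to exercise the functions.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.gov/police/arrest-logs</loc></url>
  <url><loc>https://example.gov/parks/events</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in filter_urls(urls_from_sitemap(SAMPLE), ["police", "arrest"]):
        print(url)
```

A real scraper would also need to fetch the sitemap over HTTP and follow sitemap index files, which this sketch leaves out.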

linklabel regex script: Police-Data-Accessibility-Project/data-source-identification#1
- written in C++; scans very large numbers of URLs for keywords
- early bottleneck: the Common Crawl URL database is ~4 TB
- to do:
  - publish the script + regex library
  - generate lists of URLs
  - get a good list of regex keywords set in advance
  - crunch through the URLs
  - ???
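
The "regex keywords, then crunch through the URLs" steps above could look roughly like the sketch below. The actual linklabel script is C++ and is not reproduced here; this is a Python stand-in to show the shape of the idea. The labels and patterns are hypothetical placeholders, since the real keyword list has not been set yet.

```python
import re

# Hypothetical label -> pattern mapping; the real keyword list is still to be decided.
KEYWORD_PATTERNS = {
    "arrests": r"arrest",
    "use_of_force": r"use[-_]?of[-_]?force",
}

# Compile once up front so each URL only pays for the search itself.
COMPILED = {
    label: re.compile(pattern, re.IGNORECASE)
    for label, pattern in KEYWORD_PATTERNS.items()
}


def label_urls(urls):
    """Yield (url, labels) for every URL that matches at least one pattern."""
    for url in urls:
        labels = [label for label, rx in COMPILED.items() if rx.search(url)]
        if labels:
            yield url, labels


if __name__ == "__main__":
    sample = [
        "https://example.gov/police/Arrest-Logs",
        "https://example.gov/parks",
    ]
    for url, labels in label_urls(sample):
        print(url, labels)
```

Because `label_urls` is a generator, it can be fed a file handle or any other iterator, so a multi-terabyte URL list never has to fit in memory at once.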
Elasticsearch
- Supports stemming
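
As a rough illustration of what stemming support means in practice, the sketch below builds index settings with a custom analyzer that uses Elasticsearch's `stemmer` token filter, so that a search for "arrest" could also match "arrests" or "arrested". The analyzer name, filter name, and field name are assumptions chosen for this example, not an agreed PDAP configuration.

```python
import json

# Hypothetical index settings: the names "english_stemmer", "stemmed_text",
# and "description" are placeholders chosen for this sketch.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "stemmed_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stemmer"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "description": {"type": "text", "analyzer": "stemmed_text"}
        }
    },
}

if __name__ == "__main__":
    # This JSON is what would be sent as the body when creating the index.
    print(json.dumps(INDEX_BODY, indent=2))
```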
Common Crawl storage
- craeft offered to spin up a 4-5 TB Linux server to hold large amounts of URLs
- goal: people can get batches of URLs off the server
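
One simple way to hand out batches of URLs is to split the stream into fixed-size chunks. The sketch below assumes nothing about how the server will actually expose them; the function name and batch size are hypothetical.

```python
from itertools import islice


def url_batches(urls, batch_size):
    """Yield lists of at most batch_size URLs, reading the input lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(urls)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    sample = [f"https://example.gov/page/{i}" for i in range(5)]
    for number, batch in enumerate(url_batches(sample, 2), start=1):
        print(number, batch)
```

Reading lazily means the same function works on a plain list or on a file of URLs opened line by line.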