Metric Calculation
As discussed previously, our ingest system is capable of transforming raw, arbitrary data about the criminal justice system into organized, normalized information in a common format. However, that is of little use if we don't then analyze that information to understand the performance and health of our criminal justice system.
There is a wealth of high-quality research in criminal justice, but the space faces severe challenges. Virtually every piece of research behind reform requires enormous effort to acquire, prepare, and analyze data. Even having done so, the output of the research is all too often limited in its potential impact. Just as there has historically been no common schema for justice data, there has historically been little standardization of metric definitions and methodologies -- because of this, reproducing results remains a challenge and comparisons across jurisdictions or agencies can be waved away, reducing accountability.
We calculate standardized metrics from our common schema. Calculations include both batch-oriented processing (looking at historical data to produce more sophisticated metrics, such as recidivism analysis) and query-driven processing. We look at all combinations of dimensions in the underlying dataset to create a sliding scale of granularity to empower different kinds of users with the right level of detail. By calculating from a universal dataset composed from all of the incoming data systems, we can track outcomes on an ongoing basis.
There are currently three channels for calculation: batch processing, query processing, and manual exploration. All of these calculations are stored in our data warehouse.
- Batch processing - via Cloud Dataflow, batch jobs can be executed which perform complex logic on entities and entity graphs to identify and measure certain events within the justice system. For example, recidivism measurement requires looking at the full history of a person's interactions with the correctional and supervision systems. These jobs read individual-level data exported from our database to our data warehouse, and write metrics back into the data warehouse where they can be directly consumed, or joined with query results. A minimal sketch of this read-compute-write flow appears after this list.
- Query processing - BigQuery, our data warehouse, supports the registration of views: virtual tables defined by saved SQL queries. For many classes of metrics, a SQL query that joins across some number of tables is sufficient to produce the desired calculations. Views can reference other views and the tables produced by batch processing jobs, providing ample flexibility to share common query logic, produce methodological variants, and build up a common calculation language. A sketch of view registration also appears after this list.
- Querying a view is much like querying a standard table in that it simply executes a SQL query and returns the corresponding result set -- these result sets must be provided to consumers who want to report on the information. This is described in further detail in Data Warehouse.
- Manual exploration - BigQuery permissions can be granted on specific datasets to specific users, allowing both internal staff and authorized partners to explore the data warehouse with all of the tooling available to BigQuery users, including the BigQuery console, Python/Jupyter notebooks, scripting via R or Python, and more. Virtually all of the calculations that eventually end up in consuming applications begin as manual exploration, and some projects involve a significant initial exploration effort while the trail is still being blazed.
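
To make the batch channel concrete, below is a minimal sketch of the read-compute-write shape described above, assuming the Apache Beam Python SDK (which is what Cloud Dataflow runs). The table names, fields, and the toy "return count" metric are illustrative placeholders, not our production pipelines.

```python
# Minimal sketch of a batch metric job, assuming the Apache Beam Python SDK
# (apache-beam[gcp]). The table names, fields, and the toy "return count"
# metric are illustrative placeholders, not the production pipelines.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def key_by_person(row):
    # Key each exported incarceration period by person so that a person's
    # full history can be examined together.
    return (row["person_id"], row)


def count_returns(person_id, periods):
    # Toy metric: count admissions that follow a prior release.
    ordered = sorted(periods, key=lambda r: r["admission_date"])
    returns = sum(
        1
        for prev, nxt in zip(ordered, ordered[1:])
        if prev.get("release_date") and nxt["admission_date"] > prev["release_date"]
    )
    return {"person_id": person_id, "return_count": returns}


def run():
    # Runner, project, region, etc. are supplied as command-line flags.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPeriods" >> beam.io.ReadFromBigQuery(
                query="SELECT person_id, admission_date, release_date "
                      "FROM `my-project.state.incarceration_periods`",
                use_standard_sql=True)
            | "KeyByPerson" >> beam.Map(key_by_person)
            | "GroupHistory" >> beam.GroupByKey()
            | "ComputeMetric" >> beam.MapTuple(count_returns)
            | "WriteMetrics" >> beam.io.WriteToBigQuery(
                "my-project:metrics.reincarceration_counts",
                schema="person_id:INTEGER,return_count:INTEGER",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
        )


if __name__ == "__main__":
    run()
```

The real pipelines perform far richer event identification than this, but the overall shape -- read individual-level data from the warehouse, compute per-person metrics, write them back -- is the same.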
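For the query processing channel, here is a minimal sketch of registering a view with the google-cloud-bigquery client library. The project, dataset, and query are hypothetical placeholders standing in for real view definitions.

```python
# Minimal sketch of registering a metric view, assuming the
# google-cloud-bigquery client library. The project, dataset, and query
# are hypothetical placeholders for real view definitions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

view = bigquery.Table("my-project.views.admissions_by_month")
view.view_query = """
    SELECT
      state_code,
      EXTRACT(YEAR FROM admission_date) AS year,
      EXTRACT(MONTH FROM admission_date) AS month,
      COUNT(*) AS admission_count
    FROM `my-project.state.incarceration_periods`
    GROUP BY state_code, year, month
"""

# Create the view if it does not already exist; exists_ok avoids an error
# when the script is re-run.
client.create_table(view, exists_ok=True)
```

Once registered, the view can be queried like any other table, and other views can select from it to layer on additional logic.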
A wide variety of access patterns are available for the calculations in our data warehouse, given BigQuery's broad API surface. At present, we have established a few main access patterns, but expansion is on the way.
- Programmatic - the BigQuery API has client bindings in most popular languages, including Java, R, and Python. We have used the API directly to power our batch processing jobs, and any authorized partner can read from the warehouse in the same fashion once they have been granted access to a desired dataset. It is likely that at a future point we will host a shim API in front of this for domain-specific calculation retrieval. A sketch of this access pattern follows this list.
- Direct SQL querying - the BigQuery console provides authenticated users the ability to execute SQL directly in the browser. This is useful for manual exploration and troubleshooting, but is also sufficient for a good number of users whose informational needs are quantitatively simpler.
- Data exchange - on a case-by-case basis, we may directly transfer data out of BigQuery to a partner through authorized exports of some subset of a desired dataset. This tends to be a one-off operation, but it is conceivable that some future exchange would be built off of periodic exports of datasets to an available data portal.
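As a concrete example of programmatic access, the following sketch reads a metric table with the google-cloud-bigquery Python client using a parameterized query. The dataset, table, and state code are placeholders; any client language with BigQuery bindings would follow the same pattern.

```python
# Minimal sketch of programmatic read access, assuming the
# google-cloud-bigquery client and credentials that have been granted
# access to the dataset. Names and the state code are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT state_code, year, month, admission_count
    FROM `my-project.metrics.admissions_by_month`
    WHERE state_code = @state_code
    ORDER BY year, month
"""
job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("state_code", "STRING", "US_XX"),
        ]
    ),
)

for row in job.result():
    print(row["year"], row["month"], row["admission_count"])
```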
By convention, we opt to pre-calculate as much as possible instead of performing calculations on the fly as client applications require. This is for a few simple reasons:
- High dimensionality - our batch processing in particular, but also our query processing, is designed to produce as many breakdowns of a given metric as possible, i.e. all combinations of the dimensions in the metric space (a toy illustration follows this list). This enables significantly different use cases, with significantly different required granularity, to be built off of common metrics. Calculating the full matrix (sparse though it may be) of something like, say, recidivism metrics, means that we always have the number we need, when we need it.
- Calculation is cheap - though calculation can sometimes be complex, it is computationally and fiscally cheap. The overall scale of justice data is such that re-calculating all breakdowns for all metrics across all of time every day is totally feasible.
- Lower complexity - the system complexity of performing arbitrary on-demand calculation would appear to be significantly higher than that of a system that automates ingest and calculation from end to end and outputs metrics into desired forms and locations.
- Stronger invariants - because everything is pre-calculated every day, at any given point in time a user can assume that the calculation they are looking at is the most up-to-date version thereof, unless they explicitly requested an older version. Similarly, users looking at sets of related metrics can be confident that the metrics are in sync and based off of a consistent sample of base data.
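
To illustrate what "all combinations of dimensions" means in practice, the toy sketch below enumerates every subset of a small dimension space and counts rows for each breakdown. The dimension names and rows are invented for illustration; the real calculations operate over the full metric space described above.

```python
# Toy illustration of producing every breakdown of a metric across all
# combinations of its dimensions. The dimension names and rows are invented;
# the point is the enumeration of the (sparse) matrix of breakdowns.
from collections import Counter
from itertools import combinations

DIMENSIONS = ("race", "gender", "age_bucket")

rows = [
    {"race": "BLACK", "gender": "MALE", "age_bucket": "25-29"},
    {"race": "WHITE", "gender": "FEMALE", "age_bucket": "30-34"},
    {"race": "BLACK", "gender": "MALE", "age_bucket": "30-34"},
]

counts = Counter()
for row in rows:
    # The empty subset (r == 0) yields the fully aggregated total; the full
    # subset yields the most granular breakdown.
    for r in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, r):
            key = tuple((d, row[d]) for d in dims)
            counts[key] += 1

for key, value in sorted(counts.items()):
    print(dict(key), value)
```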