This repository aims to create a simple set of spreadsheets extracted from the EPA's Risk Management Program database, obtained by the Data Liberation Project via FOIA.
The sql/
directory contains two files, each with SQLite commands:
sql/views.sql
defines a series of SQL views based on the raw data.sql/export.sql
exports the top-level views (Facilities
,Submissions
, andAccidents
) to CSV files indata/output/
The data/output/
directory contains three files, each the result of the SQL commands above:
-
data/output/facilities.csv
: One row per facility in the RMP database, providing its name, state, city, latest-listed owners/operator, number of submissions, date of latest RMP submission (by latest EPA-validation date), chemicals reported in latest submission, number of accidents reported in latest submission, and chemicals involved in those accidents. -
data/output/submissions.csv
: One row per RMP submission (rather than one row per facility), with similar information as above. -
data/output/accidents.csv
: One row per accident reported, with obvious duplicates removed. Due to reporting requirements, facilities may report the same accident in multiple submissions over time. This causes duplicate accidents to appear in the raw data.- The Data Liberation Project has attempted to remove these duplicates, using a simple rule: If two accidents are reported in the same submission, they are not duplicates of one another; otherwise, all reports of accidents on the same day at the same facility are considered to reference the same event, and only the most recent submission’s version of the accident report is retained.
- Cross-checking the results against public accounting by the EPA, and spot-checking the results internally, suggests that this is a robust approach. Weaknesses include potentially double-counting (a) accidents for which a facility has changed the date between submissions, (b) accidents that a facility has reported twice in the same submission (of which there appear to be just one or two likely instances).
-
data/output/accidents-with-duplicates.csv
: One row per accident listed in the raw data, including duplicates (see note directly above).
The data/dictionaries/
directory contains a data dictionary for each of the output files, explaining the columns.
The data/changes/
directory contains one subdirectory per RMP data-update, with each subdirectory named YYYY-MM
after the year and month of the Data Liberation Project received the updated data from the EPA.
Each subdirectory contains one file per file in the data/output/
directory, but suffixed with .json
instead of .csv
. Each of these JSON files contains the output of csv-diff --json old-file.csv new-file.csv --id [...]
, run using the csv-diff
command-line tool. (See the Makefile
for specifics.) They indicate the rows added, removed, and changed between the older and newer data.
- If the
data/raw/
directory does not already exist, create it. - Download
RMPData.sqlite
from this Google Drive folder. - Copy or move that file into
data/raw/
.
make data
will regenerate the output data from the SQL files above.
This repository's code is available under the MIT License terms. The data files are available under Creative Commons' CC BY-SA 4.0 license terms.
File them as an issue in this repository or email Jeremy at [email protected].