This folder contains scripts to read Vision Zero Google spreadsheets and put them into the appropriate Postgres tables using Google Sheets API. This process is then automated using Airflow for it to run daily.
- 1. Data Source
- 2. The Automated Data Pipeline
- 3. Sheets Credentials
- 4. Adding a new year
- 5. Table generated
- 6. Pulling data with the command-line interface
Notes:
- Introduction to Google Sheets API can be found at Intro.
- A guide on how to get started can be found at Quickstart.
The School Safety Zone data are loaded from individual Google Sheets for every year since 2018. Those Google sheets are maintained by Vision Zero. The data is pulled daily by an Airflow pipeline (DAG) and can be also pulled manually by running the script gis.school_safety_zones.schools.py
with the appropriate arguments. The data are stored in a partitioned table structure under the vz_safety_programs_staging.school_safety_zone_raw_parent
and then transformed via downstream views.
The following sections describe the two approaches in more details.
A daily Airflow pipeline (DAG) runs to pull the data from Google sheets into the BigData database. The vz_google_sheets
DAG loads the Google sheets configurations from the Airflow variable ssz_spreadsheets
. This variable is a Json list of dictionaries with the following keys:
- year: The year of the data included in this spreadsheet.
- spreadsheet_id: The spreadsheet Id used to access the data
- spreadsheet_range: The sheet and cell range of data within the spreadsheet.
The DAG consists of two main tasks as shown in the below figure:
- get_list_of_years: A Python operator that extracts the required details out of
ssz_spreadsheets
and filters to specific years:- Defaults to current year and previous year based on execution date.
- Can override to pull past years by triggering DAG with
years
parameter specified.
- pull_data: An array of mapped tasks dynamically created based on the number of years retreived by the previous (upstream) task. Each of these mapped tasks pull the data of a specific year from the corresponding Google spreadsheet by calling the helper function
gis.school_safety_zones.schools.pull_from_sheet
. The Google spreadsheet credentials and the database connection details are stored encrypted in Airflow connections.
A credential file (named key.json
in the script) is required to connect to the Google Sheets to pull data, the contents of this file can be downloaded from the google console if you're logged in to the right google account. This is currently stored in an encrypted Airflow connection: google_sheets_api
.
Follow these steps to read in another spreadsheet for year yyyy
.
Create an empty table vz_safety_programs_staging.school_safety_zone_yyyy_raw
, where yyyy
is the year to be stored, as a child of parent table vz_safety_programs_staging.school_safety_zone_raw_parent
. Follow the format of the existing child tables (e.g. vz_safety_programs_staging.school_safety_zone_2018_raw
) and declare the inheritance:
CREATE TABLE vz_safety_programs_staging.school_safety_zone_yyyy_raw (
like vz_safety_programs_staging.school_safety_zone_2018_raw
including all
) INHERITS (vz_safety_programs_staging.school_safety_zone_raw_parent);
The sheet must be shared with [email protected]
. This ought to be View-only. This email is associated with the [email protected]
email and saved in the google_sheets_api
Airflow connection.
Add the new year details to the Airflow variable ssz_spreadsheets
as described above so that the DAG would start pulling its data.
The logs produce a WARNING when a line is skipped because it is missing the end of line marker.
Also check the downstream VIEWS to see if dates (dt
) & geometries are properly transformed
vz_safety_programs.polygons_school_safety_zones
: for school zone polygonsvz_safety_programs.points_wyss
: for the Watch Your Speed Signs locations
⚠ If there's any problems look at 5. Table generated below to see how the contents of the sheet is mapped to the tables.
The Geographic Competency Centre (GCC) has a process to pull the map data nightly from our database and expose it via ESRI API.
The script reads information from columns A, B, E, F, Y, Z, AA, AB which are as shown below
SCHOOL NAME | ADDRESS | FLASHING BEACON W/O | WYSS W/O | School Coordinate (X,Y) | Final Sign Installation Date | FB Locations (X,Y) | WYS Locations (X,Y) |
---|---|---|---|---|---|---|---|
AGINCOURT JUNIOR PUBLIC SCHOOL | 29 Lockie Ave | 9239020 | 9239021 | 43.788456, -79.281118 | January 9, 2019 | 43.786566, -79.279023 | 43.787530, -79.279456 |
from the Google Sheets and put them into postgres tables with the following fields (all in data type text):
school_name | address | work_order_fb | work_order_wyss | locations_zone | final_sign_installation | locations_fb | locations_wyss |
---|---|---|---|---|---|---|---|
AGINCOURT JUNIOR PUBLIC SCHOOL | 29 Lockie Ave | 9239020 | 9239021 | 43.788456, -79.281118 | January 9, 2019 | 43.786566, -79.279023 | 43.787530, -79.279456 |
Notes:
- The Google Sheets API do not read any row with empty cells at the beginning or end of the row or just an entire row of empty cells. It will log an error when that happens.
- The script being used reads up to line 180 although the actual data is less than that. This is to anticipate extra schools which might be added into the sheets in the future.
The data can be loaded into the database from the appropriate Google sheet(s) using the Linux Command Line Interface (CLI). The script gis.school_safety_zones.schools.py
requires some mandatory and optional arguments to load these data. The below table describes the script's arguments. For more details, run ./gis/school_safety_zones/schools.py --help
.
Argument | Description | Default Value |
---|---|---|
db-config | The configuration file containing the database connection parameters. | N/A |
year | The year to pull. | N/A |
spreadsheet-id | The Id of the Google spreadsheet containing the raw data. | N/A |
spreadsheet-range | The range of cells containing the raw data. | N/A |
schema | The PostgreSQL schema to load the data into. | vz_safety_programs_staging |
table | The PostgreSQL table to load the data into. | school_safety_zone_{year}_raw |
To be able to run the data puller script from the CLI, you need to save the database parameters in a file in the following format:
[DBSETTINGS]
host=HOSTNAME
database=DATABASENAME
username=USERNAME
password=PASSWORD
First, Google Sheets API was enabled on the google account. Then, a service account was created so that we are not prompted to sign in every single time we run the script. Instructions on how to do that can be found at Creating a service account. Go to the Service accounts
page from there, select the Quickstart
project and click on the Search for APIs and Services
bar to generate credentials. Copy the credentials and paste it into a key.json
file located in the same directory as the script. The key.json
file should look something like this:
"type": "service_account",
"project_id": "quickstart-1568664221624",
"private_key_id":
"private_key":
"client_email":
"client_id":
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url":