-
Notifications
You must be signed in to change notification settings - Fork 24
Database Ops
The staging (referred to throughout as "dev") and production databases are Postgresql databases managed by Google Cloud SQL. They can be accessed manually, in the rare cases where this is required, by going through our dedicated, secured VM, which we will call the "data client VM."
Do not query the database directly for real workloads. The database should only be queried to validate migrations, as detailed below. All other queries should be run against the data warehouse in BigQuery.
The majority of interactions with the database are done through the data client VM. (Despite the name, you can also work with the dev database from there.)
The VM is accessed by running gcloud compute ssh [client-name]
.
If it is your first time logging in to the VM, you will need to set a password when you run ssh. You will then be prompted to re-enter the same password to access the VM. Once you have successfully logged in, run initial-setup
from your home directory. This will set up the git repository, pip environment, and SSL certificates you will need to work with the database.
-
Do not make any changes to
schema.py
without also running a migration to update the database. All jobs will fail for any release in whichschema.py
does not match the database. -
A migration should consist of a single change to the schema (a single change meaning a single conceptual change, which may consist of multiple update statements). Do not group changes.
-
For historical tables, the primary key exists only due to requirements of SQLAlchemy, and should never be referenced elsewhere in the schema.
-
If adding a column of type String, the length must be specified (this keeps the schema portable, as certain SQL implementations require an explicit length for String columns).
-
Do not explicitly set the column name argument for a column unless it is required to give a column the same name as a Python reserved keyword.
All migrations should be run from the data client VM. Once you are in the VM, follow the steps below according to the type of migration.
NOTE: All commands below assume you are running in your pip environment. To launch it, run pipenv shell
from the top-level package of recidiviz-data
(not your home directory).
All the below options for generating a migration require you to specify a migration name. Use the same naming style as a Git branch, e.g. add_bond_id_column
.
NOTE: There is a strange behavior that sometimes occurs with enums and auto-generated migrations. Typically, alembic will not detect changes to an enum like adding or removing a value, which is why the create_enum_migration
script is used. However, sometimes alembic will detect those changes, and will include statements to update the enum values in the auto-generated migration. However, when the migration is run, these statements have no effect. The migration will run to completion, but the enum values in the database will remain the same. Therefore if you change the values of a enum, you should always manually create a migration using the create_enum_migration
script.
-
Log into prod (
readonly-prod-psql
for county corrections orreadonly-prod-state-psql
for state corrections and supervision) and runSELECT enum_range(NULL::<enum_name>);
to get the current set of values for the enum. -
Run
recidiviz/tools/create_enum_migration.py
according to the instructions in its file docstring. -
Update the generated migration according to the included comments using the enum values from the query.
-
Update
schema.py
. -
Run
generate-prod-jails-auto-migration -m <migration name>
orgenerate-prod-state-auto-migration -m <migration name>
. (Note: These do not detect changes to enum values, which is why enum value migrations require the separate process outlined above.) -
Check that the generated migration looks right.
-
Run the command to generate an empty migration. Either:
generate-empty-jails-migration -m <migration name>
orgenerate-empty-state-migration -m <migration name>
. -
Follow the example in split_residence_field for how to apply a transformation to existing data.
Whereas previously developers were responsible for applying migrations before merging, this is no longer the case.
Migrations should only be applied in the context of a staging or production deployment, and they should only be run by a Recidiviz employee with the appropriate set of permissions.
If you generate a migration that adds a new column that should use an existing enum type, Alembic by default will attempt to re-create that existing type, which will cause an error during migration.
To avoid this, you need to update the migration to ensure Alembic knows to re-use the existing type:
Import the postgresql
dialect, since the migration must reference a PostgreSQL enum type.
from sqlalchemy.dialects import postgresql
Then in the sa.Column
constructor for the relevant column, change sa.Enum
to postgresql.ENUM
and add create_type=False
to the enum constructor, e.g.:
sa.Column(
'some_column',
postgresql.ENUM('VALUE_A', 'VALUE_B', create_type=False, name='enum_name'))
When SQLAlchemyEngineManager.init_engines_for_server_postgres_instances
is called in server.py
when a new job starts, CREATE
statements will be executed for any tables and enum types (not enum values) found in one of the schema.py
files that are not present in the database. This can cause problems, because the newly created tables will be owned by the role used by the task that called create_all
, rather than the database owner role.
The best way to avoid this problem is to follow the procedure outlined above and not deploy against dev or prod with a change in schema.py
without first running the corresponding migration. However, if the problem does arise, follow the instructions in "Manual intervention" below to manually drop any unintentionally created tables, then follow the normal migration procedure to re-create them properly.
NOTE: if this problem arises, the unintentionally created tables will likely have the wrong role as owner. This means you will not be able to drop them while logged in as the database owner role normally used for manual intervention. In this case, replace the user
value in the manual intervention login command with whatever role owns the tables and log in with the password corresponding to that role. From there, you will be able to drop the tables as normal.
Alembic automatically manages a table called alembic_version
. This table contains a single row containing the hash ID of the most recent migration run against the database. When you attempt to autogenerate or run a migration, if alembic does not see a migration file corresponding to this hash in the versions
folder, the attempted action will fail.
Issues will arise if either your local copy of the versions
folder does not match the alembic_version
hash in the database, or if the alembic_version
hashes in dev and prod don't match each other.
If your local copy of versions
doesn't match the database, this is likely because someone else has run and merged a migration in the interim. In this case, move your migration file out of versions
, merge the new migration from HEAD, use generate-empty-jails-migration
or generate-empty-state-migration
to create a new empty migration file (to ensure it follows the most recent migration in sequence), and copy the changes from your old copy of the file to the new one.
If the alembic_version
values in the dev and production databases don't match each other, there could be a number of reasons:
-
A local manual migration was run against dev without being run against prod, presumably for testing purposes (this is fine). In this case, bring dev back into sync with prod via the steps in the "Syncing dev..." sections below.
-
Someone ran the above workflow but didn't merge their PR before running against prod (this is mostly fine). In this case, just have them merge the PR, then pull the latest migration file locally.
-
A checked-in migration was run against prod without being run against dev first (this is bad). In this case, log into prod (
readonly-prod-psql
for county orreadonly-prod-state-psql
for state) and check manually to see if any issues were introduced by the migration. If not, bring dev up to date with prod via the steps in the "Syncing dev..." sections below. -
A local manual migration was run against prod without being checked in (this is very bad). Fixing this may require separately syncing both dev and prod with
schema.py
and then overwriting theiralembic_version
values, depending on what changes were made by the migration.
This is the easiest way to get dev-data
/dev-state-data
back into sync with prod-data
/prod-state-data
and the versions
folder if it gets out of sync.
-
Log into the dev Postgres instance (
dev-psql
for county jails ordev-state-psql
for state) -
Run the following command to get a list of table names to drop:
SELECT table_name
FROM information_schema.tables
WHERE table_schema='public' AND table_name != 'alembic_version';
- Use the results of that query to build a
DROP TABLES
query, e.g.:
DROP TABLE public.table_name_1, ....., public.table_name_n CASCADE;
- Run the following command to get a list of enum types to drop:
SELECT DISTINCT(pg_type.typname)
FROM pg_type JOIN pg_enum
ON pg_enum.enumtypid = pg_type.oid;
- Use the results of that query to build a
DROP TYPE
query, e.g.:
DROP TYPE public.type_name_1, ....., public.type_name_n CASCADE;
- Clear the alembic version:
DELETE FROM alembic_version;
- Log out of Postgres, then from the
pipenv shell
, apply the local migration to dev by running either:
# County Jails
$ migrate-dev-jails-to-head
or
# State
$ migrate-dev-state-to-head
If for some reason it does not make sense to drop the entire dev DB when dev and prod become out of sync, this process can be used instead.
-
Manually overwrite the
alembic_version
value in the dev Postgres instance (dev-data
for county jails,dev-state-data
for state) to match the prod Postgres instance (prod-data
for county jails andprod-state-data
for state) and theversions
folder. -
Autogenerate a migration using dev (rather than prod) as the reference with either:
generate-dev-jails-auto-migration
or
generate-dev-state-auto-migration
- Apply the local migration to dev by running either:
# County Jails
$ migrate-dev-jails-to-head
or
# State
$ migrate-dev-state-to-head
-
Delete the migration file.
-
Manually overwrite the
alembic_version
value again, to undo the hash update caused by running the local migration.
There are two types of database backup.
Automatic backups are created by Cloud SQL. A new automatic backup is created daily, with a TTL of 7 days, so one automatic backup is available for each day for the last 7 days. These backups are available for both dev-data
/dev-state-data
and prod-data
/prod-state-data
.
Long-term backups are created by a cron task in cron.yaml
calling an endpoint in backup_manager
. A new long-term backup is created weekly, with a TTL of 26 weeks, so one automatic backup is available for each week of the last 26 weeks. (Note Cloud SQL does not allow setting an actual TTL on a backup, so this "TTL" is manually enforced by the same cron task that creates new long-term backups.) These backups are only available for prod-data
.
To restore the database to an earlier state in the case of a major error, visit the "Backups" tab for the prod-data
instance on GCP, select a backup, and select "Restore". This will revert the database to its state at the time of that backup.
Lasciate ogni speranza, voi ch'entrate
In some cases, such as a scraper writing bad data to a single region without being detected for a long period of time, you may want to restore only some subset of rows, columns, or tables, rather than reverting the entire database to some earlier state.
This process is fairly finicky, especially for more complex restores. You should probably do a practice run against dev-data
/dev-state-data
before trying to do this on prod.
-
On the "Backups" tab, make a manual backup of the current state of
prod-data
, to ensure the initial state can be restored if the process goes awry. -
On GCP, create a new Cloud SQL instance with Postgres.
-
On the "Connections" tab of the new Cloud SQL instance, add the data client VM as an authorized network.
-
On the "Backups" tab of
prod-data
, choose the appropriate backup and select "Restore". Update the target instance to the temporary Cloud SQL instance (NOT the regularprod-data
instance), and execute the restore. -
On the data client VM, connect to the temporary instance with:
psql "sslmode=disable hostaddr=<IP of temporary Cloud SQL instance> user=postgres dbname=postgres"
-
Export the data to be restored with:
\copy (<query for desired data>) TO '<filename>'
-
Using the command in the "Manual intervention" section below, connect to
prod-data
with manual intervention permissions. -
Create a temporary empty copy of the table you are restoring to:
CREATE TEMP TABLE <temp table name> AS SELECT <whichever columns you need> FROM <original table> LIMIT 0;
-
Copy from the file you created previously to the temporary table:
\copy <temp table name> FROM '<filename>'
-
Perform whatever standard SQL operations are required to copy the desired rows/columns from the temporary table to the regular table.
-
Repeat steps 5-10 for any other tables that need to be restored.
-
Delete the temporary Cloud SQL instance.
-
Delete the manual backup of
prod-data
.
In the (extremely rare) case where manual operations need to be performed directly on the database (which you shouldn't do), you can run the below command (but please don't) on the data client VM to access the database with full permissions.
The dev-psql
and dev-state-psql
commands always give full permissions on dev-data
and dev-state-data
, respectively, so manual operations there can be done freely.
If a new data client VM needs to be created, follow the detailed instructions in our internal documentation.
- Home
- Architecture
- Schemas
- Methodology
- Data Extraction
- Data Normalization
- Entity Matching
- Recidivism Measurement
- Development
- Local Development
- Create a Scraper
- Add a New Schema
- Update BigQuery Views
- Continuous Integration
- Operations