Database Ops

Overview

The staging (referred to throughout as "dev") and production databases are PostgreSQL databases managed by Google Cloud SQL. In the rare cases where manual access is required, they can be reached through our dedicated, secured VM, which we will call the "data client VM."

Querying

Do not query the database directly for real workloads. The database should only be queried to validate migrations, as detailed below. All other queries should be run against the data warehouse in BigQuery.

Data Client VM

The majority of interactions with the database are done through the data client VM. (Despite the name, you can also work with the dev database from there.)

The VM is accessed by running gcloud compute ssh [client-name].

If it is your first time logging in to the VM, you will need to set a password when you run ssh. You will then be prompted to re-enter the same password to access the VM. Once you have successfully logged in, run initial-setup from your home directory. This will set up the git repository, pip environment, and SSL certificates you will need to work with the database.

Migrations

Migration warnings and constraints

  • Do not make any changes to schema.py without also running a migration to update the database. All jobs will fail for any release in which schema.py does not match the database.

  • A migration should consist of a single change to the schema (a single change meaning a single conceptual change, which may consist of multiple update statements). Do not group changes.

  • For historical tables, the primary key exists only due to requirements of SQLAlchemy, and should never be referenced elsewhere in the schema.

  • If adding a column of type String, the length must be specified (this keeps the schema portable, as certain SQL implementations require an explicit length for String columns).

  • Do not explicitly set the column name argument for a column unless it is required to give a column the same name as a Python reserved keyword.

Running migrations

All migrations should be run from the data client VM. Once you are in the VM, follow the steps below according to the type of migration.

NOTE: All commands below assume you are running in your pip environment. To launch it, run pipenv shell from the top-level package of recidiviz-data (not your home directory).

Generating the migration

All the below options for generating a migration require you to specify a migration name. Use the same naming style as a Git branch, e.g. add_bond_id_column.

NOTE: There is a strange behavior that sometimes occurs with enums and auto-generated migrations. Typically, alembic will not detect changes to an enum such as adding or removing a value, which is why the create_enum_migration script exists. Sometimes, however, alembic will detect those changes and include statements to update the enum values in the auto-generated migration. When the migration is run, these statements have no effect: the migration will run to completion, but the enum values in the database will remain the same. Therefore, if you change the values of an enum, always create the migration manually using the create_enum_migration script.

Adding a value to an enum
  1. Log into prod (readonly-prod-psql for county corrections or readonly-prod-state-psql for state corrections and supervision) and run SELECT enum_range(NULL::<enum_name>); to get the current set of values for the enum.

  2. Run recidiviz/tools/create_enum_migration.py according to the instructions in its file docstring.

  3. Update the generated migration according to the included comments using the enum values from the query.
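For orientation, the body of such a migration generally follows a rename-recreate-repoint pattern. The sketch below is illustrative only: the bond table, bond_type column/enum, and the specific values are hypothetical, and the real file is produced by create_enum_migration with its own revision identifiers and comments.

from alembic import op
import sqlalchemy as sa

# Existing values, copied from: SELECT enum_range(NULL::bond_type);
old_values = ['CASH', 'DENIED']
# The same list plus the value being added
new_values = ['CASH', 'DENIED', 'UNSECURED']


def upgrade():
    # Rename the existing type out of the way, create the replacement,
    # repoint the column, then drop the old type.
    op.execute('ALTER TYPE bond_type RENAME TO bond_type_old;')
    sa.Enum(*new_values, name='bond_type').create(bind=op.get_bind())
    op.alter_column('bond', 'bond_type',
                    type_=sa.Enum(*new_values, name='bond_type'),
                    postgresql_using='bond_type::text::bond_type')
    op.execute('DROP TYPE bond_type_old;')


def downgrade():
    # Reverse of the above, restoring the original value set.
    op.execute('ALTER TYPE bond_type RENAME TO bond_type_old;')
    sa.Enum(*old_values, name='bond_type').create(bind=op.get_bind())
    op.alter_column('bond', 'bond_type',
                    type_=sa.Enum(*old_values, name='bond_type'),
                    postgresql_using='bond_type::text::bond_type')
    op.execute('DROP TYPE bond_type_old;')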

Adding or dropping a column or table
  1. Update schema.py.

  2. Run generate-prod-jails-auto-migration -m <migration name> or generate-prod-state-auto-migration -m <migration name>. (Note: These do not detect changes to enum values, which is why enum value migrations require the separate process outlined above.)

  3. Check that the generated migration looks right.
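For reference, an auto-generated migration for a new nullable column usually looks something like the sketch below. The person table and residency_status column are hypothetical; the real file is written by the generate-*-auto-migration command and carries its own revision identifiers.

from alembic import op
import sqlalchemy as sa


def upgrade():
    # String columns must be given an explicit length (see the warnings above).
    op.add_column('person',
                  sa.Column('residency_status', sa.String(length=255), nullable=True))


def downgrade():
    op.drop_column('person', 'residency_status')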

Changing the type or format of existing data
  1. Run the command to generate an empty migration. Either: generate-empty-jails-migration -m <migration name> or generate-empty-state-migration -m <migration name>.

  2. Follow the example in split_residence_field for how to apply a transformation to existing data.
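As a rough illustration of that pattern, a hand-written data migration applies its transformation with plain SQL inside the generated file. The sketch below uses hypothetical table and column names and is much simpler than the actual split_residence_field migration:

from alembic import op


def upgrade():
    # Rewrite existing rows in place; new rows are handled by application code.
    op.execute(
        "UPDATE person "
        "SET residency_status = 'PERMANENT' "
        "WHERE residency_status IS NULL AND residence IS NOT NULL")


def downgrade():
    # Data transformations are often not cleanly reversible; document why here.
    pass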

Applying the migration

Whereas previously developers were responsible for applying migrations before merging, this is no longer the case.

Migrations should only be applied in the context of a staging or production deployment, and they should only be run by a Recidiviz employee with the appropriate set of permissions.

Migration troubleshooting

Re-using existing enums for new columns

If you generate a migration that adds a new column that should use an existing enum type, Alembic by default will attempt to re-create that existing type, which will cause an error during migration.

To avoid this, you need to update the migration to ensure Alembic knows to re-use the existing type:

Import the postgresql dialect, since the migration must reference a PostgreSQL enum type.

from sqlalchemy.dialects import postgresql

Then in the sa.Column constructor for the relevant column, change sa.Enum to postgresql.ENUM and add create_type=False to the enum constructor, e.g.:

sa.Column(
    'some_column',
    postgresql.ENUM('VALUE_A', 'VALUE_B', create_type=False, name='enum_name'))
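In an auto-generated add-column migration, that constructor normally sits inside an op.add_column call. A hypothetical example, with placeholder table, column, and value names:

op.add_column(
    'charge',
    sa.Column(
        'some_column',
        # create_type=False stops Alembic from emitting CREATE TYPE for an enum
        # type that already exists in the database.
        postgresql.ENUM('VALUE_A', 'VALUE_B', create_type=False, name='enum_name'),
        nullable=True))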

New table or type already present in database

When SQLAlchemyEngineManager.init_engines_for_server_postgres_instances is called in server.py when a new job starts, CREATE statements will be executed for any tables and enum types (not enum values) found in one of the schema.py files that are not present in the database. This can cause problems, because the newly created tables will be owned by the role used by the task that called create_all, rather than the database owner role.

The best way to avoid this problem is to follow the procedure outlined above and not deploy against dev or prod with a change in schema.py without first running the corresponding migration. However, if the problem does arise, follow the instructions in "Manual intervention" below to manually drop any unintentionally created tables, then follow the normal migration procedure to re-create them properly.

NOTE: if this problem arises, the unintentionally created tables will likely have the wrong role as owner. This means you will not be able to drop them while logged in as the database owner role normally used for manual intervention. In this case, replace the user value in the manual intervention login command with whatever role owns the tables and log in with the password corresponding to that role. From there, you will be able to drop the tables as normal.

Alembic version issues

Alembic automatically manages a table called alembic_version. This table contains a single row containing the hash ID of the most recent migration run against the database. When you attempt to autogenerate or run a migration, if alembic does not see a migration file corresponding to this hash in the versions folder, the attempted action will fail.

Issues will arise if either your local copy of the versions folder does not match the alembic_version hash in the database, or if the alembic_version hashes in dev and prod don't match each other.

If your local copy of versions doesn't match the database, this is likely because someone else has run and merged a migration in the interim. In this case, move your migration file out of versions, merge the new migration from HEAD, use generate-empty-jails-migration or generate-empty-state-migration to create a new empty migration file (to ensure it follows the most recent migration in sequence), and copy the changes from your old copy of the file to the new one.

If the alembic_version values in the dev and production databases don't match each other, there could be a number of reasons:

  • A local manual migration was run against dev without being run against prod, presumably for testing purposes (this is fine). In this case, bring dev back into sync with prod via the steps in the "Syncing dev..." sections below.

  • Someone ran the above workflow but didn't merge their PR before running against prod (this is mostly fine). In this case, just have them merge the PR, then pull the latest migration file locally.

  • A checked-in migration was run against prod without being run against dev first (this is bad). In this case, log into prod (readonly-prod-psql for county or readonly-prod-state-psql for state) and check manually to see if any issues were introduced by the migration. If not, bring dev up to date with prod via the steps in the "Syncing dev..." sections below.

  • A local manual migration was run against prod without being checked in (this is very bad). Fixing this may require separately syncing both dev and prod with schema.py and then overwriting their alembic_version values, depending on what changes were made by the migration.

Syncing dev via dropping the database

This is the easiest way to get dev-data/dev-state-data back into sync with prod-data/prod-state-data and the versions folder.

  1. Log into the dev Postgres instance (dev-psql for county jails or dev-state-psql for state)

  2. Run the following command to get a list of table names to drop:

SELECT table_name 
FROM information_schema.tables 
WHERE table_schema='public' AND table_name != 'alembic_version';
  3. Use the results of that query to build a DROP TABLE query, e.g.:
DROP TABLE public.table_name_1, ....., public.table_name_n CASCADE;
  4. Run the following command to get a list of enum types to drop:
SELECT DISTINCT(pg_type.typname) 
FROM pg_type JOIN pg_enum 
ON pg_enum.enumtypid = pg_type.oid;
  5. Use the results of that query to build a DROP TYPE query, e.g.:
DROP TYPE public.type_name_1, ....., public.type_name_n CASCADE;
  6. Clear the alembic version:
DELETE FROM alembic_version;
  7. Log out of Postgres, then from the pipenv shell, apply the local migration to dev by running either:
# County Jails
$ migrate-dev-jails-to-head

or

# State
$ migrate-dev-state-to-head
Syncing dev via manual local migration

If for some reason it does not make sense to drop the entire dev DB when dev and prod become out of sync, this process can be used instead.

  1. Manually overwrite the alembic_version value in the dev Postgres instance (dev-data for county jails, dev-state-data for state) to match the prod Postgres instance (prod-data for county jails and prod-state-data for state) and the versions folder.

  2. Autogenerate a migration using dev (rather than prod) as the reference with either:

generate-dev-jails-auto-migration

or

generate-dev-state-auto-migration
  3. Apply the local migration to dev by running either:
# County Jails
$ migrate-dev-jails-to-head

or

# State
$ migrate-dev-state-to-head
  4. Delete the migration file.

  5. Manually overwrite the alembic_version value again, to undo the hash update caused by running the local migration.

Restoring backups

There are two types of database backup.

Automatic backups are created by Cloud SQL. A new automatic backup is created daily, with a TTL of 7 days, so one automatic backup is available for each of the last 7 days. These backups are available for both dev-data/dev-state-data and prod-data/prod-state-data.

Long-term backups are created by a cron task in cron.yaml calling an endpoint in backup_manager. A new long-term backup is created weekly, with a TTL of 26 weeks, so one long-term backup is available for each of the last 26 weeks. (Note: Cloud SQL does not allow setting an actual TTL on a backup, so this "TTL" is manually enforced by the same cron task that creates new long-term backups.) These backups are only available for prod-data.

Full restore

To restore the database to an earlier state in the case of a major error, visit the "Backups" tab for the prod-data instance on GCP, select a backup, and select "Restore". This will revert the database to its state at the time of that backup.

Partial restore

Abandon all hope, ye who enter here.

In some cases, such as a scraper writing bad data to a single region without being detected for a long period of time, you may want to restore only some subset of rows, columns, or tables, rather than reverting the entire database to some earlier state.

This process is fairly finicky, especially for more complex restores. You should probably do a practice run against dev-data/dev-state-data before trying to do this on prod.

  1. On the "Backups" tab, make a manual backup of the current state of prod-data, to ensure the initial state can be restored if the process goes awry.

  2. On GCP, create a new Cloud SQL instance with Postgres.

  3. On the "Connections" tab of the new Cloud SQL instance, add the data client VM as an authorized network.

  4. On the "Backups" tab of prod-data, choose the appropriate backup and select "Restore". Update the target instance to the temporary Cloud SQL instance (NOT the regular prod-data instance), and execute the restore.

  5. On the data client VM, connect to the temporary instance with: psql "sslmode=disable hostaddr=<IP of temporary Cloud SQL instance> user=postgres dbname=postgres"

  6. Export the data to be restored with: \copy (<query for desired data>) TO '<filename>'

  7. Using the command in the "Manual intervention" section below, connect to prod-data with manual intervention permissions.

  8. Create a temporary empty copy of the table you are restoring to: CREATE TEMP TABLE <temp table name> AS SELECT <whichever columns you need> FROM <original table> LIMIT 0;

  9. Copy from the file you created previously to the temporary table: \copy <temp table name> FROM '<filename>'

  10. Perform whatever standard SQL operations are required to copy the desired rows/columns from the temporary table to the regular table.

  11. Repeat steps 5-10 for any other tables that need to be restored.

  12. Delete the temporary Cloud SQL instance.

  13. Delete the manual backup of prod-data.

Manual intervention

In the (extremely rare) case where manual operations need to be performed directly on the database (which you shouldn't do), you can run the below command (but please don't) on the data client VM to access the database with full permissions.

The dev-psql and dev-state-psql commands always give full permissions on dev-data and dev-state-data, respectively, so manual operations there can be done freely.

Setting up the VM

If a new data client VM needs to be created, follow the detailed instructions in our internal documentation.