dockerize whole zoning-api data flow #44

Closed · wants to merge 30 commits

Commits (30)
605c1f4 dockerize whole zoning-api data flow (TangoYankee, Apr 4, 2024)
fed7f6e ubuntu runner against postgis db (TangoYankee, Apr 5, 2024)
35c7c15 add postgis service (TangoYankee, Apr 5, 2024)
1d4a793 install packages to action (TangoYankee, Apr 5, 2024)
1ed66ff sudo !! (TangoYankee, Apr 5, 2024)
022fb4d apt -> apt-get (TangoYankee, Apr 5, 2024)
5fcb4f0 configure secrets (TangoYankee, Apr 5, 2024)
c94477d steps until update api db (TangoYankee, Apr 5, 2024)
acb242e check minio client install (TangoYankee, Apr 5, 2024)
b346fcb check minio client install (TangoYankee, Apr 5, 2024)
44b1221 containerize ubuntu (TangoYankee, Apr 5, 2024)
bd08e07 mv mc into local bin (TangoYankee, Apr 5, 2024)
08cc6cf remove slashes from mcw (TangoYankee, Apr 5, 2024)
ddc485d label for do access key (TangoYankee, Apr 5, 2024)
70d9d49 localhost database (TangoYankee, Apr 5, 2024)
6d85c22 secret key name (TangoYankee, Apr 5, 2024)
6a29241 engine port is number (TangoYankee, Apr 5, 2024)
07a10ec engine port is hard coded (TangoYankee, Apr 5, 2024)
b3c8376 hard code all engine db values (TangoYankee, Apr 5, 2024)
78afbd6 check pg_dump version (TangoYankee, Apr 5, 2024)
3db6566 remove postgres client install (TangoYankee, Apr 5, 2024)
5751114 upgrade postgres with postgres client (TangoYankee, Apr 5, 2024)
e05ab4a uninstall posgresql-14 (TangoYankee, Apr 5, 2024)
7cf9423 sudo removal of postgresql 14 (TangoYankee, Apr 5, 2024)
fc29ac7 remove of postgresql client 14 (TangoYankee, Apr 5, 2024)
bfa7a26 remove of postgresql client 14 (TangoYankee, Apr 5, 2024)
8b954eb remove errant postgresql install (TangoYankee, Apr 5, 2024)
6cb3645 install postgres-15 (TangoYankee, Apr 5, 2024)
ad75d96 kitchen sink postgres dependencies (TangoYankee, Apr 5, 2024)
359660d rely on actions to install python and postgres (TangoYankee, Apr 5, 2024)
96 changes: 96 additions & 0 deletions .github/workflows/test_update_api_database.yml
@@ -0,0 +1,96 @@
on: [push]

jobs:
  etl:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgis/postgis:15-3.4-alpine
        env:
          POSTGRES_PASSWORD: postgres
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          # Maps tcp port 5432 on service container to the host
          - 5432:5432
    steps:
      - name: check out repo code
        uses: actions/checkout@v4
      - name: Load Secrets
        uses: 1password/load-secrets-action@v1
        with:
          export-env: true
        env:
          OP_SERVICE_ACCOUNT_TOKEN: ${{ secrets.OP_SERVICE_ACCOUNT_TOKEN }}
          DO_SPACES_ENDPOINT: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_ENDPOINT"
          DO_SPACES_ACCESS_KEY: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_ACCESS_KEY"
          DO_SPACES_SECRET_KEY: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_SECRET_KEY"
          DO_SPACES_BUCKET_DISTRIBUTIONS: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_BUCKET_DISTRIBUTIONS"
          DO_ZONING_API_DB_HOST: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API/host"
          DO_ZONING_API_DB_PORT: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API/port"
          DO_ZONING_API_DB_USERNAME_DEV: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API dev/username"
          DO_ZONING_API_DB_PASSWORD_DEV: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API dev/password"
          DO_ZONING_API_DB_DATABASE_DEV: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API dev/database"
      - name: Set .env file
        run: |
          echo "BUILD_ENGINE_HOST=127.0.0.1" >> .env
          echo "BUILD_ENGINE_PORT=5432" >> .env
          echo "BUILD_ENGINE_USER=postgres" >> .env
          echo "BUILD_ENGINE_PASSWORD=postgres" >> .env
          echo "BUILD_ENGINE_DB=postgres" >> .env
          echo "DO_SPACES_ENDPOINT=$DO_SPACES_ENDPOINT" >> .env
          echo "DO_SPACES_ACCESS_KEY=$DO_SPACES_ACCESS_KEY" >> .env
          echo "DO_SPACES_SECRET_KEY=$DO_SPACES_SECRET_KEY" >> .env
          echo "DO_SPACES_BUCKET_DISTRIBUTIONS=$DO_SPACES_BUCKET_DISTRIBUTIONS" >> .env
          echo "ZONING_API_HOST=$DO_ZONING_API_DB_HOST" >> .env
          echo "ZONING_API_PORT=$DO_ZONING_API_DB_PORT" >> .env
          echo "ZONING_API_USER=$DO_ZONING_API_DB_USERNAME_DEV" >> .env
          echo "ZONING_API_PASSWORD=$DO_ZONING_API_DB_PASSWORD_DEV" >> .env
          echo "ZONING_API_DB=$DO_ZONING_API_DB_DATABASE_DEV" >> .env
      - name: Install prerequisite packages
        run: |
          sudo apt-get update
          sudo apt-get install -y wget
          sudo apt-get install -y git
      - name: Setup PostgreSQL
        uses: tj-actions/install-postgresql@v3
        with:
          postgresql-version: 15
      - name: Check postgres install
        run: pg_dump --version
      - name: Install minio client
        run: |
          sudo wget https://dl.min.io/client/mc/release/linux-amd64/mc
          sudo chmod +x mc
          sudo mv mc /usr/local/bin
      - name: Setup python
        uses: actions/setup-python@v5
        with:
          python-version-file: ".python-version"
      - name: Install python dependencies
        run: pip install -r requirements.txt
      - name: Install dbt dependencies
        run: dbt deps
      - name: Download
        run: ./bash/download.sh
      - name: Import
        run: ./bash/import.sh
      - name: Transform
        run: ./bash/transform.sh
      - name: Export
        run: ./bash/export.sh
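For local debugging, the connection this workflow builds in the `Set .env file` step can be exercised by hand; a minimal sketch, mirroring the hard-coded values above (the `PostGIS_Version()` call assumes the `postgis/postgis` service image):

```bash
# Connect to the workflow's service database on the mapped host port
PGPASSWORD=postgres psql -h 127.0.0.1 -p 5432 -U postgres -d postgres \
  -c "SELECT PostGIS_Version();"
```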

48 changes: 48 additions & 0 deletions Dockerfile
@@ -0,0 +1,48 @@
FROM ubuntu:latest

RUN apt-get update

# RUN apt install -y wget gpg gnupg2 software-properties-common apt-transport-https lsb-release ca-certificates
RUN apt-get install -y wget
RUN apt-get install -y software-properties-common

# psql from postgres-client
RUN sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
RUN wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
RUN apt-get update
RUN apt-get install -y postgresql-client-15


# minio client
RUN wget https://dl.min.io/client/mc/release/linux-amd64/mc
RUN chmod +x mc
RUN mv mc /usr/local/bin

# python
COPY requirements.txt /requirements.txt
RUN apt-get install -y python3 python3-pip
RUN pip install -r requirements.txt

# dbt
## config
COPY dbt_project.yml /dbt_project.yml
COPY package-lock.yml /package-lock.yml
COPY packages.yml /packages.yml
COPY profiles.yml /profiles.yml
## install
RUN apt-get install -y git
RUN dbt deps
## tests
COPY tests /tests

# etl
## scripts
COPY bash ./bash
## commands
COPY sql /sql
## local source files
COPY borough.csv /borough.csv
COPY land_use.csv /land_use.csv
COPY zoning_district_class.csv /zoning_district_class.csv

CMD ["sleep", "infinity"]
110 changes: 18 additions & 92 deletions README.md
@@ -5,25 +5,23 @@ This is the primary repository for the data pipelines of the Application Enginee
These pipelines are used to populate the databases used by our APIs and are called "data flows".

## Design

+For all AE data flows, there is an ephemeral database within a dockerized runner.
-For all AE data flows, there is one database cluster with a `staging` and a `prod` database. There are also `dev` databases. These are called data flow databases.

-For each API, there is a database cluster with a `staging` and a `prod` database. The only tables in those databases are those that an API uses. These are called API databases.
+For each API, there is a database cluster with a `data-qa` and a `prod` database. The only tables in those databases are those that an API uses. These are called API databases.

For each API and the relevant databases, this is the approach to updating data:

-1. Load source data into the data flow database
+1. Load source data into the data flow ephemeral database
2. Create tables that are identical in structure to the API database tables
3. Replace the rows in the API database tables

-These steps are first performed on the `staging` sets of databases. When that process has succeeded and the API's use of it has passed QA, the same process is performed on the `prod` set of databases.
+The exact data flow steps are refined while working in a `local` docker environment. After the steps are stable, they are merged into `main`. From there, they are run first against a `data-qa` API database from within the `data-flow` GitHub Action. After passing quality checks, the `data-flow` GitHub Action is targeted against the `prod` API database.

This is a more granular description of those steps (see the sketch after this list):

1. Download CSV files from Digital Ocean file storage
2. Copy CSV files into source data tables
3. Test source data tables
-4. Create API tables in the data flow database
+4. Create API tables in the data flow ephemeral database
5. Populate the API tables in the data flow database
6. Replace rows in API tables in the API database
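A minimal sketch of what steps 4-6 amount to, using hypothetical `zoning_district` and `stg_zoning_district` table names and the connection variables from `sample.env` (the real table definitions live in this repo's `sql/` scripts):

```bash
# Connection strings assembled from the .env values
BUILD_URI="postgresql://${BUILD_ENGINE_USER}:${BUILD_ENGINE_PASSWORD}@${BUILD_ENGINE_HOST}:${BUILD_ENGINE_PORT}/${BUILD_ENGINE_DB}"
API_URI="postgresql://${ZONING_API_USER}:${ZONING_API_PASSWORD}@${ZONING_API_HOST}:${ZONING_API_PORT}/${ZONING_API_DB}"

# 4 & 5. Create and populate an API-shaped table in the data flow database
psql "$BUILD_URI" -c "CREATE TABLE zoning_district AS SELECT id, label FROM stg_zoning_district;"

# 6. Export the rows, then swap them into the API database in one transaction
psql "$BUILD_URI" -c "\copy zoning_district TO '.data/zoning_district.csv' CSV HEADER"
psql "$API_URI" <<'SQL'
BEGIN;
TRUNCATE zoning_district;
\copy zoning_district FROM '.data/zoning_district.csv' CSV HEADER
COMMIT;
SQL
```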

@@ -37,112 +35,40 @@ We use a GitHub Action to perform API database updates.

We have three [environments](https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment) to configure the databases and credentials used for an API database update.

-The `dev` environment can be used on any branch. The `staging` and `production` environments can only be used on the `main` branch.
+The `dev` environment can be used on any branch. The `data-qa` and `production` environments can only be used on the `main` branch.

When an action attempts to use the `production` environment, specific people or teams named in this repo's settings must approve the run's access to the environment.

-## Local setup

-### Setup MiniO for S3 file transfers

-> [!NOTE]
-> These instructions are for local setup on macOS.

-For non-public files like our CSVs in `/edm/distribution/`, we can use [minio](https://github.com/minio/minio) for authenticated file transfers.

-#### Install

-```bash
-brew install minio/stable/mc
-```

-#### Add DO Spaces to the `mc` configuration

-```bash
-mc alias set spaces $DO_SPACES_ENDPOINT $DO_SPACES_ACCESS_KEY $DO_SPACES_SECRET_KEY
-```

-We use `spaces` here but you can name the alias anything. When you run `mc config host list` you should see the newly added host with credentials from your `.env`.

-### Setup python virtual environment

-> [!NOTE]
-> These instructions are for use of [pyenv](https://github.com/pyenv/pyenv) to manage python virtual environments. See [these instructions](https://github.com/pyenv/pyenv?tab=readme-ov-file#installation) to install it.
->
-> If you are using a different approach like [venv](https://docs.python.org/3/library/venv.html) or [virtualenv](https://virtualenv.pypa.io/en/latest/), follow comparable instructions in the relevant docs.

-The `.python-version` file defines which version of python this project uses.

-#### Install

-```bash
-brew install pyenv
-brew install pyenv-virtualenv
-```

-#### Create a virtual environment named `venv_ae_data_flow`

-```bash
-pyenv virtualenv venv_ae_data_flow
-pyenv virtualenvs
-```

-#### Activate `venv_ae_data_flow` in the current terminal

-```bash
-pyenv activate venv_ae_data_flow
-pyenv version
-```

-#### Install dependencies

-```bash
-python3 -m pip install --force-reinstall -r requirements.txt
-pip list
-dbt deps
-```

-### Setup postgres

-We use `postgres` version 15 in order to use the `psql` CLI.

-```bash
-brew install postgresql@15
-# Restart the terminal
-psql --version
-```

## Local usage

+> These instructions depend on docker and docker compose.
+> If you need to install docker compose, follow [these instructions](https://docs.docker.com/compose/install/).

### Set environment variables

Create a file called `.env` in the root folder of the project and copy the contents of `sample.env` into that new file.

Next, fill in the blank values.

-> [!IMPORTANT]
-> To use a local database, `sample_local.env` likely has the environment variable values you need.
->
-> To use a deployed database in Digital Ocean, the values you need can be found in the AE 1password vault.

+### Run the local zoning api database
+
+The `data-flow` steps are run against the `zoning-api` database. Locally, this relies on the two containers running on the same network. The zoning-api creates the network, which the data-flow then joins.
+Before continuing with the `data-flow` setup, follow the steps within `nycplanning/ae-zoning-api` to get its database running in a container.
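Before bringing up `data-flow`, it can help to confirm the shared network already exists; a quick check, using the network name from `compose.yml` below:

```bash
# Should print the network created by the ae-zoning-api compose stack
docker network inspect ae-zoning-api_data --format '{{.Name}}: {{.Driver}}'
```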

-### Run local database with docker compose
-
-Next, use [docker compose](https://docs.docker.com/compose/) to stand up a local PostGIS database.
+### Run data-flow local database with docker compose

```bash
docker compose up
```

-If you need to install docker compose, follow [these instructions](https://docs.docker.com/compose/install/).

-### Run each step
+### Run each step to complete the data flow

```bash
-./bash/download.sh
-./bash/import.sh
-./bash/transform.sh
-./bash/export.sh
-./bash/update_api_db.sh
+docker compose exec data-flow bash ./bash/download.sh
+docker compose exec data-flow bash ./bash/import.sh
+docker compose exec data-flow bash ./bash/transform.sh
+docker compose exec data-flow bash ./bash/export.sh
+docker compose exec data-flow bash ./bash/update_api_db.sh
```

If you receive an error, make sure the script has the correct permissions:
3 changes: 3 additions & 0 deletions bash/download.sh
@@ -11,6 +11,9 @@ source $ROOT_DIR/bash/utils/set_environment_variables.sh
# Setting Environmental Variables
set_envars

+# set alias
+mc alias set spaces $DO_SPACES_ENDPOINT $DO_SPACES_ACCESS_KEY $DO_SPACES_SECRET_KEY

# Download CSV files from Digital Ocean file storage
DATA_DIRECTORY=.data/
mkdir -p ${DATA_DIRECTORY} && (
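The `spaces` alias registered above is what the script's download commands go through; manual transfers work the same way (the object paths here are illustrative):

```bash
# List the distributions bucket, then pull one file into the data directory
mc ls spaces/${DO_SPACES_BUCKET_DISTRIBUTIONS}/
mc cp spaces/${DO_SPACES_BUCKET_DISTRIBUTIONS}/borough.csv .data/
```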
21 changes: 18 additions & 3 deletions compose.yml
@@ -1,12 +1,27 @@
 services:
   db:
-    build:
-      context: ./db
+    build:
+      context: db/.
     environment:
       - POSTGRES_USER=${BUILD_ENGINE_USER}
       - POSTGRES_PASSWORD=${BUILD_ENGINE_PASSWORD}
       - POSTGRES_DB=${BUILD_ENGINE_DB}
+    networks:
+      - data
     ports:
       - "8001:5432"
+  runner:
+    build:
+      context: .
+    env_file:
+      - .env
+    networks:
+      - data
     volumes:
-      - ./db-volume:/var/lib/postgresql/data
+      - ./tests:/tests
+      - ./bash:/bash
+      - ./sql:/sql
+networks:
+  data:
+    name: ae-zoning-api_data
+    external: true
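With the external network in place, a typical local session against this compose file might look like this (service names as defined above):

```bash
# Build and start both services in the background
docker compose up --build -d

# Open a shell in the runner container to invoke pipeline steps
docker compose exec runner bash
```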
2 changes: 1 addition & 1 deletion db/Dockerfile
@@ -1,4 +1,4 @@
-FROM postgis/postgis:15-3.4
+FROM postgres:15-bookworm

RUN apt update
RUN apt install -y postgresql-15-postgis-3
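Unlike the old `postgis/postgis` base image, plain `postgres:15-bookworm` with the PostGIS package installed does not create the extension in the database by itself; a manual check once the container is up might be (assuming the `.env` values are exported in your shell):

```bash
# Enable and verify PostGIS in the build database
docker compose exec db psql -U "${BUILD_ENGINE_USER}" -d "${BUILD_ENGINE_DB}" \
  -c "CREATE EXTENSION IF NOT EXISTS postgis; SELECT PostGIS_Version();"
```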
20 changes: 10 additions & 10 deletions sample.env
@@ -1,16 +1,16 @@
-BUILD_ENGINE_HOST=
-BUILD_ENGINE_PORT=
-BUILD_ENGINE_USER=
-BUILD_ENGINE_PASSWORD=
-BUILD_ENGINE_DB=
+BUILD_ENGINE_HOST=ae-data-flow-db-1
+BUILD_ENGINE_PORT=5432
+BUILD_ENGINE_USER=postgres
+BUILD_ENGINE_PASSWORD=postgres
+BUILD_ENGINE_DB=data-flow

DO_SPACES_ENDPOINT=
DO_SPACES_ACCESS_KEY=
DO_SPACES_SECRET_KEY=
DO_SPACES_BUCKET_DISTRIBUTIONS=edm-distributions

-ZONING_API_HOST=
-ZONING_API_PORT=
-ZONING_API_USER=
-ZONING_API_PASSWORD=
-ZONING_API_DB=
+ZONING_API_HOST=ae-zoning-api-db-1
+ZONING_API_PORT=5432
+ZONING_API_USER=postgres
+ZONING_API_PASSWORD=postgres
+ZONING_API_DB=zoning
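The two hostnames above assume docker compose's default `<project>-<service>-<index>` container naming; a quick way to confirm they resolve as written:

```bash
# Container names should match the BUILD_ENGINE_HOST and ZONING_API_HOST values
docker ps --format '{{.Names}}'
```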
16 changes: 0 additions & 16 deletions sample_local.env

This file was deleted.
