The pipeline is defined in `meltano.yml` and executed via `Dockerfile`. It comprises the following steps:
1. Load the data from a MySQL database (which is presumed to already have been imported - see below),
2. Store a subset of the data in a DuckDB database,
3. Run a series of transformations on the data in DuckDB via dbt,
4. Output just the achievements in a final DuckDB database (still via the dbt transformation step in 3).
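
After a run completes, the final DuckDB file can be inspected with the `duckdb` CLI. A minimal sketch, assuming the CLI is installed locally; the database path and table name below are placeholders, since the real ones are set in `meltano.yml` and the dbt models:

```sh
# List the tables the dbt step produced (path is a placeholder).
duckdb data/achievements.duckdb "SHOW TABLES;"

# Peek at the achievements output (table name is a placeholder).
duckdb data/achievements.duckdb "SELECT * FROM achievements LIMIT 10;"
```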
To feed the initial data into the MySQL database:
1. Supply a SQL dump file in `/data/`. The filename of the SQL file can be anything, but make sure that:
   - It is the only `.sql` file in the directory,
   - It loads data into the `debatovanicz` MySQL database, because that's where the pipeline expects the data to be.

   One way to produce such a dump is sketched after this list.
2. Remove any existing Docker volumes that might be left over from previous runs. Specifically, `greybox_wrapped_mysql_storage` is where the MySQL container stores its data between container runs. You can do this with `docker volume rm greybox_wrapped_mysql_storage`. (You might need to stop the MySQL container first with `docker-compose -f docker-compose.data_prep.yml down`.) A combined reset-and-run sequence is sketched after this list.
3. Now run `docker-compose -f docker-compose.data_prep.yml up` to start the pipeline. This will:
   - Start a MySQL container and load the data from the SQL dump file into it,
   - Start a Meltano container and run the pipeline steps as described above.
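
One way to produce a suitable dump for step 1: a minimal sketch, assuming shell access to the source MySQL server; the host, user, and output filename are placeholders, not values from this repo.

```sh
# Export the source database; --databases makes the dump include the
# CREATE DATABASE / USE statements for `debatovanicz`, so it loads
# into the database the pipeline expects.
mysqldump -h source-host -u someuser -p --databases debatovanicz > data/dump.sql

# Confirm it is the only .sql file in the directory.
ls data/*.sql
```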
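Putting steps 2 and 3 together, a typical reset-and-run session looks like this (same commands as above; `|| true` just tolerates a volume that does not exist yet):

```sh
# Stop any containers left over from a previous run.
docker-compose -f docker-compose.data_prep.yml down

# Drop the persisted MySQL data so the new dump is imported fresh.
docker volume rm greybox_wrapped_mysql_storage || true

# Start the pipeline: MySQL import, then the Meltano steps.
docker-compose -f docker-compose.data_prep.yml up
```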