Simple dbt project for testing out orchestration tools
The purpose of this repository is to provide a simple dbt project that can be used for testing out data pipeline orchestration (workflow management) tools like airflow, prefect and dagster. These tools do not inherently have anything to do with dbt, but I am interested in how they work with it.
dbt is a (great) tool for managing transformations on data that has been loaded to a database. Here I use the command line version, which is open source.
dbt has adapters for most common databases. For this project I use BigQuery and the public Iowa Liquor Retail Sales dataset.
This project uses BigQuery as the database that dbt deploys a view to, under a project named dbt-orchestration.
My criteria for choosing which tools to test:
- Text-based (code or configuration) data pipeline definitions
- Open source, "open core" or SAAS model
- Support for multiple types of databases
Before I tested out different tools, I decided on some qualitative categories to evaluate the tools on. They are:
- Configuration: How quickly and easily can I get something set up?
- Documentation: What type of documentation and support for the product is available?
- Scalability: I am testing a simple pipeline; would the tool be able to scale out to support an enterprise data platform?
- User experience: Would I want to use it every day?
- Monitoring: When something breaks is it easy to figure out what went wrong?
I will rate these on a scale of 1-5 stars.
Other considerations I didn't include in my primary list are: cost, data sources, destinations, scheduling, triggering, etc.
I created a two-step data pipeline:
- Create a view in BigQuery off of the Iowa Liquor Sales dataset (sketched below).
- Run tests on that view.
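The view aggregates the public `bigquery-public-data.iowa_liquor_sales.sales` table. As a rough illustration (a minimal sketch based on the view's name and the public table's columns, not the exact SQL in this repository), the model could look like:

```sql
-- models/sales_by_year_month_product.sql (illustrative sketch)
select
    extract(year from date) as sale_year,
    extract(month from date) as sale_month,
    item_description as product,
    sum(sale_dollars) as total_sales_dollars
from `bigquery-public-data.iowa_liquor_sales.sales`
group by 1, 2, 3
```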
The sections below demonstrate different ways this data pipeline can be run.
The simplest way to orchestrate the dbt CLI is to do it manually! 😊 Even when using another tool, running the commands manually is a good way to test things out.
- Install Python. I used version 3.8.10, as there currently appears to be an issue installing dbt with Python 3.9.* versions.
- Clone this repository.
- Create and activate a python virtual environment and install requirements:

  ```
  python -m venv env
  .\env\Scripts\activate
  pip install -r requirements.txt
  dbt deps
  ```
- Add a `profiles.yml` file to the `~/.dbt/` directory. This defines the database and authentication method for dbt. To help find this directory you can run the command `dbt debug --config-dir`. The profiles file should contain this:

  ```yaml
  bigquery:
    target: development
    outputs:
      development:
        type: bigquery
        method: oauth
        project: dbt-orchestration # Use the project-id here
        dataset: iowa_liquor_sales
        threads: 1
        timeout_seconds: 300
        location: US
        priority: interactive
        retries: 1
  ```
  Update the `project` value to match your Google Cloud Platform project id.
- Set up local authentication using gcloud so you can connect to BigQuery.
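Since the profile above uses `method: oauth`, dbt will pick up your local Google application default credentials. Assuming the Google Cloud SDK is installed, setting these up is typically a single command:

```
gcloud auth application-default login
```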
Running the pipeline is as simple as running the `dbt run` command. Check your project in BigQuery; under it you should see a new dataset named `iowa_liquor_sales` and a view named `sales_by_year_month_product`.
Run `dbt test` to run the schema tests defined in the `iowa_liquor_sales.yml` file.
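For illustration, the schema tests in that file might look something like this minimal sketch using dbt's built-in `not_null` test (the actual contents of the file in this repo may differ):

```yaml
version: 2

models:
  - name: sales_by_year_month_product
    columns:
      - name: product
        tests:
          - not_null
```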
| Category | Rating | Comments |
|---|---|---|
| Configuration | ⭐⭐ | Fairly quick to set up a simple pipeline; would not support a large solution. |
| Documentation | ⭐ | As good as you make it, since you'd be writing it yourself 😛. |
| Scalability | ⭐ | Would take much effort to scale. |
| User Experience | ⭐⭐ | Huge flexibility, but not a great experience for everyone. |
| Monitoring | ⭐⭐ | Since you run it manually, you literally have someone monitoring the pipeline, but that does not scale. |
Takeaway: Not a scalable solution, but a great way to test out commands.
For dagster, follow the same setup steps as for the manual method above.
The pipeline is defined in the `dagster_run_dbt.py` file. A `dbt_cli_resource` is created, and a "solid" is defined as a function that runs dbt by calling the `run` method. A second solid is created to run the dbt tests, dependent on the first solid. Finally, a pipeline is defined that runs these two solids; a sketch of this follows.
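A minimal sketch of what that file might contain, assuming dagster's legacy solid/pipeline API and the `dbt_cli_resource` from the `dagster_dbt` package (paths and names here are illustrative, not copied from the repo):

```python
from dagster import InputDefinition, ModeDefinition, Nothing, pipeline, solid
from dagster_dbt import dbt_cli_resource

# Point the dbt CLI resource at the dbt project (paths are illustrative).
my_dbt_resource = dbt_cli_resource.configured(
    {"project_dir": ".", "profiles_dir": "~/.dbt"}
)

@solid(required_resource_keys={"dbt"})
def run_dbt_models(context):
    # Equivalent to running `dbt run` from the command line.
    context.resources.dbt.run()

@solid(
    required_resource_keys={"dbt"},
    # A Nothing input creates an order-only dependency on the first solid.
    input_defs=[InputDefinition("start_after", Nothing)],
)
def test_dbt_models(context):
    # Equivalent to running `dbt test` from the command line.
    context.resources.dbt.test()

@pipeline(mode_defs=[ModeDefinition(resource_defs={"dbt": my_dbt_resource})])
def dbt_pipeline():
    test_dbt_models(run_dbt_models())
```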
dagster can be run via the CLI, via dagit (a UI), or via an API that allows you to script it. I tried running via the CLI and dagit.
To run via the CLI:

```
dagster pipeline execute -f dagster_run_dbt.py
```

To run via dagit:

```
dagit -f dagster_run_dbt.py
```
| Category | Rating | Comments |
|---|---|---|
| Configuration | ⭐⭐⭐ | Having never used dagster before, it took some reading to understand the basics, but they provided an example, so it was fairly easy to get up and running. |
| Documentation | ⭐⭐⭐ | dagster is fairly new, so there doesn't seem to be an excess of documentation and examples available, but enough to get going. |
| Scalability | ⭐⭐⭐⭐ | I think this could scale well. |
| User Experience | ⭐⭐⭐⭐ | Having options for both the CLI and the dagit UI is nice. |
| Monitoring | ⭐⭐⭐ | I didn't dig into monitoring possibilities too much, but it seems to offer a lot of flexibility. |
Takeaway: I'd like to learn more about dagster! It seems like a promising tool, but I would need to learn more about it before using it with dbt.