`jaffle_shop` is a fictional ecommerce store. This project pulls data from a bucket into a SQLite database, transforms the raw data from the app database into `customers` and `orders` models ready for analytics, and then visualizes the data in Grafana.
What this repo is:
- A self-contained, end-to-end data engineering playground built with Airflow and a dbt project, useful for testing out scripts and for communicating some core engineering concepts.
- It contains seeds with some (fake) raw data from a fictional app, along with some basic dbt models, tests, and docs for this data.
The raw data consists of customers, orders, and payments, with the following entity-relationship diagram:
If you're just starting your cloud data warehouse journey and are hungry to get started with dbt before your organization officially gets a data warehouse, you should check out this repo.
If you want to run 28 SQL operations with dbt in less than 1 second, for free, and all on your local machine, you should check out this repo.
If you want an adrenaline rush from a process that used to take dbt newcomers 1 hour and now takes less than 1 minute, you should check out this repo.
Verified GitHub Action on dbt Performance
Prerequisites: Python >= 3.8 (required by Airflow 2.10)
To get up and running with this project:
- Clone this repository:

  ```shell
  git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
  ```
- Change into the `jaffle_shop_duckdb` directory from the command line:

  ```shell
  cd jaffle_shop_duckdb
  ```
- Install dbt and DuckDB in a virtual environment. Pick your shell below:
  POSIX bash/zsh

  ```shell
  python3 -m venv venv
  source venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install -r requirements.txt
  source venv/bin/activate
  ```

  POSIX fish

  ```shell
  python3 -m venv venv
  source venv/bin/activate.fish
  python3 -m pip install --upgrade pip
  python3 -m pip install -r requirements.txt
  source venv/bin/activate.fish
  ```

  POSIX csh/tcsh

  ```shell
  python3 -m venv venv
  source venv/bin/activate.csh
  python3 -m pip install --upgrade pip
  python3 -m pip install -r requirements.txt
  source venv/bin/activate.csh
  ```

  POSIX PowerShell Core

  ```powershell
  python3 -m venv venv
  venv/bin/Activate.ps1
  python3 -m pip install --upgrade pip
  python3 -m pip install -r requirements.txt
  venv/bin/Activate.ps1
  ```

  Windows cmd.exe

  ```bat
  python -m venv venv
  venv\Scripts\activate.bat
  python -m pip install --upgrade pip
  python -m pip install -r requirements.txt
  venv\Scripts\activate.bat
  ```

  Windows PowerShell

  ```powershell
  python -m venv venv
  venv\Scripts\Activate.ps1
  python -m pip install --upgrade pip
  python -m pip install -r requirements.txt
  venv\Scripts\Activate.ps1
  ```
Why a 2nd activation of the virtual environment?
This may not be necessary for many users, but might be for some. Read on for a first-person report from @dbeatty10.
I use `zsh` as my shell on my MacBook Pro, and I use `pyenv` to manage my Python environments. I already had an alpha version of dbt Core 1.2 installed (and yet another via pipx):

```shell
$ which dbt
/Users/dbeatty/.pyenv/shims/dbt
```

```shell
$ dbt --version
Core:
  - installed: 1.2.0-a1
  - latest:    1.1.1
  - Ahead of latest version!

Plugins:
  - bigquery:  1.2.0a1 - Ahead of latest version!
  - snowflake: 1.2.0a1 - Ahead of latest version!
  - redshift:  1.2.0a1 - Ahead of latest version!
  - postgres:  1.2.0a1 - Ahead of latest version!
```
Then I ran all the steps to create a virtual environment and install the requirements of our DuckDB-based Jaffle Shop repo:
```shell
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ python3 -m pip install --upgrade pip
(venv) $ python3 -m pip install -r requirements.txt
```
Let's examine where `dbt` is installed and which version it is reporting:

```shell
(venv) $ which dbt
/Users/dbeatty/projects/jaffle_duck/venv/bin/dbt
```

```shell
(venv) $ dbt --version
Core:
  - installed: 1.2.0-a1
  - latest:    1.1.1
  - Ahead of latest version!

Plugins:
  - bigquery:  1.2.0a1 - Ahead of latest version!
  - snowflake: 1.2.0a1 - Ahead of latest version!
  - redshift:  1.2.0a1 - Ahead of latest version!
  - postgres:  1.2.0a1 - Ahead of latest version!
```
❌ That isn't what we expected -- something isn't right. 😢
So let's reactivate the virtual environment and try again...
```shell
(venv) $ source venv/bin/activate
```

```shell
(venv) $ dbt --version
Core:
  - installed: 1.1.1
  - latest:    1.1.1
  - Up to date!

Plugins:
  - postgres: 1.1.1 - Up to date!
  - duckdb:   1.1.3 - Up to date!
```
✅ This is what we want -- the 2nd reactivation worked. 😎
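An alternative worth knowing, under the assumption (not verified in this repo) that the stale `dbt` path comes from the shell's command-hash cache: resetting that cache should also make the shell re-resolve `dbt` without a second activation.

```shell
# Assumption: the stale lookup is caused by shell command hashing,
# so clearing the cache re-resolves `dbt` to ./venv/bin/dbt.
hash -r    # bash
rehash     # zsh
which dbt  # should now point at ./venv/bin/dbt
```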
- Get the absolute path of this repo; you are going to use it a lot:

  ```shell
  echo $(pwd)  # /path/to/this/repo/local-data-eng-workshop
  ```
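  Optionally, stash the path in a variable so the configuration steps below are easier to type. `REPO_PATH` is just an illustrative name, not something this repo requires:

  ```shell
  # Hypothetical helper variable; the later steps spell out the full path instead
  export REPO_PATH="$(pwd)"
  echo "$REPO_PATH"
  ```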
- Install Airflow:

  ```shell
  # Set AIRFLOW_HOME to the repo path with airflow appended,
  # i.e. /path/to/this/repo/local-data-eng-workshop/airflow
  export AIRFLOW_HOME=$(pwd)/airflow

  AIRFLOW_VERSION=2.10.3
  PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
  CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
  pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
  ```
- Ensure your profile is set up correctly from the command line:

  ```shell
  dbt --version
  dbt debug
  ```
- Run Airflow:

  ```shell
  airflow standalone
  ```

  Then go to localhost:8080 in your browser. It will ask for a username and password. The username is `admin`; the password is randomly generated and can be found by searching for `password` in the logs of the command above.
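  If scrolling the logs is a pain: recent Airflow versions also write the generated password to a file in `$AIRFLOW_HOME` (verify the filename on your version):

  ```shell
  # The standalone command saves the generated admin password alongside the logs
  cat "$AIRFLOW_HOME/standalone_admin_password.txt"
  ```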
- Close Airflow; there should now be a new folder called `airflow`. Open the file `airflow/airflow.cfg` and change the variable `dags_folder` to point at the `dags` folder in the top-level directory, i.e.:

  ```ini
  dags_folder = /path/to/this/repo/local-data-eng-workshop/dags
  ```

  Then open the file `/path/to/this/repo/local-data-eng-workshop/dags/upload_files.py` and replace the variable `RepoBasePath` with the repo path, i.e.:

  ```python
  RepoBasePath = "/path/to/this/repo/local-data-eng-workshop"
  ```

  Then restart Airflow:

  ```shell
  airflow standalone
  ```

  Search for the DAG `gcs_to_db`, run it, and monitor it at localhost:8080.
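  If you prefer the terminal over the UI, the same DAG can be unpaused and triggered with the standard Airflow CLI (run in a second shell while `airflow standalone` is up):

  ```shell
  airflow dags unpause gcs_to_db       # new DAGs are paused by default
  airflow dags trigger gcs_to_db
  airflow dags list-runs -d gcs_to_db  # check the run state
  ```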
- Now we are going to run our dbt models and test the output of the models using the `dbt build` command:

  ```shell
  dbt build
  ```
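  If a full build is more than you need, `dbt build` accepts node selection like the other dbt commands:

  ```shell
  # Build only the customers model and everything downstream of it
  dbt build --select customers+
  ```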
- To query the data, the easiest option is to install a SQLite viewer extension in VS Code and open the file `jaffle_shop.db` in the root of this repo.
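  If you'd rather stay in the terminal, the stock `sqlite3` CLI (if installed) works on the same file:

  ```shell
  sqlite3 jaffle_shop.db ".tables"
  sqlite3 jaffle_shop.db "select count(*) from customers;"
  ```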
- Generate and view the documentation for the project:

  ```shell
  dbt docs generate
  dbt docs serve
  ```
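  Note that `dbt docs serve` defaults to port 8080, which `airflow standalone` is already using, so serving the docs on another port avoids the clash:

  ```shell
  dbt docs serve --port 8001
  ```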
- Now, to visualize the data, go to the Grafana website and install Grafana. When you start the service, go to your browser, open localhost:3000, and log in with the default username `admin` and password `admin`. Then add a new data source, select SQLite, and input the path to the SQLite database file, i.e. `/path/to/this/repo/local-data-eng-workshop/jaffle_shop.db`, then save and test the connection. Then go to the Explore tab and input the query `select * from customers`; you should see the data.
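  The SQLite data source is not built into Grafana; it ships as a community plugin. Assuming the commonly used `frser-sqlite-datasource` plugin is the one intended here, it can be installed with `grafana-cli` (restart Grafana afterwards):

  ```shell
  grafana-cli plugins install frser-sqlite-datasource
  ```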
Stitched together using the following links:
https://airflow.apache.org/docs/apache-airflow/stable/start.html
https://github.com/dbt-labs/jaffle_shop_duckdb/tree/duckdb
Troubleshooting:
You may get an error like the one below, in which case you will need to disconnect from any sessions that are locking the database:

```
IO Error: Could not set lock on file "jaffle_shop.duckdb": Resource temporarily unavailable
```

This is a known issue in DuckDB. If you are using DBeaver, this means shutting down DBeaver (merely disconnecting didn't work for me).

In the very worst case, deleting the database file will get you back in action (BUT you will lose all your data).
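Before resorting to deleting the file, `lsof` can show which process is holding the lock (Linux/macOS):

```shell
# List processes with the database file open; quit those before retrying
lsof jaffle_shop.duckdb   # or jaffle_shop.db, whichever file is locked
```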