This repo shows an end-to-end example of how to use dbt to model a number of BigQuery datasets. A high-level overview of the solution is provided in the README.md file.
Follow the steps in this guide to deploy your own BigQuery data model using dbt!
Click the Start button to move to the next step.
From Cloud Shell, execute the following to install version 1.1.7 of Terraform:
export TERRAFORM_VERSION="1.1.7"
curl https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip > terraform_${TERRAFORM_VERSION}_linux_amd64.zip && \
sudo unzip -o terraform_${TERRAFORM_VERSION}_linux_amd64.zip -d /usr/local/bin && \
rm -f terraform_${TERRAFORM_VERSION}_linux_amd64.zip
Verify that this is installed correctly by executing:
terraform -v
From Cloud Shell, execute the following to install dbt:
pip3 uninstall dbt-core
pip3 install --user --upgrade dbt-bigquery
Verify that this is installed correctly by executing:
dbt --version
Execute the following to set an environment variable that will be used throughout the rest of these instructions:
export PROJECT_ROOT=$(pwd)
Open the $PROJECT_ROOT/terraform/terraform.tfvars.renameMe file.
The file should be in the following format; the fields to fill in are self-explanatory:
project_prefix = "bigquery-dbt-bank"
billing_account_id = ""
bigquery_location = "US"
When you have finished editing this file, save a copy of it as terraform.tfvars within the terraform directory:
cp $PROJECT_ROOT/terraform/terraform.tfvars.renameMe $PROJECT_ROOT/terraform/terraform.tfvars
Execute the following to create a new GCP project, enable the BigQuery API, and create a dataset named raw:
cd $PROJECT_ROOT/terraform
terraform init
terraform apply --auto-approve
Note that the project created will have a project id whose prefix is project_prefix, as defined in terraform.tfvars, and a random suffix. The project id will be shown to you after you run terraform apply.
Take note of the project id that was created. You will need this in the next step.
cd $PROJECT_ROOT/terraform
terraform output project_id
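Optionally, as a convenience (not part of the original instructions), you can capture the project id in an environment variable for use in later steps; the PROJECT_ID variable name here is just an example:

export PROJECT_ID=$(terraform output -raw project_id)
echo $PROJECT_ID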
Create a new file at ~/.dbt/profiles.yml if it doesn't already exist. Add the following block to configure this project's connection to BigQuery:
bigquery_bank:
  outputs:
    dev:
      dataset: bigquery_bank
      fixed_retries: 1
      location: <bigquery_location>
      method: oauth
      priority: batch
      project: <project_id>
      threads: 8
      timeout_seconds: 300
      type: bigquery
  target: dev
Be sure to change the project and the location to correspond to the project id created by Terraform (including the randomized suffix) and the location of the BigQuery dataset as configured in terraform.tfvars.
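If you would rather script the substitution than edit the file by hand, a rough sketch (assuming the <project_id> and <bigquery_location> placeholders appear verbatim in your ~/.dbt/profiles.yml, and that your location is US as in the example tfvars) is:

cd $PROJECT_ROOT/terraform
sed -i "s/<project_id>/$(terraform output -raw project_id)/; s/<bigquery_location>/US/" ~/.dbt/profiles.yml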
Install dbt dependencies, as indicated in the dbt/packages.yml file.
cd $PROJECT_ROOT/dbt
dbt clean
dbt deps
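For reference, dbt/packages.yml pulls in the packages whose macros are used later in this tutorial (dbt_external_tables for stage_external_sources, and codegen for generate_base_model); the exact versions in the repo may differ from this sketch:

packages:
  - package: dbt-labs/dbt_external_tables
    version: [">=0.8.0"]
  - package: dbt-labs/codegen
    version: [">=0.7.0"]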
Execute the following:
cd $PROJECT_ROOT/dbt
dbt run-operation stage_external_sources
This takes the GCS files listed in dbt/models/raw/external_tables.yml and creates corresponding external tables in BigQuery within the raw dataset.
For this tutorial, the data comes from a public Google Cloud Storage bucket, gs://williamtsoi-bigquery-bank-demo/. In your own use case, you may use data that has been extract-loaded by your platform of choice.
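For reference, each entry in dbt/models/raw/external_tables.yml follows the dbt_external_tables source format; the sketch below is illustrative only (the object path and file format are assumptions, not the repo's exact contents):

version: 2
sources:
  - name: raw
    tables:
      - name: ext_card_transactions
        external:
          location: 'gs://williamtsoi-bigquery-bank-demo/card_transactions/*'
          options:
            format: csv
            skip_leading_rows: 1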
Execute the following:
cd $PROJECT_ROOT/dbt
dbt compile
This will execute the Jinja macros from the dbt/analyses directory, such as dbt/analyses/stg_card_transactions.sql:
{{ codegen.generate_base_model(
    source_name='raw',
    table_name='ext_card_transactions'
) }}
This macro will inspect the table schema of the ext_card_transactions external table in the raw dataset, and will generate a base SQL model for the staging schema.
The generated SQL models will be placed in the dbt/target/compiled/bigquery_bank/analyses directory. For example, the corresponding base schema generated for the ext_card_transactions table is:
with source as (
    select * from {{ source('raw', 'ext_card_transactions') }}
),
renamed as (
    select
        cc_number,
        trans_id,
        trans_time,
        epoch_time,
        category,
        merchant,
        merchant_lat,
        merchant_lon,
        amount,
        is_fraud,
        trans_date
    from source
)
select * from renamed
Normally, you would model the staging schema in the dbt/models/staging directory, based on the base schema files generated in dbt/target/compiled/bigquery_bank/analyses in the previous step.
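A common way to start (optional here, since the staging models already exist) is to copy the generated base model into the staging directory and edit it from there:

cp $PROJECT_ROOT/dbt/target/compiled/bigquery_bank/analyses/stg_card_transactions.sql \
   $PROJECT_ROOT/dbt/models/staging/stg_card_transactions.sql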
This is also where you define partitioning and clustering for the tables. For example, in dbt/models/staging/stg_card_transactions.sql, we have added a config block at the top of the model file to partition the table by the trans_date column:
{{ config(
    materialized='table',
    partition_by={
      "field": "trans_date",
      "data_type": "date",
      "granularity": "day"
    }
)}}
with source as (
    select * from {{ source('raw', 'ext_card_transactions') }}
),
renamed as (
    select
        cc_number,
        trans_id,
        trans_time,
        epoch_time,
        category,
        merchant,
        merchant_lat,
        merchant_lon,
        amount,
        is_fraud,
        trans_date
    from source
)
select * from renamed
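Clustering is configured in the same config block. As an illustrative extension (not something this repo's staging model actually defines), clustering the table by the category and merchant columns would look like this:

{{ config(
    materialized='table',
    partition_by={
      "field": "trans_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=["category", "merchant"]
)}}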
For the purposes of this tutorial, however, the staging schemas have already been defined, so no action is required here.
Deploy the staging schema into BigQuery by executing:
cd $PROJECT_ROOT/dbt
dbt run
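While iterating on a single model, you can also limit the run with dbt's node selection syntax, for example:

cd $PROJECT_ROOT/dbt
dbt run --select stg_card_transactions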
Run the data tests in BigQuery by executing:
cd $PROJECT_ROOT/dbt
dbt test
Tests are defined in dbt/models/staging/stg_bigquery_bank.yml, as well as in dbt/tests/*.sql.
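As a rough illustration of the YAML-defined tests (the actual columns and tests in stg_bigquery_bank.yml may differ), a schema test entry follows this pattern:

version: 2
models:
  - name: stg_card_transactions
    columns:
      - name: trans_id
        tests:
          - not_null
          - unique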
dbt comes with powerful documentation generation facilities. Once the data model has been deployed, you can generate a documentation website using the following commands:
cd $PROJECT_ROOT/dbt
dbt docs generate
This will generate the data catalog JSON file in the dbt/target directory.
Now to serve the website, run the following commands:
cd $PROJECT_ROOT/dbt
dbt docs serve
Next, click the Web Preview icon at the top right of the window and select "Preview on port 8080".
When you are done browsing the documentation website, press Ctrl-C to stop the web server and return to the shell.
Once you are done with this tutorial, run the following commands to destroy all the resources:
cd $PROJECT_ROOT/terraform
terraform destroy --auto-approve