This project builds a data pipeline that extracts data from a public dataset, transforms it, and loads it into a data warehouse, where it is then visualized with PowerBI. It was developed as the final project for the Data Engineering course at DataTalksClub.
This project is a demonstration of how to use Apache Airflow, Terraform, Google Cloud Storage, Google BigQuery and PowerBI to create a data pipeline. It can certainly be improved, but it should suffice as a proof of skills. :)
The dataset contains the number of passengers per turnstile at each station, in 15-minute intervals and broken down by ticket type, for the year 2023 in Buenos Aires, Argentina.
The data pipeline extracts the data from the dataset, transforms it into a more suitable format, and loads it in partitioned form into the data warehouse. The data is then modeled with dbt and visualized with PowerBI.
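As a rough illustration of the transform step, the sketch below reads the raw CSV and writes month-partitioned Parquet files. The column name (`fecha`) and the monthly partitioning scheme are assumptions for illustration and may not match the actual dataset schema or the DAG's code.

```python
# Illustrative transform: raw CSV -> month-partitioned Parquet files.
# NOTE: the column name "fecha" and the monthly partitioning are assumptions,
# not necessarily the schema or scheme used by the real pipeline.
import pandas as pd


def transform(raw_csv_path: str, output_dir: str) -> None:
    df = pd.read_csv(raw_csv_path)
    # Parse the timestamp column so the data can be partitioned by month.
    df["fecha"] = pd.to_datetime(df["fecha"], errors="coerce")
    df = df.dropna(subset=["fecha"])
    df["year_month"] = df["fecha"].dt.strftime("%Y-%m")
    # One Parquet file per month; these partitions are later uploaded to GCS
    # and exposed to BigQuery as an external table.
    for year_month, part in df.groupby("year_month"):
        part.drop(columns=["year_month"]).to_parquet(
            f"{output_dir}/passengers_{year_month}.parquet", index=False
        )
```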
- Python - Programming language
- Docker - Containerization
- Terraform - Infrastructure as code
- Apache Airflow - Data pipeline orchestration
- Google Cloud Storage - Data storage
- Google BigQuery - Data warehouse
- dbt - Data modeling
- PowerBI - Data visualization
- Clone the repository
- Run `docker-compose up` to start the Airflow server
- Access `localhost:8080` to open the Airflow UI with `airflow:airflow` as user/password
- Modify the Terraform files to set up the infrastructure as needed
- Run `terraform init` to initialize the Terraform project
- Run `terraform apply` to create the infrastructure
- Trigger the `caba_etl` DAG (a minimal sketch of what such a DAG looks like is shown after this list)
- Access the BigQuery console and create a new external table named `data` based on the partitioned data in GCS
- Build the model using `dbt`
- The data is ready to be used in PowerBI or similar tools
- A GCS service account key file is required. Place it inside the `./airflow/tmp` folder with the name `key.json` (a rough authentication sketch follows these notes).
- The final PowerBI report is available in the `reports` folder in PDF format. The source `.pbix` file is also provided.
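For reference, a pipeline task could authenticate against GCS with that key roughly as sketched below. The container-side path and the bucket/file names are assumptions about how `./airflow/tmp` is mounted, not the project's exact wiring.

```python
# Rough sketch of authenticating the GCS client with the service account key.
# The path assumes ./airflow/tmp is mounted at /opt/airflow/tmp inside the container;
# adjust it to match the actual docker-compose volume mapping.
from google.cloud import storage

KEY_PATH = "/opt/airflow/tmp/key.json"


def upload_partition(local_path: str, bucket_name: str, blob_name: str) -> None:
    client = storage.Client.from_service_account_json(KEY_PATH)
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
```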