This is one of two repositories containing the code for the DEB Project. This repository holds the application code, such as the Airflow DAGs, while the DEB Infrastructure repository provisions the cloud resources. Separating these concerns across two repositories follows GitOps principles.
This repository is part of the Automated and Scalable Data Platform project, developed as a capstone for the Google Africa Data Engineering Bootcamp (Q3-2023). The project implements an end-to-end data pipeline to analyze user behavior in the context of movie reviews, leveraging modern data engineering tools and cloud platforms.
This application repository focuses solely on the Airflow application for running the data pipelines. Airflow is deployed on Google Kubernetes Engine (GKE) using the Kubernetes Executor for scalability and performance.
The required infrastructure (e.g., Kubernetes cluster, GCS buckets, BigQuery datasets) is provisioned via the DEB Infrastructure Repository.
The goal of this project is to design and implement an end-to-end solution for collecting, processing, and analyzing movie review data, providing actionable insights through a robust data pipeline. The project involves integrating multiple data sources, applying business rules, and delivering insights via an OLAP-ready data warehouse.
- **Data Integration:** Collect data from external vendors and internal sources, including:
  - User purchase records from PostgreSQL.
  - Daily review data in CSV files.
  - Log session metadata in CSV files.
- **Data Processing:**
  - Clean and transform raw data.
  - Classify sentiments in movie reviews using a custom Hugging Face model (a sketch follows this list).
  - Generate dimensional and fact tables in BigQuery for downstream analytics.
- **Pipeline Orchestration:**
  - Automate ETL/ELT workflows using Apache Airflow.
- **Business Insights:**
  - Create an OLAP-ready `fact_movie_analytics` table for querying insights such as:
    - Device usage trends.
    - Geographic distribution of reviews.
    - Sentiment analysis outcomes.
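As a rough sketch of the sentiment-classification step, the snippet below uses the Hugging Face `transformers` pipeline API. The model id shown is a public placeholder; the project uses its own fine-tuned BERT checkpoint, whose identifier is not listed here.

```python
# Minimal sketch of the sentiment-classification step.
# The model id below is a public placeholder; swap in the
# project's fine-tuned BERT checkpoint.
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

reviews = [
    "An instant classic. The pacing was perfect.",
    "Two hours of my life I will never get back.",
]

for review, result in zip(reviews, classifier(reviews)):
    # Each result looks like {"label": "POSITIVE", "score": 0.99}.
    print(f'{result["label"]:<8} {result["score"]:.3f}  {review}')
```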
- **Orchestration:** Apache Airflow manages and schedules all pipeline tasks.
- **Data Processing:**
  - PySpark on Dataproc for large-scale transformations.
  - Sentiment analysis using a fine-tuned BERT model.
- **Storage:**
  - Google Cloud Storage (GCS) for raw and staged data.
  - BigQuery for structured and analytical data.
- **Data Modeling:** dbt (data build tool) for dimensional modeling.
- **CI/CD:** GitHub Actions for automated deployments.
- **Infrastructure as Code:** Terraform for provisioning GCP resources.
- **Raw Data Ingestion:**
  - Load user purchase data into PostgreSQL.
  - Upload movie reviews and log session data to GCS.
- **Data Processing:**
  - Transform data using PySpark.
  - Perform sentiment analysis on reviews.
- **Staging and Modeling:**
  - Load cleaned data into BigQuery staging tables.
  - Generate dimensional and fact tables using dbt.
- **Insights Delivery:**
  - Expose analytical queries through the `fact_movie_analytics` table. (A minimal Airflow DAG wiring these stages together is sketched below.)
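To make the orchestration concrete, here is a minimal Airflow DAG skeleton wiring the four stages together in sequence. The DAG id and placeholder callables are illustrative; the actual DAGs in this repository use GCP operators (Dataproc, BigQuery, GCS) rather than plain Python tasks.

```python
# Illustrative DAG skeleton for the four pipeline stages; names are
# hypothetical and the real DAGs use GCP-specific operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _placeholder(stage: str) -> None:
    # Stand-in for the real ingestion/processing/modeling logic.
    print(f"running stage: {stage}")

with DAG(
    dag_id="movie_analytics_pipeline",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="raw_data_ingestion",
        python_callable=_placeholder,
        op_args=["ingest"],
    )
    process = PythonOperator(
        task_id="data_processing",
        python_callable=_placeholder,
        op_args=["process"],
    )
    model = PythonOperator(
        task_id="staging_and_modeling",
        python_callable=_placeholder,
        op_args=["model"],
    )
    deliver = PythonOperator(
        task_id="insights_delivery",
        python_callable=_placeholder,
        op_args=["deliver"],
    )

    # The stages run strictly in sequence.
    ingest >> process >> model >> deliver
```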
- **Programming Languages:** Python, SQL
- **Data Orchestration:** Apache Airflow
- **Data Processing:** PySpark, Dataproc
- **Data Modeling:** dbt
- **Storage:** Google Cloud SQL (PostgreSQL), GCS, BigQuery
- **Infrastructure as Code:** Terraform
- **CI/CD:** GitHub Actions
- **Sentiment Analysis:** Hugging Face BERT model
The deployment and management of the application are automated through a GitHub Actions workflow. The steps are outlined below:
- **Build and Push Custom Airflow Image:**
  - The `build-push.yaml` workflow builds a custom Airflow image and pushes it to Google Artifact Registry.
  - The workflow then checks out the DEB Infrastructure repository and updates the `airflow-helm/values-dev.yaml` file with the new image and tag.
- **Secrets Management:**
  - Configure GitHub secrets for:
    - Your GitHub email and username.
    - A Personal Access Token for repository-scoped access.
    - The necessary GCP credentials (Workload Identity Federation).
- **Trigger the Workflow:**
  - Commit changes to `main` or a feature branch and push them to GitHub. The CI/CD workflow will automatically deploy the updated Airflow instance and pipeline.
- **Infrastructure Setup:**
  - The DEB Infrastructure repository manages the underlying cloud resources using Terraform.
- **Access the Airflow UI:**
  - Navigate to the external IP of the Airflow webserver, as shown in the Kubernetes Services list or via the `airflow-webserver` endpoint in the GCP console. (A programmatic lookup is sketched below.)
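For convenience, the external IP can also be looked up programmatically. Here is a minimal sketch using the official `kubernetes` Python client, assuming the webserver service is named `airflow-webserver` (as referenced above) and deployed in an `airflow` namespace — the namespace is an assumption — with `kubectl` already authenticated against the GKE cluster.

```python
# Hypothetical lookup of the webserver's external IP with the official
# kubernetes client. The "airflow" namespace is an assumption; the
# service name matches the airflow-webserver endpoint mentioned above.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl already targets the GKE cluster
v1 = client.CoreV1Api()
svc = v1.read_namespaced_service(name="airflow-webserver", namespace="airflow")
for ingress in svc.status.load_balancer.ingress or []:
    # Airflow's webserver listens on port 8080 by default.
    print(f"Airflow UI: http://{ingress.ip}:8080")
```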
The analytics aspect of this project is centered on transforming raw user data, movie reviews, and session logs into actionable insights stored in an OLAP-ready `fact_movie_analytics` table. This table is designed to support business intelligence tools and ad hoc queries for decision-making.

The `fact_movie_analytics` table is the core analytical table in this pipeline. It aggregates metrics such as review scores, purchase amounts, and session details. The schema is defined as follows (a PySpark sketch of how the table could be assembled appears after it):
| Column Name | Description |
|---|---|
| `customerid` | Unique ID of the customer (from user purchase data). |
| `id_dim_devices` | Foreign key linking to the `dim_devices` table. |
| `id_dim_location` | Foreign key linking to the `dim_location` table. |
| `id_dim_os` | Foreign key linking to the `dim_os` table. |
| `amount_spent` | Total amount spent by the customer (calculated as `quantity * unit_price`). |
| `review_score` | Aggregate sentiment score of the customer's reviews. |
| `review_count` | Total number of reviews written by the customer. |
| `insert_date` | Timestamp of when the record was inserted into the fact table. |
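As a rough illustration of how this fact table could be assembled, the PySpark sketch below aggregates spending and review metrics per customer. The GCS paths, the `positive_review` column, and the join logic are assumptions for illustration (dimension keys are omitted for brevity); this is not the repository's actual job.

```python
# Hypothetical sketch of assembling fact_movie_analytics with PySpark.
# Bucket paths, the positive_review column, and join keys are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fact_movie_analytics").getOrCreate()

# Staged inputs (placeholder GCS paths).
purchases = spark.read.parquet("gs://<staging-bucket>/user_purchases/")
reviews = spark.read.parquet("gs://<staging-bucket>/classified_reviews/")

# amount_spent = quantity * unit_price, summed per customer.
spending = purchases.groupBy("customerid").agg(
    F.sum(F.col("quantity") * F.col("unit_price")).alias("amount_spent")
)

# review_score aggregates sentiment labels (assumed 0/1 per review);
# review_count tallies the customer's reviews.
review_stats = reviews.groupBy("customerid").agg(
    F.sum("positive_review").alias("review_score"),
    F.count("*").alias("review_count"),
)

fact = (
    spending.join(review_stats, on="customerid", how="inner")
    .withColumn("insert_date", F.current_timestamp())
)
fact.show(5)
```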
The fact table references several dimension tables to provide additional context. These tables include:
**`dim_date`**

| Column Name | Description |
|---|---|
| `id_dim_date` | Surrogate key for the dimension. |
| `log_date` | Date of the review session. |
| `day` | Day extracted from `log_date`. |
| `month` | Month extracted from `log_date`. |
| `year` | Year extracted from `log_date`. |
| `season` | Season inferred from `log_date` (see the sketch after this table). |
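The `season` attribute is derived rather than read from the source. A minimal sketch of one way to infer it from `log_date`, assuming Northern Hemisphere meteorological seasons (the project's actual convention is not documented here):

```python
# Hypothetical month-to-season mapping for dim_date; the project's actual
# convention (hemisphere, season boundaries) is an assumption here.
from datetime import date

_SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def infer_season(log_date: date) -> str:
    """Return the meteorological season for a given date."""
    return _SEASONS[log_date.month]

print(infer_season(date(2023, 7, 14)))  # -> summer
```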
**`dim_devices`**

| Column Name | Description |
|---|---|
| `id_dim_devices` | Surrogate key for the dimension. |
| `device` | Type of device used for the review (e.g., mobile). |
**`dim_location`**

| Column Name | Description |
|---|---|
| `id_dim_location` | Surrogate key for the dimension. |
| `location` | Geographic location of the reviewer. |
**`dim_os`**

| Column Name | Description |
|---|---|
| `id_dim_os` | Surrogate key for the dimension. |
| `os` | Operating system of the device (e.g., Windows). |
**`dim_browser`**

| Column Name | Description |
|---|---|
| `id_dim_browser` | Surrogate key for the dimension. |
| `browser` | Browser used for the review session. |
The following are some example metrics and insights that can be derived from the data:
- **Review Activity:**
  - Total number of reviews across various locations.
  - Device-specific review activity in different regions.
- **Geographic Analysis:**
  - States with the highest and lowest review activity.
  - Distribution of reviews by device type in different U.S. regions.
- **Sentiment Insights:**
  - Overall positive vs. negative sentiment trends.
  - Most positively reviewed locations or devices.
- **Spending Patterns:**
  - Correlation between spending patterns and review sentiment.
  - Total revenue generated by customers who left positive reviews.
Here are a few sample SQL queries for deriving insights:
**Reviews by device in selected states:**

```sql
SELECT
    l.location AS state,
    d.device,
    SUM(f.review_count) AS total_reviews  -- SUM, since review_count is a per-customer total
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
JOIN dim_devices d ON f.id_dim_devices = d.id_dim_devices
WHERE l.location IN ('California', 'New York', 'Texas')
GROUP BY l.location, d.device
ORDER BY total_reviews DESC;
```
**Positive-review totals by state:**

```sql
SELECT
    l.location AS state,
    SUM(f.review_score) AS positive_reviews
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
GROUP BY l.location
ORDER BY positive_reviews DESC;
```
**Spending vs. positive reviews per customer:**

```sql
SELECT
    f.customerid,
    SUM(f.amount_spent) AS total_spent,
    SUM(f.review_score) AS total_positive_reviews
FROM fact_movie_analytics f
GROUP BY f.customerid
ORDER BY total_spent DESC;
```
**Review distribution by location and device:**

```sql
SELECT
    l.location AS region,
    d.device,
    SUM(f.review_count) AS review_count  -- SUM, since review_count is a per-customer total
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
JOIN dim_devices d ON f.id_dim_devices = d.id_dim_devices
GROUP BY l.location, d.device
ORDER BY review_count DESC;
```
**Top 10 customers by total spend:**

```sql
SELECT
    f.customerid,
    SUM(f.amount_spent) AS total_spent
FROM fact_movie_analytics f
GROUP BY f.customerid
ORDER BY total_spent DESC
LIMIT 10;
```
**Monthly positive-review trend:**

```sql
SELECT
    d.year,
    d.month,
    SUM(f.review_score) AS positive_reviews
FROM fact_movie_analytics f
-- Cast the TIMESTAMP down to DATE so it can match log_date.
JOIN dim_date d ON DATE(f.insert_date) = d.log_date
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
```
**Average spending by state:**

```sql
SELECT
    l.location AS state,
    AVG(f.amount_spent) AS avg_spending
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
GROUP BY l.location
ORDER BY avg_spending DESC;
```