This is one of two repositories containing the code for the DEB Project. This repository holds the application code, such as the Airflow DAGs, while the DEB Infrastructure repository provisions the cloud resources. Separating these concerns across two repositories follows GitOps principles.
This repository is part of the Automated and Scalable Data Platform project, developed as a capstone for the Google Africa Data Engineering Bootcamp (Q3-2023). The project implements an end-to-end data pipeline to analyze user behavior in the context of movie reviews, leveraging modern data engineering tools and cloud platforms.
This application repository focuses solely on the Airflow application for running the data pipelines. Airflow is deployed on Google Kubernetes Engine (GKE) using the Kubernetes Executor for scalability and performance.
The required infrastructure (e.g., Kubernetes cluster, GCS buckets, BigQuery datasets) is provisioned via the DEB Infrastructure Repository.
The goal of this project is to design and implement an end-to-end solution for collecting, processing, and analyzing movie review data, providing actionable insights through a robust data pipeline. The project involves integrating multiple data sources, applying business rules, and delivering insights via an OLAP-ready data warehouse.
- **Data Integration:** Collect data from external vendors and internal sources, including:
  - User purchase records from PostgreSQL.
  - Daily review data in CSV files.
  - Log session metadata in CSV files.
- **Data Processing:**
  - Clean and transform raw data.
  - Classify sentiments in movie reviews using a custom Hugging Face model (a sketch follows this list).
  - Generate dimensional and fact tables in BigQuery for downstream analytics.
- **Pipeline Orchestration:**
  - Automate ETL/ELT workflows using Apache Airflow.
- **Business Insights:**
  - Create an OLAP-ready `fact_movie_analytics` table for querying insights such as:
    - Device usage trends.
    - Geographic distribution of reviews.
    - Sentiment analysis outcomes.
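As a rough sketch of the sentiment-classification step, the snippet below uses the Hugging Face `transformers` pipeline API. The model id shown is a public placeholder; the project uses its own fine-tuned BERT checkpoint, whose identifier is not listed here.

```python
# Minimal sketch of the sentiment-classification step.
# The model id below is a public placeholder; swap in the
# project's fine-tuned BERT checkpoint.
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

reviews = [
    "An instant classic. The pacing was perfect.",
    "Two hours of my life I will never get back.",
]

for review, result in zip(reviews, classifier(reviews)):
    # Each result looks like {"label": "POSITIVE", "score": 0.99}.
    print(f'{result["label"]:<8} {result["score"]:.3f}  {review}')
```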
- **Orchestration:** Apache Airflow manages and schedules all pipeline tasks.
- **Data Processing:**
  - PySpark on Dataproc for large-scale transformations.
  - Sentiment analysis using a fine-tuned BERT model.
- **Storage:**
  - Google Cloud Storage (GCS) for raw and staged data.
  - BigQuery for structured and analytical data.
- **Data Modeling:** dbt (data build tool) for dimensional modeling.
- **CI/CD:** GitHub Actions for automated deployments.
- **Infrastructure as Code:** Terraform for provisioning GCP resources.
- **Raw Data Ingestion:**
  - Load user purchase data into PostgreSQL.
  - Upload movie reviews and log session data to GCS.
- **Data Processing:**
  - Transform data using PySpark.
  - Perform sentiment analysis on reviews.
- **Staging and Modeling:**
  - Load cleaned data into BigQuery staging tables.
  - Generate dimensional and fact tables using dbt.
- **Insights Delivery:**
  - Expose analytical queries through the `fact_movie_analytics` table. (A minimal Airflow DAG wiring these stages together is sketched below.)
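To make the orchestration concrete, here is a minimal Airflow DAG skeleton wiring the four stages together in sequence. The DAG id and placeholder callables are illustrative; the actual DAGs in this repository use GCP operators (Dataproc, BigQuery, GCS) rather than plain Python tasks.

```python
# Illustrative DAG skeleton for the four pipeline stages; names are
# hypothetical and the real DAGs use GCP-specific operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _placeholder(stage: str) -> None:
    # Stand-in for the real ingestion/processing/modeling logic.
    print(f"running stage: {stage}")

with DAG(
    dag_id="movie_analytics_pipeline",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="raw_data_ingestion",
        python_callable=_placeholder,
        op_args=["ingest"],
    )
    process = PythonOperator(
        task_id="data_processing",
        python_callable=_placeholder,
        op_args=["process"],
    )
    model = PythonOperator(
        task_id="staging_and_modeling",
        python_callable=_placeholder,
        op_args=["model"],
    )
    deliver = PythonOperator(
        task_id="insights_delivery",
        python_callable=_placeholder,
        op_args=["deliver"],
    )

    # The stages run strictly in sequence.
    ingest >> process >> model >> deliver
```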
- **Programming Languages:** Python, SQL
- **Data Orchestration:** Apache Airflow
- **Data Processing:** PySpark, Dataproc
- **Data Modeling:** dbt
- **Storage:** Google Cloud SQL (PostgreSQL), GCS, BigQuery
- **Infrastructure as Code:** Terraform
- **CI/CD:** GitHub Actions
- **Sentiment Analysis:** Hugging Face BERT model
The deployment and management of the application are automated through a GitHub Actions workflow. The steps are outlined below:
- **Build and Push Custom Airflow Image:**
  - The `build-push.yaml` workflow builds a custom Airflow image and pushes it to Google Artifact Registry.
  - The workflow then checks out the DEB Infrastructure repository and updates the `airflow-helm/values-dev.yaml` file with the new image and tag.
- **Secrets Management:**
  - Configure GitHub secrets for:
    - Your GitHub email and username.
    - A Personal Access Token for repository-scoped access.
    - The necessary GCP credentials (Workload Identity Federation).
- **Trigger the Workflow:**
  - Commit changes to `main` or a feature branch and push them to GitHub. The CI/CD workflow will automatically deploy the updated Airflow instance and pipeline.
- **Infrastructure Setup:**
  - The DEB Infrastructure repository manages the underlying cloud resources using Terraform.
- **Access the Airflow UI:**
  - Navigate to the external IP of the Airflow webserver, as shown in the Kubernetes Services list or via the `airflow-webserver` endpoint in the GCP console. (A programmatic lookup is sketched below.)
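For convenience, the external IP can also be looked up programmatically. Here is a minimal sketch using the official `kubernetes` Python client, assuming the webserver service is named `airflow-webserver` (as referenced above) and deployed in an `airflow` namespace — the namespace is an assumption — with `kubectl` already authenticated against the GKE cluster.

```python
# Hypothetical lookup of the webserver's external IP with the official
# kubernetes client. The "airflow" namespace is an assumption; the
# service name matches the airflow-webserver endpoint mentioned above.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl already targets the GKE cluster
v1 = client.CoreV1Api()
svc = v1.read_namespaced_service(name="airflow-webserver", namespace="airflow")
for ingress in svc.status.load_balancer.ingress or []:
    # Airflow's webserver listens on port 8080 by default.
    print(f"Airflow UI: http://{ingress.ip}:8080")
```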
The analytics aspect of this project is centered on transforming raw user data, movie reviews, and session logs into actionable insights stored in an OLAP-ready `fact_movie_analytics` table. This table is designed to support business intelligence tools and ad hoc queries for decision-making.

The `fact_movie_analytics` table is the core analytical table in this pipeline. It aggregates metrics such as review scores, purchase amounts, and session details. The schema is defined as follows (a PySpark sketch of how the table could be assembled appears after it):
| Column Name | Description |
|---|---|
| `customerid` | Unique ID of the customer (from user purchase data). |
| `id_dim_devices` | Foreign key linking to the `dim_devices` table. |
| `id_dim_location` | Foreign key linking to the `dim_location` table. |
| `id_dim_os` | Foreign key linking to the `dim_os` table. |
| `amount_spent` | Total amount spent by the customer (calculated as `quantity * unit_price`). |
| `review_score` | Aggregate sentiment score of the customer's reviews. |
| `review_count` | Total number of reviews written by the customer. |
| `insert_date` | Timestamp of when the record was inserted into the fact table. |
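As a rough illustration of how this fact table could be assembled, the PySpark sketch below aggregates spending and review metrics per customer. The GCS paths, the `positive_review` column, and the join logic are assumptions for illustration (dimension keys are omitted for brevity); this is not the repository's actual job.

```python
# Hypothetical sketch of assembling fact_movie_analytics with PySpark.
# Bucket paths, the positive_review column, and join keys are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fact_movie_analytics").getOrCreate()

# Staged inputs (placeholder GCS paths).
purchases = spark.read.parquet("gs://<staging-bucket>/user_purchases/")
reviews = spark.read.parquet("gs://<staging-bucket>/classified_reviews/")

# amount_spent = quantity * unit_price, summed per customer.
spending = purchases.groupBy("customerid").agg(
    F.sum(F.col("quantity") * F.col("unit_price")).alias("amount_spent")
)

# review_score aggregates sentiment labels (assumed 0/1 per review);
# review_count tallies the customer's reviews.
review_stats = reviews.groupBy("customerid").agg(
    F.sum("positive_review").alias("review_score"),
    F.count("*").alias("review_count"),
)

fact = (
    spending.join(review_stats, on="customerid", how="inner")
    .withColumn("insert_date", F.current_timestamp())
)
fact.show(5)
```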
The fact table references several dimension tables to provide additional context. These tables include:
**`dim_date`**

| Column Name | Description |
|---|---|
| `id_dim_date` | Surrogate key for the dimension. |
| `log_date` | Date of the review session. |
| `day` | Day extracted from `log_date`. |
| `month` | Month extracted from `log_date`. |
| `year` | Year extracted from `log_date`. |
| `season` | Season inferred from `log_date` (see the sketch after this table). |
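The `season` attribute is derived rather than read from the source. A minimal sketch of one way to infer it from `log_date`, assuming Northern Hemisphere meteorological seasons (the project's actual convention is not documented here):

```python
# Hypothetical month-to-season mapping for dim_date; the project's actual
# convention (hemisphere, season boundaries) is an assumption here.
from datetime import date

_SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def infer_season(log_date: date) -> str:
    """Return the meteorological season for a given date."""
    return _SEASONS[log_date.month]

print(infer_season(date(2023, 7, 14)))  # -> summer
```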
**`dim_devices`**

| Column Name | Description |
|---|---|
| `id_dim_devices` | Surrogate key for the dimension. |
| `device` | Type of device used for the review (e.g., mobile). |
**`dim_location`**

| Column Name | Description |
|---|---|
| `id_dim_location` | Surrogate key for the dimension. |
| `location` | Geographic location of the reviewer. |
**`dim_os`**

| Column Name | Description |
|---|---|
| `id_dim_os` | Surrogate key for the dimension. |
| `os` | Operating system of the device (e.g., Windows). |
**`dim_browser`**

| Column Name | Description |
|---|---|
| `id_dim_browser` | Surrogate key for the dimension. |
| `browser` | Browser used for the review session. |
The following are some example metrics and insights that can be derived from the data:
- **Review Activity:**
  - Total number of reviews across various locations.
  - Device-specific review activity in different regions.
- **Geographic Analysis:**
  - States with the highest and lowest review activity.
  - Distribution of reviews by device type in different U.S. regions.
- **Sentiment Insights:**
  - Overall positive vs. negative sentiment trends.
  - Most positively reviewed locations or devices.
- **Spending Patterns:**
  - Correlation between spending patterns and review sentiment.
  - Total revenue generated by customers who left positive reviews.
Here are a few sample SQL queries for deriving insights:
**Reviews by device in selected states:**

```sql
SELECT
    l.location AS state,
    d.device,
    SUM(f.review_count) AS total_reviews  -- SUM, since review_count is a per-customer total
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
JOIN dim_devices d ON f.id_dim_devices = d.id_dim_devices
WHERE l.location IN ('California', 'New York', 'Texas')
GROUP BY l.location, d.device
ORDER BY total_reviews DESC;
```
**Positive-review totals by state:**

```sql
SELECT
    l.location AS state,
    SUM(f.review_score) AS positive_reviews
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
GROUP BY l.location
ORDER BY positive_reviews DESC;
```
**Spending vs. positive reviews per customer:**

```sql
SELECT
    f.customerid,
    SUM(f.amount_spent) AS total_spent,
    SUM(f.review_score) AS total_positive_reviews
FROM fact_movie_analytics f
GROUP BY f.customerid
ORDER BY total_spent DESC;
```
**Review distribution by location and device:**

```sql
SELECT
    l.location AS region,
    d.device,
    SUM(f.review_count) AS review_count  -- SUM, since review_count is a per-customer total
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
JOIN dim_devices d ON f.id_dim_devices = d.id_dim_devices
GROUP BY l.location, d.device
ORDER BY review_count DESC;
```
**Top 10 customers by total spend:**

```sql
SELECT
    f.customerid,
    SUM(f.amount_spent) AS total_spent
FROM fact_movie_analytics f
GROUP BY f.customerid
ORDER BY total_spent DESC
LIMIT 10;
```
**Monthly positive-review trend:**

```sql
SELECT
    d.year,
    d.month,
    SUM(f.review_score) AS positive_reviews
FROM fact_movie_analytics f
-- Cast the TIMESTAMP down to DATE so it can match log_date.
JOIN dim_date d ON DATE(f.insert_date) = d.log_date
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
```
**Average spending by state:**

```sql
SELECT
    l.location AS state,
    AVG(f.amount_spent) AS avg_spending
FROM fact_movie_analytics f
JOIN dim_location l ON f.id_dim_location = l.id_dim_location
GROUP BY l.location
ORDER BY avg_spending DESC;
```