Mini Project 11: Data Pipeline Using Databricks

Project Overview

This project demonstrates the creation of an ETL (Extract, Transform, Load) pipeline using Databricks. The goal of the pipeline is to:

  1. Extract data from a data source (a CSV file).
  2. Perform transformations to clean and enrich the data.
  3. Load the processed data into a data sink for storage and further use.

Key Concepts

  • Data Source: The starting point of the pipeline where raw data is stored (in this case, the CSV file nba_games_stats.csv).
  • Data Transformations: Intermediate steps where the raw data is cleaned, structured, and enriched.
  • Data Sink: The final destination for the processed data (in this project, a transformed CSV file stored in the Databricks FileStore).
  • Data Pipeline: The entire process that connects the data source, transformations, and the data sink.

Pipeline Steps

1. Extract

  • Description: Data is extracted from the CSV file nba_games_stats.csv.
  • Implementation: The file is read into a Spark DataFrame using the spark.read.csv() function (see the sketch below).
  • File Location: dbfs:/FileStore/tables/nmk_43_pipeline/nba_games_stats.csv.
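
A minimal sketch of what this extract step might look like in pipeline.py; the exact read options (header, inferSchema) and the application name are assumptions, not confirmed details of the script:

```python
from pyspark.sql import SparkSession

# Databricks notebooks already provide a SparkSession as `spark`; building one
# here keeps the sketch self-contained if the script is run outside a notebook.
spark = SparkSession.builder.appName("nba_games_etl").getOrCreate()

SOURCE_PATH = "dbfs:/FileStore/tables/nmk_43_pipeline/nba_games_stats.csv"

# Read the raw CSV into a Spark DataFrame, treating the first row as a header
# and letting Spark infer column types.
raw_df = spark.read.csv(SOURCE_PATH, header=True, inferSchema=True)
raw_df.printSchema()
```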

2. Transform

  • Description: The extracted data undergoes the following transformations (see the sketch after this list):
    1. Clean column names to ensure consistency.
    2. Calculate the point difference between teams (PointDifference column).
    3. Identify the winner of each game (Winner column).
    4. Classify games as home or away (HomeGame and AwayGame columns).

3. Load

  • Description: The transformed data is saved as a CSV file to the Databricks FileStore for further use (see the sketch below).
  • File Location: dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output.
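
A minimal sketch of the load step, assuming the transformed DataFrame from the previous step. The coalesce(1) call and the header option are readability choices made for this sketch, not confirmed details of pipeline.py:

```python
SINK_PATH = "dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output"

# Write the transformed data to the FileStore as CSV.
# coalesce(1) collapses the result into a single part file for easy inspection;
# mode("overwrite") makes the pipeline safe to re-run.
(
    transformed_df
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", True)
    .csv(SINK_PATH)
)
```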

Project Structure

.
├── dataset/
│   └── nba_games_stats.csv      # Data source: basketball game statistics
├── project/
│   ├── pipeline.py              # Main ETL pipeline script
│   └── transformed_output/      # Folder for processed data output
├── README.md                    # Project documentation
├── Makefile                     # Makefile for CI/CD or setup tasks
├── requirements.txt             # Python dependencies
└── setup.sh                     # Shell script for environment setup

How to Run the Pipeline

1. Prerequisites

  • A running Databricks cluster with access to the Databricks FileStore.

2. Steps to Execute

  1. Upload the Dataset:

    • Navigate to the Data tab in Databricks.
    • Upload the file nba_games_stats.csv to the FileStore (the pipeline reads it from dbfs:/FileStore/tables/nmk_43_pipeline/).
  2. Run the Notebook:

    • Create a notebook in Databricks and copy the code from pipeline.py into it.
    • Attach the notebook to your cluster.
    • Run the notebook cells step by step.
  3. Verify the Output:

    • Transformed data is saved to dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output (a verification sketch follows below).
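
One way to verify the output from a notebook cell, as a sketch; `dbutils` is available by default in Databricks notebooks, and the column names checked here match the transformations described above:

```python
OUTPUT_PATH = "dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output"

# List the files written by the load step.
for f in dbutils.fs.ls(OUTPUT_PATH):
    print(f.name, f.size)

# Read the output back and confirm the new columns and row count.
check_df = spark.read.csv(OUTPUT_PATH, header=True)
check_df.select("PointDifference", "Winner", "HomeGame", "AwayGame").show(5)
print(f"Rows written: {check_df.count()}")
```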

Proof of Pipeline Success

  1. Proof of the script running smoothly and the output file being saved (Success Run screenshot).

  2. Proof of storage in the FileStore (Storage Success screenshot).

Cluster Settings

  • Cluster Name: Nzarama Kouadio's Cluster
  • Runtime: Databricks Runtime 16.0.x-cpu-ml-scala2.12
  • Driver Type: Standard_DS3_v2
  • Worker Type: Standard_DS3_v2

