This project demonstrates the creation of an ETL (Extract, Transform, Load) pipeline using Databricks. The goal of the pipeline is to:
- Extract data from a data source (a CSV file).
- Perform transformations to clean and enrich the data.
- Load the processed data into a data sink for storage and further use.
- Data Source: The starting point of the pipeline where raw data is stored (in this case, the CSV file `nba_games_stats.csv`).
- Data Transformations: Intermediate steps where the raw data is cleaned, structured, and enriched.
- Data Sink: The final destination for the processed data (in this project, a transformed CSV file stored in the Databricks FileStore).
- Data Pipeline: The entire process that connects the data source, transformations, and the data sink.
- Description: Data is extracted from the CSV file `nba_games_stats.csv`.
- Implementation: The file is read into a Spark DataFrame using the `spark.read.csv()` function, as sketched below.
- File Location: `dbfs:/FileStore/tables/nmk_43_pipeline/nba_games_stats.csv`.
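A minimal sketch of this step, assuming a header row and Spark's schema inference (these reader options are assumptions, not confirmed from `pipeline.py`):

```python
from pyspark.sql import SparkSession

# In a Databricks notebook, `spark` already exists; building a session here
# only matters for standalone runs.
spark = SparkSession.builder.appName("nba-etl-pipeline").getOrCreate()

# Extract: read the raw game stats into a Spark DataFrame.
raw_df = spark.read.csv(
    "dbfs:/FileStore/tables/nmk_43_pipeline/nba_games_stats.csv",
    header=True,       # first row holds column names
    inferSchema=True,  # let Spark guess column types (assumed setting)
)
```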
- Description: The extracted data undergoes the following transformations (sketched in the code after this list):
  - Clean column names to ensure consistency.
  - Calculate the point difference between teams (`PointDifference` column).
  - Identify the winner of each game (`Winner` column).
  - Classify games as home or away (`HomeGame` and `AwayGame` columns).
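A sketch of these transformations. The input column names (`Team`, `Opponent`, `TeamPoints`, `OpponentPoints`, `Home`) are assumptions about the dataset's schema, not taken from the project code:

```python
from pyspark.sql import functions as F

# Clean column names: strip whitespace and drop characters Spark rejects in names.
cleaned_df = raw_df.toDF(
    *[c.strip().replace(" ", "").replace(".", "") for c in raw_df.columns]
)

transformed_df = (
    cleaned_df
    # Point difference between the team and its opponent (assumed score columns).
    .withColumn("PointDifference", F.col("TeamPoints") - F.col("OpponentPoints"))
    # Winner of each game, decided by the higher score.
    .withColumn(
        "Winner",
        F.when(F.col("TeamPoints") > F.col("OpponentPoints"), F.col("Team"))
         .otherwise(F.col("Opponent")),
    )
    # Home/away flags, assuming a `Home` column holding "Home" or "Away".
    .withColumn("HomeGame", (F.col("Home") == "Home").cast("int"))
    .withColumn("AwayGame", (F.col("Home") == "Away").cast("int"))
)
```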
- Description: The transformed data is saved as a CSV file to the Databricks FileStore for further use, as sketched below.
- File Location: `dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output`.
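A minimal sketch of the load step; `coalesce(1)`, overwrite mode, and the header option are assumed settings rather than the project's confirmed configuration:

```python
# Load: write the transformed data to the FileStore as CSV.
(
    transformed_df
    .coalesce(1)              # single output file instead of one per partition
    .write
    .mode("overwrite")        # replace any previous run's output
    .option("header", True)   # keep column names in the output file
    .csv("dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output")
)
```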
```
.
├── dataset/
│   └── nba_games_stats.csv    # Data source: basketball game statistics
├── project/
│   ├── pipeline.py            # Main ETL pipeline script
│   └── transformed_output/    # Folder for processed data output
├── README.md                  # Project documentation
├── Makefile                   # Makefile for CI/CD or setup tasks
├── requirements.txt           # Python dependencies
└── setup.sh                   # Shell script for environment setup
```
- A running Databricks cluster with access to the Databricks FileStore.
- Upload the Dataset:
  - Navigate to the Data tab in Databricks.
  - Upload the file `nba_games_stats.csv` to the FileStore at `/FileStore/tables/`.
- Run the Notebook:
  - Create a notebook in Databricks and copy in the `pipeline.py` code.
  - Attach the notebook to your cluster.
  - Run the notebook cells step by step.
- Verify the Output:
  - Confirm that the transformed data is saved to `dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output` (see the sketch below).
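One way to spot-check the run from a notebook cell, assuming the paths above. `dbutils.fs.ls` and `display` are standard Databricks notebook utilities; the column selection follows the transformation step:

```python
# List the pipeline folder to confirm the output directory exists.
display(dbutils.fs.ls("dbfs:/FileStore/tables/nmk_43_pipeline/"))

# Read the transformed output back and spot-check the new columns.
check_df = spark.read.csv(
    "dbfs:/FileStore/tables/nmk_43_pipeline/transformed_output",
    header=True,
)
check_df.select("PointDifference", "Winner", "HomeGame", "AwayGame").show(5)
```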
- Proof of the script running smoothly and the file being saved: see the "Success Run" screenshot.
- Proof of storage: see the "Storage Success" screenshot.
- Cluster Name: Nzarama Kouadio's Cluster
- Runtime: Databricks Runtime 16.0.x-cpu-ml-scala2.12
- Driver Type: Standard_DS3_v2
- Worker Type: Standard_DS3_v2