```
├── .devcontainer/
│   ├── devcontainer.json
│   └── Dockerfile
├── .github/
│   └── workflows/
│       └── cicd.yml
├── mylib/
├── Notebook_Folder/
├── visuals/
├── .gitignore
├── main.py
├── test_main.py
├── Makefile
└── README.md
```
This project implements a Databricks ETL pipeline that retrieves and processes an airline safety dataset. Key features include a well-documented ETL notebook, Delta Lake for storage, Spark SQL for data transformations, and robust error handling with data validation. It incorporates data visualizations for insights and uses an automated Databricks API trigger for continuous processing.
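A minimal sketch of what one such ETL step might look like inside the Databricks notebook. The DBFS path, Delta table names, and column names (`incidents_85_99`, `incidents_00_14`) are illustrative assumptions, not values taken from this repository:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Extract: read the raw CSV with a header row and inferred column types.
# (The DBFS path below is a placeholder.)
raw_df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("dbfs:/FileStore/airline-safety.csv")
)

# Basic validation / error handling: fail fast on an empty load.
if raw_df.count() == 0:
    raise ValueError("airline-safety.csv loaded zero rows")

# Load: persist the raw data as a Delta table so downstream steps are transactional.
raw_df.write.format("delta").mode("overwrite").saveAsTable("airline_safety_raw")

# Transform: use Spark SQL to compare incident counts across the two reporting periods.
transformed_df = spark.sql(
    """
    SELECT airline,
           incidents_85_99,
           incidents_00_14,
           incidents_00_14 - incidents_85_99 AS incident_change
    FROM airline_safety_raw
    ORDER BY incident_change
    """
)
transformed_df.write.format("delta").mode("overwrite").saveAsTable("airline_safety_transformed")
```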
The workflow uses a Makefile to automate tasks such as installation, testing, formatting (Python Black), and linting (Ruff), plus an all-in-one target; these targets are run automatically via GitHub Actions to improve efficiency and code quality.
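As a rough sketch of what such a Makefile might contain (the target names, flags, and the `requirements.txt` file are assumptions based on the tools named above, not copied from the repository):

```make
# Hypothetical targets mirroring the workflow described above.
# (Recipe lines must be indented with tabs.)
install:
	pip install --upgrade pip && pip install -r requirements.txt

format:
	black .

lint:
	ruff check .

test:
	python -m pytest -vv test_main.py

all: install format lint test
```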
The dataset, airline-safety.csv, sourced from the Aviation Safety Network, contains safety records for 56 airlines. It details seat kilometers flown weekly and segregates incidents, fatal accidents, and fatalities into two periods: 1985–1999 and 2000–2014.
- Connect your GitHub account to the Databricks workspace
- Create a global init script that runs on cluster start to store the environment variables
- Establish a connection to the Databricks environment, using environment variables for authentication (SERVER_HOSTNAME, ACCESS_TOKEN, and JOB_ID); see the sketch after this list
- Create a Databricks cluster that supports PySpark
- Clone the GitHub repo into the Databricks workspace
- Create a job on Databricks to build the ETL pipeline
- Run the job and confirm that it completes successfully (you should see the picture below)
- Push changes to GitHub
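As a rough illustration of the API trigger step, the snippet below calls the Databricks Jobs `run-now` endpoint using the three environment variables named above; the error handling and API version are assumptions, not the repository's actual code:

```python
import os
import requests

# Read the credentials stored by the cluster init script / repository secrets.
host = os.environ["SERVER_HOSTNAME"]      # e.g. the workspace hostname
token = os.environ["ACCESS_TOKEN"]        # Databricks personal access token
job_id = int(os.environ["JOB_ID"])        # ID of the ETL job created above

# Trigger the job through the Databricks Jobs API (run-now endpoint).
response = requests.post(
    f"https://{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": job_id},
    timeout=30,
)
response.raise_for_status()
print("Triggered run:", response.json().get("run_id"))
```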