```
├── .devcontainer/
│   ├── devcontainer.json
│   └── Dockerfile
├── .github/
│   └── workflows/
│       └── cicd.yml
├── mylib/
├── Notebook_Folder/
├── visuals/
├── .gitignore
├── main.py
├── test_main.py
├── Makefile
└── README.md
```
This project implements a Databricks ETL pipeline that retrieves and processes an airline safety dataset. Key features include a well-documented ETL notebook, Delta Lake for storage, Spark SQL for data transformations, and robust error handling with data validation. It incorporates data visualizations for insights and uses an automated Databricks API trigger for continuous processing.
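A minimal sketch of what one such ETL step might look like inside the Databricks notebook. The DBFS path, Delta table names, and column names (`incidents_85_99`, `incidents_00_14`) are illustrative assumptions, not values taken from this repository:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Extract: read the raw CSV with a header row and inferred column types.
# (The DBFS path below is a placeholder.)
raw_df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("dbfs:/FileStore/airline-safety.csv")
)

# Basic validation / error handling: fail fast on an empty load.
if raw_df.count() == 0:
    raise ValueError("airline-safety.csv loaded zero rows")

# Load: persist the raw data as a Delta table so downstream steps are transactional.
raw_df.write.format("delta").mode("overwrite").saveAsTable("airline_safety_raw")

# Transform: use Spark SQL to compare incident counts across the two reporting periods.
transformed_df = spark.sql(
    """
    SELECT airline,
           incidents_85_99,
           incidents_00_14,
           incidents_00_14 - incidents_85_99 AS incident_change
    FROM airline_safety_raw
    ORDER BY incident_change
    """
)
transformed_df.write.format("delta").mode("overwrite").saveAsTable("airline_safety_transformed")
```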
The workflow uses a Makefile to automate tasks such as installation, testing, formatting (Python Black), and linting (Ruff), plus an all-in-one target; these targets are run automatically via GitHub Actions to improve efficiency and code quality.
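As a rough sketch of what such a Makefile might contain (the target names, flags, and the `requirements.txt` file are assumptions based on the tools named above, not copied from the repository):

```make
# Hypothetical targets mirroring the workflow described above.
# (Recipe lines must be indented with tabs.)
install:
	pip install --upgrade pip && pip install -r requirements.txt

format:
	black .

lint:
	ruff check .

test:
	python -m pytest -vv test_main.py

all: install format lint test
```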
The dataset, airline-safety.csv, sourced from the Aviation Safety Network, contains safety records for 56 airlines. It details seat kilometers flown weekly and segregates incidents, fatal accidents, and fatalities into two periods: 1985–1999 and 2000–2014.
- Connect your GitHub account to the Databricks workspace
- Create a global init script that runs on cluster start to store the environment variables
- Establish a connection to the Databricks environment, using environment variables for authentication (SERVER_HOSTNAME, ACCESS_TOKEN, and JOB_ID); see the sketch after this list
- Create a Databricks cluster that supports PySpark
- Clone the GitHub repo into the Databricks workspace
- Create a job on Databricks to build the ETL pipeline
- Run the job and confirm that it completes successfully (you should see the picture below)
- Push changes to GitHub
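As a rough illustration of the API trigger step, the snippet below calls the Databricks Jobs `run-now` endpoint using the three environment variables named above; the error handling and API version are assumptions, not the repository's actual code:

```python
import os
import requests

# Read the credentials stored by the cluster init script / repository secrets.
host = os.environ["SERVER_HOSTNAME"]      # e.g. the workspace hostname
token = os.environ["ACCESS_TOKEN"]        # Databricks personal access token
job_id = int(os.environ["JOB_ID"])        # ID of the ETL job created above

# Trigger the job through the Databricks Jobs API (run-now endpoint).
response = requests.post(
    f"https://{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": job_id},
    timeout=30,
)
response.raise_for_status()
print("Triggered run:", response.json().get("run_id"))
```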