- The purpose of this project is to build an ETL-Query (Extract, Transform, Load, Query) pipeline with Python and Databricks
- I imported the employee attrition dataset and loaded it into a Databricks Delta table
- I established a connection to Databricks and transformed the data, implementing the required data processing via Spark SQL queries
- Create a data pipeline using Databricks
- Include at least one data source and one data sink
- Databricks notebook or script
- Document demonstrating the pipeline
- Setting up the Databricks-GitHub connection via API access tokens (a token-check sketch follows the dataset note below)
- Dataset: 'Employee Attrition data' provided by IBM (reusing the path uploaded in the previous repository)
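
A quick way to confirm the access token works is a call to the Databricks REST API. The snippet below is only a sketch (not part of this repo) and assumes the workspace URL and personal access token are exported as the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`:

```python
# Sketch: verify a Databricks personal access token before linking this repo
# via Databricks Repos. DATABRICKS_HOST / DATABRICKS_TOKEN are assumed env vars.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # the personal access token

# /api/2.0/repos lists Git repos already linked to the workspace
resp = requests.get(f"{host}/api/2.0/repos",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.json())
```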
- `extract`: Downloads a dataset from a specified URL, cleans column names, and saves it as a Delta table in a specified Databricks database.
- `transform`: Handles missing values, saves the cleaned data as a new Delta table, and exports it to DBFS as the data sink.
- `load`: Checks whether the table exists in the Databricks database, executes queries, and displays the results using Spark.
- `query`: Filters selected columns from a table and saves the result as a new Delta table in the Databricks database.
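
A minimal sketch of what `extract` and `load` could look like in the notebook, assuming a Databricks environment where a Spark session is available; the URL, database, and table names here are placeholders rather than the repo's actual values:

```python
# Sketch of the extract and load steps, assuming a Databricks notebook.
# URL, database, and table names are placeholders.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

def extract(url, database="hr_db", table="employee_attrition_raw"):
    """Download a CSV, clean column names, and save it as a Delta table."""
    pdf = pd.read_csv(url)
    # Delta/Hive column names cannot contain spaces or special characters
    pdf.columns = [c.strip().replace(" ", "_").replace("-", "_") for c in pdf.columns]
    df = spark.createDataFrame(pdf)
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")
    df.write.format("delta").mode("overwrite").saveAsTable(f"{database}.{table}")
    return f"{database}.{table}"

def load(database="hr_db", table="employee_attrition_raw", limit=10):
    """Check that the table exists, then run a query and display the result."""
    if not spark.catalog.tableExists(f"{database}.{table}"):  # Spark 3.3+
        raise ValueError(f"Table {database}.{table} does not exist")
    result = spark.sql(f"SELECT * FROM {database}.{table} LIMIT {limit}")
    result.show()
    return result
```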
- In the `transform` stage, a DBFS file is created for reusability of the table. This allows the cleaned dataset to be passed into the `query` stage
- In the `query` stage, SQL is used to extract only the required 7 columns and generate a new table for further analysis
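
A corresponding sketch of the `transform` and `query` stages, again with placeholder names; the seven columns listed are illustrative picks from the IBM attrition schema, not necessarily the ones selected in the notebook:

```python
# Sketch of the transform and query stages; database/table/DBFS names and the
# seven selected columns are illustrative, not necessarily those in the notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

def transform(database="hr_db",
              source_table="employee_attrition_raw",
              target_table="employee_attrition_clean",
              dbfs_path="dbfs:/FileStore/employee_attrition_clean"):
    """Handle missing values, save a cleaned Delta table, and export it to DBFS."""
    df = spark.table(f"{database}.{source_table}").dropna()
    df.write.format("delta").mode("overwrite").saveAsTable(f"{database}.{target_table}")
    # The DBFS copy serves as the data sink and lets the query stage reuse the
    # cleaned data (the actual notebook may export a CSV here instead of Delta)
    df.write.format("delta").mode("overwrite").save(dbfs_path)
    return dbfs_path

def query(database="hr_db",
          source_table="employee_attrition_clean",
          target_table="employee_attrition_subset"):
    """Use SQL to keep only the required columns and save them as a new table."""
    cols = ("Age, Attrition, Department, JobRole, "
            "MonthlyIncome, OverTime, YearsAtCompany")  # placeholder 7 columns
    subset = spark.sql(f"SELECT {cols} FROM {database}.{source_table}")
    subset.write.format("delta").mode("overwrite").saveAsTable(f"{database}.{target_table}")
    return f"{database}.{target_table}"
```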