- The purpose of this project is to build an ETL-Query (Extract, Transform, Load, Query) pipeline with Python and Databricks
- I imported the employee attrition dataset and loaded it into a Databricks Delta table
- I established a connection to Databricks and transformed the data, implementing the required data processing via Spark SQL queries
- Create a data pipeline using Databricks
- Include at least one data source and one data sink
- Databricks notebook or script
- Document demonstrating the pipeline
- Setting up the Databricks-GitHub connection via API access tokens (a token-check sketch follows the dataset note below)
- Dataset: 'Employee Attrition data' provided by IBM (reusing the path uploaded in the previous repository)
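
A quick way to confirm the access token works is a call to the Databricks REST API. The snippet below is only a sketch (not part of this repo) and assumes the workspace URL and personal access token are exported as the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`:

```python
# Sketch: verify a Databricks personal access token before linking this repo
# via Databricks Repos. DATABRICKS_HOST / DATABRICKS_TOKEN are assumed env vars.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # the personal access token

# /api/2.0/repos lists Git repos already linked to the workspace
resp = requests.get(f"{host}/api/2.0/repos",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.json())
```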
- `extract`: Downloads a dataset from a specified URL, cleans column names, and saves it as a Delta table in a specified Databricks database.
- `transform`: Handles missing values, saves the cleaned data as a new Delta table, and exports it to DBFS as the data sink.
- `load`: Checks whether the table exists in the Databricks database, executes queries, and displays the results using Spark.
- `query`: Filters selected columns from a table and saves the result as a new Delta table in the Databricks database.
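
A minimal sketch of what `extract` and `load` could look like in the notebook, assuming a Databricks environment where a Spark session is available; the URL, database, and table names here are placeholders rather than the repo's actual values:

```python
# Sketch of the extract and load steps, assuming a Databricks notebook.
# URL, database, and table names are placeholders.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

def extract(url, database="hr_db", table="employee_attrition_raw"):
    """Download a CSV, clean column names, and save it as a Delta table."""
    pdf = pd.read_csv(url)
    # Delta/Hive column names cannot contain spaces or special characters
    pdf.columns = [c.strip().replace(" ", "_").replace("-", "_") for c in pdf.columns]
    df = spark.createDataFrame(pdf)
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")
    df.write.format("delta").mode("overwrite").saveAsTable(f"{database}.{table}")
    return f"{database}.{table}"

def load(database="hr_db", table="employee_attrition_raw", limit=10):
    """Check that the table exists, then run a query and display the result."""
    if not spark.catalog.tableExists(f"{database}.{table}"):  # Spark 3.3+
        raise ValueError(f"Table {database}.{table} does not exist")
    result = spark.sql(f"SELECT * FROM {database}.{table} LIMIT {limit}")
    result.show()
    return result
```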
- In the `transform` stage, a DBFS file is created for reusability of the table. This allows the cleaned dataset to be passed into the `query` stage
- In the `query` stage, SQL is used to extract only the required 7 columns and generate a new table for further analysis
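
A corresponding sketch of the `transform` and `query` stages, again with placeholder names; the seven columns listed are illustrative picks from the IBM attrition schema, not necessarily the ones selected in the notebook:

```python
# Sketch of the transform and query stages; database/table/DBFS names and the
# seven selected columns are illustrative, not necessarily those in the notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

def transform(database="hr_db",
              source_table="employee_attrition_raw",
              target_table="employee_attrition_clean",
              dbfs_path="dbfs:/FileStore/employee_attrition_clean"):
    """Handle missing values, save a cleaned Delta table, and export it to DBFS."""
    df = spark.table(f"{database}.{source_table}").dropna()
    df.write.format("delta").mode("overwrite").saveAsTable(f"{database}.{target_table}")
    # The DBFS copy serves as the data sink and lets the query stage reuse the
    # cleaned data (the actual notebook may export a CSV here instead of Delta)
    df.write.format("delta").mode("overwrite").save(dbfs_path)
    return dbfs_path

def query(database="hr_db",
          source_table="employee_attrition_clean",
          target_table="employee_attrition_subset"):
    """Use SQL to keep only the required columns and save them as a new table."""
    cols = ("Age, Attrition, Department, JobRole, "
            "MonthlyIncome, OverTime, YearsAtCompany")  # placeholder 7 columns
    subset = spark.sql(f"SELECT {cols} FROM {database}.{source_table}")
    subset.write.format("delta").mode("overwrite").saveAsTable(f"{database}.{target_table}")
    return f"{database}.{target_table}"
```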