- This project uses PySpark to process a large dataset, focusing on running Spark SQL queries and performing data transformations. I am working with IBM's employee attrition dataset for these tasks.
- Use PySpark to perform data processing on a large dataset
- Include at least one Spark SQL query and one data transformation
- PySpark script
- Output data or summary report (PDF or Markdown)
- Run Codespaces
- Set up the PySpark operating environment
- Dataset: Employee Attrition data provided by IBM
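To confirm the environment is ready, a minimal check can be run (this assumes PySpark was installed with `pip install pyspark`; it is a sanity check, not part of the project code):

```python
# Minimal check: PySpark imports and a local session starts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EnvCheck").getOrCreate()
print("Spark version:", spark.version)
spark.stop()
```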
- `extract`: Downloads the dataset from the specified URL
- `start_spark`: Initiates a Spark session
- `load_data`: Loads the dataset from a CSV file into a Spark DataFrame, selecting only 7 of the 36 columns and creating sample data
- `describe`: Generates descriptive statistics (e.g., count, mean)
- `query`: Runs a SQL query on the dataset using Spark SQL, based on the Attrition values ('Yes', 'No')
- `example_transform`: Transforms the dataset by indexing categorical variables as integers
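A minimal sketch of how these functions might look (the dataset URL, file path, and the seven selected columns are illustrative assumptions, not necessarily the exact values used in this repo):

```python
import requests
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer


def extract(url, file_path="data/employee_attrition.csv"):
    """Download the dataset from the specified URL to a local CSV."""
    response = requests.get(url)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)
    return file_path


def start_spark(app_name="EmployeeAttrition"):
    """Initiate a Spark session."""
    return SparkSession.builder.appName(app_name).getOrCreate()


def load_data(spark, file_path="data/employee_attrition.csv"):
    """Load the CSV into a Spark DataFrame, keeping 7 of the 36 columns."""
    df = spark.read.csv(file_path, header=True, inferSchema=True)
    # Illustrative subset of columns; the repo's version also creates
    # sample data, which is omitted in this sketch.
    columns = ["Age", "Attrition", "Department", "Gender",
               "JobRole", "MonthlyIncome", "YearsAtCompany"]
    return df.select(*columns)


def describe(df):
    """Generate descriptive statistics (count, mean, stddev, min, max)."""
    df.describe().show()


def query(spark, df):
    """Run a Spark SQL query grouped by the Attrition values ('Yes', 'No')."""
    df.createOrReplaceTempView("employees")
    return spark.sql(
        "SELECT Attrition, COUNT(*) AS n, AVG(MonthlyIncome) AS avg_income "
        "FROM employees GROUP BY Attrition"
    )


def example_transform(df):
    """Index a categorical column as integers with StringIndexer."""
    indexer = StringIndexer(inputCol="Department", outputCol="DepartmentIndex")
    return indexer.fit(df).transform(df)
```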
- Both `lib.py` and `main.py` generated logs. However, because the same process was repeated in `main.py`, this caused partial duplication in the log. To address this, `main.py` now deletes the previous log and rewrites it (sketched below).
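One way to implement the rewrite (a hypothetical sketch; the `output.md` log name and the helper names are assumptions, not the repo's actual code):

```python
LOG_FILE = "output.md"  # assumed log location


def reset_log():
    """Truncate the log so a rerun starts clean (deletes previous contents)."""
    open(LOG_FILE, "w").close()


def log_output(operation, output):
    """Append one step's results to the Markdown log."""
    with open(LOG_FILE, "a") as f:
        f.write(f"## {operation}\n\n{output}\n\n")
```

Calling `reset_log()` once at the top of `main.py` prevents the duplicated log entries described above.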