- This project uses PySpark to process a large dataset, focusing on running Spark SQL queries and performing data transformations. I am working with IBM's employee attrition dataset for these tasks.
- Use PySpark to perform data processing on a large dataset
- Include at least one Spark SQL query and one data transformation
- PySpark script
- Output data or summary report (PDF or Markdown)
- Run Codespaces
- Set up the PySpark operating environment
- Dataset: Employee Attrition data provided by IBM
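To confirm the environment is ready, a minimal check can be run (this assumes PySpark was installed with `pip install pyspark`; it is a sanity check, not part of the project code):

```python
# Minimal check: PySpark imports and a local session starts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EnvCheck").getOrCreate()
print("Spark version:", spark.version)
spark.stop()
```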
- `extract`: Downloads the dataset from the specified URL
- `start_spark`: Initiates a Spark session
- `load_data`: Loads the dataset from a CSV file into a Spark DataFrame, selecting only 7 of the 36 columns and creating sample data
- `describe`: Generates descriptive statistics (e.g., count, mean)
- `query`: Runs a SQL query on the dataset using Spark SQL, based on the Attrition values ('Yes', 'No')
- `example_transform`: Transforms the dataset by indexing categorical variables as integers
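A minimal sketch of how these functions might look (the dataset URL, file path, and the seven selected columns are illustrative assumptions, not necessarily the exact values used in this repo):

```python
import requests
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer


def extract(url, file_path="data/employee_attrition.csv"):
    """Download the dataset from the specified URL to a local CSV."""
    response = requests.get(url)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)
    return file_path


def start_spark(app_name="EmployeeAttrition"):
    """Initiate a Spark session."""
    return SparkSession.builder.appName(app_name).getOrCreate()


def load_data(spark, file_path="data/employee_attrition.csv"):
    """Load the CSV into a Spark DataFrame, keeping 7 of the 36 columns."""
    df = spark.read.csv(file_path, header=True, inferSchema=True)
    # Illustrative subset of columns; the repo's version also creates
    # sample data, which is omitted in this sketch.
    columns = ["Age", "Attrition", "Department", "Gender",
               "JobRole", "MonthlyIncome", "YearsAtCompany"]
    return df.select(*columns)


def describe(df):
    """Generate descriptive statistics (count, mean, stddev, min, max)."""
    df.describe().show()


def query(spark, df):
    """Run a Spark SQL query grouped by the Attrition values ('Yes', 'No')."""
    df.createOrReplaceTempView("employees")
    return spark.sql(
        "SELECT Attrition, COUNT(*) AS n, AVG(MonthlyIncome) AS avg_income "
        "FROM employees GROUP BY Attrition"
    )


def example_transform(df):
    """Index a categorical column as integers with StringIndexer."""
    indexer = StringIndexer(inputCol="Department", outputCol="DepartmentIndex")
    return indexer.fit(df).transform(df)
```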
- Both `lib.py` and `main.py` generated logs. However, because the same process was repeated in `main.py`, this caused partial duplication in the log. To address this, `main.py` now deletes the previous log and rewrites it (sketched below).
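One way to implement the rewrite (a hypothetical sketch; the `output.md` log name and the helper names are assumptions, not the repo's actual code):

```python
LOG_FILE = "output.md"  # assumed log location


def reset_log():
    """Truncate the log so a rerun starts clean (deletes previous contents)."""
    open(LOG_FILE, "w").close()


def log_output(operation, output):
    """Append one step's results to the Markdown log."""
    with open(LOG_FILE, "a") as f:
        f.write(f"## {operation}\n\n{output}\n\n")
```

Calling `reset_log()` once at the top of `main.py` prevents the duplicated log entries described above.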