This repository contains my work for the Pandas Descriptive Statistics Script assignment in IDS 706. The script reads a dataset, generates summary statistics, and creates a data visualization. To use it, open the repository in a GitHub Codespace and wait for the devcontainer to run the Makefile, which executes the following tasks: install, format, lint, and test.
This repository includes the following components:
- `.devcontainer`
- `Makefile`
- `requirements.txt`
- `README.md`
- `githubactions`
- `Dockerfile`
The purpose of this project is to create a Python script that performs descriptive statistics on a given dataset using Pandas. The script:
- Reads a dataset (CSV).
- Generates key summary statistics such as mean, median, and standard deviation.
- Creates a histogram to visualize the distribution of a numerical column.
The project uses `matplotlib` for data visualization and provides a markdown report summarizing the results.
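The three steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names `summarize` and `plot_histogram`, the output file name, and the four sample rows are all assumptions made up for the example, and the inline CSV text stands in for the real dataset file.

```python
import io
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows standing in for the real CSV file (illustrative only).
SAMPLE_CSV = """Age,Years of Experience,Salary
25,2,50000
30,5,65000
35,10,90000
40,15,120000
"""


def summarize(df: pd.DataFrame, column: str) -> dict:
    """Return mean, median, and standard deviation for one numeric column."""
    return {
        "mean": df[column].mean(),
        "median": df[column].median(),
        "std": df[column].std(),
    }


def plot_histogram(df: pd.DataFrame, column: str, path: str) -> str:
    """Save a histogram of one numeric column and return the file path."""
    fig, ax = plt.subplots()
    ax.hist(df[column], bins=10)
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")
    ax.set_title(f"Distribution of {column}")
    fig.savefig(path)
    plt.close(fig)
    return path


# With a real file this would be pd.read_csv("<path to the dataset>").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
stats = summarize(df, "Salary")

# Hypothetical output location; the real script may save elsewhere.
out_path = os.path.join(tempfile.gettempdir(), "salary_distribution.png")
plot_histogram(df, "Salary", out_path)
```

Keeping the statistics and the plotting in separate functions makes each one easy to cover in the `test` task without having to inspect an image.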
- Open codespaces.
- Load repo to codespaces.
- Wait for the installation of all the requirements in `requirements.txt`.
- Run the Makefile code: `make all`.
This dataset provides details regarding employee salaries within a company. Each row corresponds to an individual employee, with columns capturing various attributes, including age, gender, education, job title, experience, and salary.
Columns:
- Age: The age of the employee, given as a numerical value in years.
- Gender: The employee’s gender, categorized as male or female.
- Education Level: Indicates the employee’s highest educational qualification, categorized as high school, bachelor’s degree, master’s degree, or PhD.
- Job Title: The position held by the employee within the company, with possible titles such as manager, analyst, engineer, or administrator.
- Years of Experience: The number of years the employee has been working, represented as a numeric value.
- Salary: The employee’s annual income, listed in US dollars, varying based on job title, experience, and education.
- Summary Report: The script computes and outputs important summary statistics, rounded to 2 decimal places, for numerical columns like `Age`, `Years of Experience`, and `Salary`. Download the summary report
- Salary Distribution Visualization: A histogram showcasing the distribution of salary in the dataset is generated.
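One way such a rounded summary report could be built is with `DataFrame.describe()` followed by `round(2)`. The sketch below is an assumption about the approach, not the repository's code: the helper `to_markdown_table` is hypothetical, and the sample rows are made up rather than taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset is loaded from a CSV file.
df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Years of Experience": [2, 5, 10, 15],
        "Salary": [50000, 65000, 90000, 120000],
    }
)

numeric_cols = ["Age", "Years of Experience", "Salary"]

# describe() returns count, mean, std, min, quartiles, and max per column.
summary = df[numeric_cols].describe().round(2)


def to_markdown_table(table: pd.DataFrame) -> str:
    """Render a numeric DataFrame as a markdown table (hypothetical helper)."""
    header = "| | " + " | ".join(table.columns) + " |"
    divider = "|" + "---|" * (len(table.columns) + 1)
    rows = [
        "| " + str(idx) + " | " + " | ".join(f"{v:.2f}" for v in row) + " |"
        for idx, row in table.iterrows()
    ]
    return "\n".join([header, divider, *rows])


report = to_markdown_table(summary)
print(report)
```

Building the markdown by hand avoids an extra dependency. pandas also offers `DataFrame.to_markdown()`, which requires the optional `tabulate` package.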
Here’s an example of the summary statistics generated by the script:
| | Age | Years of Experience | Salary |
|---|---|---|---|
| count | 373.0 | 373.0 | 373.00 |
| mean | 37.43 | 10.03 | 100577.35 |
| std | 7.07 | 6.56 | 48240.01 |
| min | 23.00 | 0.00 | 350.00 |
| 50% | 36.00 | 9.00 | 95000.00 |
| max | 53.00 | 25.00 | 250000.00 |
The script provides a clean and concise summary of the dataset's most important numerical fields.