IDS706 Pandas Assignment by Kaisen Yao

This repository contains my work for the Pandas Descriptive Statistics Script assignment in IDS 706. The script reads a dataset, generates summary statistics, and creates a data visualization. To use it, open the repository in a GitHub Codespace and wait for the devcontainer to run the Makefile, which executes the following tasks: install, format, lint, and test.

This repository includes the following components:

.devcontainer
Makefile
requirements.txt
README.md
GitHub Actions workflow (.github/workflows)
Dockerfile

Purpose

The purpose of this project is to create a Python script that performs descriptive statistics on a given dataset using Pandas. The script:

Reads a dataset (CSV).
Generates key summary statistics such as mean, median, and standard deviation.
Creates a histogram to visualize the distribution of a numerical column.

The project uses matplotlib for data visualization and provides a markdown report summarizing the results.
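The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the function names `load_dataset` and `key_statistics` are hypothetical, and the inline sample rows are made up so the snippet runs without salary.csv.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for salary.csv; the values are made up.
SAMPLE_CSV = """Age,Gender,Education Level,Job Title,Years of Experience,Salary
32,Male,Bachelor's,Engineer,5,90000
28,Female,Master's,Analyst,3,65000
45,Male,PhD,Manager,15,150000
36,Female,Bachelor's,Analyst,7,60000
"""


def load_dataset(source):
    """Read a CSV file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


def key_statistics(df, columns):
    """Return the mean, median, and standard deviation of the given columns."""
    return df[columns].agg(["mean", "median", "std"])


df = load_dataset(io.StringIO(SAMPLE_CSV))
stats = key_statistics(df, ["Age", "Years of Experience", "Salary"])
print(stats)
```

Passing a real path such as `load_dataset("salary.csv")` would work the same way, since `pd.read_csv` accepts both paths and file-like objects.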

Preparation

Open GitHub Codespaces.
Load the repository into a codespace.
Wait for all the requirements in requirements.txt to finish installing.
Run the Makefile: make all.

About the Dataset

This dataset provides details regarding employee salaries within a company. Each row corresponds to an individual employee, with columns capturing various attributes, including age, gender, education, job title, experience, and salary.

Columns:

Age: The age of the employee, given as a numerical value in years.
Gender: The employee’s gender, categorized as male or female.
Education Level: Indicates the employee’s highest educational qualification, categorized as high school, bachelor’s degree, master’s degree, or PhD.
Job Title: The position held by the employee within the company, with possible titles such as manager, analyst, engineer, or administrator.
Years of Experience: The number of years the employee has been working, represented as a numeric value.
Salary: The employee’s annual income, listed in US dollars, varying based on job title, experience, and education.
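Before computing statistics, it can help to confirm that a loaded frame matches the column description above. The sketch below assumes the CSV headers are spelled exactly like the labels in this list and that Age, Years of Experience, and Salary load as numeric types; both are assumptions, and `check_columns` is a hypothetical helper rather than part of the script.

```python
import pandas as pd

# Column names copied from the list above; exact CSV spelling is an assumption.
EXPECTED_COLUMNS = [
    "Age",
    "Gender",
    "Education Level",
    "Job Title",
    "Years of Experience",
    "Salary",
]
# Columns described as numerical values (assumed to load as numeric dtypes).
NUMERIC_COLUMNS = ["Age", "Years of Experience", "Salary"]


def check_columns(df):
    """Return a list of problems; an empty list means the frame looks as described."""
    problems = []
    for name in EXPECTED_COLUMNS:
        if name not in df.columns:
            problems.append(f"missing column: {name}")
    for name in NUMERIC_COLUMNS:
        if name in df.columns and not pd.api.types.is_numeric_dtype(df[name]):
            problems.append(f"column is not numeric: {name}")
    return problems


# Made-up one-row frames for illustration.
good = pd.DataFrame(
    {
        "Age": [32],
        "Gender": ["Male"],
        "Education Level": ["Bachelor's"],
        "Job Title": ["Engineer"],
        "Years of Experience": [5],
        "Salary": [90000],
    }
)
bad = good.drop(columns=["Job Title"]).assign(Salary=["90k"])

print(check_columns(good))
print(check_columns(bad))
```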

Outputs

Summary Report: The script computes and outputs important summary statistics rounded to 2 decimal places for numerical columns like Age, Years of Experience, and Salary.

Download the summary report
Salary Distribution Visualization: A histogram showcasing the distribution of salary in the dataset is generated.
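One way the two outputs could be produced is sketched below. It is an illustration under assumptions, not the repository's implementation: `write_summary_report` and `save_salary_histogram` are hypothetical names, the markdown table layout and bin count are guesses, the sample data is made up, and the files are written to a temporary directory.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def write_summary_report(df, columns, path):
    """Write describe() output, rounded to 2 decimals, as a markdown table."""
    summary = df[columns].describe().round(2)
    lines = [
        "| Statistic | " + " | ".join(columns) + " |",
        "|" + " --- |" * (len(columns) + 1),
    ]
    for stat, row in summary.iterrows():
        cells = " | ".join(f"{value:.2f}" for value in row)
        lines.append(f"| {stat} | {cells} |")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return summary


def save_salary_histogram(df, path, bins=10):
    """Save a histogram of the Salary column to an image file."""
    fig, ax = plt.subplots()
    ax.hist(df["Salary"].dropna(), bins=bins, edgecolor="black")
    ax.set_xlabel("Salary (USD)")
    ax.set_ylabel("Number of employees")
    ax.set_title("Salary Distribution")
    fig.savefig(path)
    plt.close(fig)  # release the figure's memory


# Made-up data for illustration only.
df = pd.DataFrame(
    {
        "Age": [25, 31, 38, 44, 52],
        "Years of Experience": [1, 6, 12, 18, 25],
        "Salary": [45000, 70000, 95000, 130000, 180000],
    }
)
out_dir = tempfile.mkdtemp()
report_path = os.path.join(out_dir, "summary_report.md")
plot_path = os.path.join(out_dir, "data_visualization.png")

summary = write_summary_report(df, list(df.columns), report_path)
save_salary_histogram(df, plot_path)
print(open(report_path, encoding="utf-8").read())
```

The table is built by hand here rather than with `DataFrame.to_markdown`, which would add a dependency on the optional `tabulate` package.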

Example Output

Here’s an example of the summary statistics generated by the script:

	Age	Years of Experience	Salary
count	373.0	373.0	373.00
mean	37.43	10.03	100577.35
std	7.07	6.56	48240.01
min	23.00	0.00	350.00
50%	36.00	9.00	95000.00
max	53.00	25.00	250000.00

The script provides a clean and concise summary of the dataset's most important numerical fields.
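The example table lists six rows (count, mean, std, min, 50%, max), whereas `DataFrame.describe()` also returns the 25% and 75% quartiles by default. A table with this layout could be obtained by selecting rows from the `describe()` result, as in the sketch below; whether the script does it this way is an assumption, and the data is made up.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Years of Experience": [2, 5, 10, 15],
        "Salary": [50000, 70000, 90000, 120000],
    }
)

# Keep only the rows shown in the example output; "50%" is the median.
rows = ["count", "mean", "std", "min", "50%", "max"]
summary = df.describe().loc[rows].round(2)
print(summary)
```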
