salaryprediction

Salary Prediction Project (Python)

The problem

I was given a feature dataset and a target dataset for training, plus a test dataset for prediction. The feature dataset describes job characteristics, and the target dataset holds the corresponding salary for each job.

Exploratory Data Analysis

I started by verifying that there were no null, missing or duplicated values, and that there were 5 categorical and 2 numeric features. I then merged the training features and target into a single dataset. I used the interquartile range (IQR) of the salary column to determine upper and lower bounds for finding outliers.
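The checks and the merge described above could look roughly like the following. This is a minimal sketch on synthetic data: the column names (`jobId`, `jobType`, `yearsExperience`, `milesFromMetropolis`, `salary`) and the helper `summarize_quality` are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd

# Tiny synthetic stand-ins for the training feature and target files.
# All column names here are assumptions, not the real dataset schema.
features = pd.DataFrame({
    "jobId": ["J1", "J2", "J3", "J4"],
    "jobType": ["JUNIOR", "SENIOR", "CEO", "JUNIOR"],
    "yearsExperience": [1, 8, 20, 3],
    "milesFromMetropolis": [50, 10, 5, 80],
})
target = pd.DataFrame({
    "jobId": ["J1", "J2", "J3", "J4"],
    "salary": [60, 120, 250, 55],
})


def summarize_quality(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows (hypothetical helper)."""
    return {
        "missing": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


feature_report = summarize_quality(features)
target_report = summarize_quality(target)

# Split the non-key columns by dtype to count categorical vs. numeric features.
non_key = features.drop(columns="jobId")
numeric_cols = non_key.select_dtypes(include="number").columns.tolist()
categorical_cols = non_key.select_dtypes(exclude="number").columns.tolist()

# Merge on the shared key; validate raises if a key is duplicated on either side.
train = features.merge(target, on="jobId", how="inner", validate="one_to_one")
```

Using `validate="one_to_one"` makes the merge fail loudly if a job identifier appears more than once, which doubles as a duplicate check on the key.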

I verified that the upper outliers with a junior jobtype were still reasonable, so I cleaned the data by removing only the outliers below the lower bound.
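The outlier step could be sketched as follows. The 1.5 × IQR multiplier is the conventional choice and is an assumption here, since the README does not state which multiplier was used; the `salary` column name and the sample values are also made up for illustration.

```python
import pandas as pd

# Synthetic salaries with one suspiciously low and one very high value.
train = pd.DataFrame({"salary": [0, 55, 60, 62, 65, 70, 72, 75, 80, 300]})

# Interquartile range of the target column.
q1 = train["salary"].quantile(0.25)
q3 = train["salary"].quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences (the multiplier is an assumption).
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Inspect both tails, but drop only rows below the lower bound,
# keeping the high salaries that were judged plausible.
low_outliers = train[train["salary"] < lower_bound]
high_outliers = train[train["salary"] > upper_bound]
cleaned = train[train["salary"] >= lower_bound].reset_index(drop=True)
```

Filtering on one side only keeps the high salaries in the training data while discarding values such as a salary of zero.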

I checked the correlation between each feature and the target. I applied label encoding to the categorical features so that they could be included in a heatmap.
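One way to label-encode the categorical columns and compute the correlation matrix behind such a heatmap is sketched below. It uses pandas category codes, which assign integers in sorted label order; whether the notebook used this or scikit-learn's `LabelEncoder` is not stated, and the column names and values are assumptions.

```python
import pandas as pd

# Synthetic sample; column names are assumptions for illustration.
train = pd.DataFrame({
    "jobType": ["JUNIOR", "SENIOR", "CEO", "JUNIOR", "SENIOR", "CEO"],
    "yearsExperience": [1, 8, 20, 3, 10, 22],
    "milesFromMetropolis": [80, 30, 5, 90, 20, 10],
    "salary": [55, 120, 250, 60, 130, 260],
})

encoded = train.copy()
# Replace every non-numeric column with integer codes (sorted label order).
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

# Pearson correlation between every pair of columns.
corr = encoded.corr()
salary_corr = corr["salary"].drop("salary").sort_values(ascending=False)
```

The resulting `corr` frame can be passed to a plotting call such as `seaborn.heatmap(corr, annot=True)`. Because label codes impose an arbitrary order on nominal categories, correlations for those columns are only a rough indication.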

Distribution Plot

Plotting the numerical features against the target showed:

A positive linear relation between "Years of experience" and "Salary".
A negative linear relation between "Miles from Metropolis" and "Salary".

Heatmap of correlation between features

I started with Linear Regression as a baseline and selected mean squared error (MSE) as the evaluation metric.

The following models were created:

- Linear Regression
- Pipeline with StandardScaler, PCA and Linear Regression
- Random Forest Regressor
- Gradient Boosting Regressor
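The four candidates listed above could be declared with scikit-learn roughly as follows. The dictionary keys and every hyperparameter shown (`n_estimators`, `random_state`, default `PCA()`) are assumptions; the README does not record the settings that were actually used.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Candidate models keyed by a short, hypothetical name.
models = {
    "linear_regression": LinearRegression(),
    # Scale first so PCA is not dominated by features with large ranges.
    "scaled_pca_linear": make_pipeline(StandardScaler(), PCA(), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
```

Keeping the models in one dictionary makes it straightforward to loop over them with the same evaluation routine.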

I ran 5-fold cross-validation on each model and measured the MSE.
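A comparison of this kind can be sketched with `cross_val_score`. scikit-learn reports MSE as the negated score `neg_mean_squared_error`, so the sign is flipped before averaging. The data below is synthetic and the helper name `mean_cv_mse` is hypothetical; only two of the models are included to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the encoded training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)


def mean_cv_mse(model, X, y, folds=5):
    """Average MSE over k folds (hypothetical helper)."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
    return float(-scores.mean())  # flip the sign back to a plain MSE


candidates = {
    "linear_regression": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}
results = {name: mean_cv_mse(model, X, y) for name, model in candidates.items()}

# Lower MSE is better.
best_name = min(results, key=results.get)
```

On this toy data the ranking says nothing about the real dataset; the sketch only shows the mechanics of scoring every model the same way.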

The model with the lowest MSE was the Gradient Boosting Regressor.

The Gradient Boosting Regressor was then trained on the entire training set and used to generate predictions for the test set.
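The final fit and prediction step might look like the sketch below, which also derives a feature-importance table like the one stored in this repository. The data is synthetic, and the column names, the `predicted_salary` label and the default hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic training data: salary rises with experience, falls with distance.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "yearsExperience": rng.integers(0, 25, size=300),
    "milesFromMetropolis": rng.integers(0, 100, size=300),
})
y_train = (
    50
    + 2.0 * X_train["yearsExperience"]
    - 0.4 * X_train["milesFromMetropolis"]
    + rng.normal(scale=5.0, size=300)
)

# Unlabelled test rows with the same columns as the training features.
X_test = pd.DataFrame({"yearsExperience": [2, 20], "milesFromMetropolis": [90, 5]})

# Fit on all training rows, then predict the test rows.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
predictions = pd.DataFrame({"predicted_salary": model.predict(X_test)})

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.DataFrame({
    "feature": X_train.columns,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False)
```

Both frames can be written out with `DataFrame.to_csv`, for example `predictions.to_csv("predictions.csv", index=False)`.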

Files

CSV datasets can be found in the data folder.
Images used in this README can be found in the images folder.
The model can be found in the model.txt file.
Feature importances can be found in the feature_importances.csv file.
Predictions on the test set can be found in the predictions.csv file.

danielaaz04/salaryprediction
