Tech Job Prediction

An end-to-end data science project that predicts the skills required for a specific tech job and the jobs that best fit your current skills.

Project Organization

├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third-party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── mlflow             <- MLflow experiment runs.
│   ├── experimentid   <- Experiment runs with all their data.
│   └── runid          <- Each run has a separate folder.
│       ├── artifact   <- Run artifacts folder (models, etc.).
│       ├── metrics    <- Run metrics folder.
│       └── param      <- Run parameters folder.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting

Motivation

This project aims to simplify career decisions in the tech industry through a user-friendly platform that predicts the essential skills for a specific job and recommends jobs based on an individual's current skill set. The system analyzes job descriptions with supervised classification to provide accurate skill predictions and personalized job matches.

Data Source

The data for this project comes from the Stack Overflow Developer Survey. The survey contains over 80K responses from people working in different tech jobs, together with their skills in different tech areas.
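Before modeling, skill answers like these have to be turned into numeric features. The snippet below is a minimal sketch of one common way to do that (multi-hot encoding), assuming multi-select answers are stored as semicolon-separated strings. The column name `skills` and the sample rows are hypothetical and are not taken from the project's code.

```python
def multi_hot(rows, column, sep=";"):
    """Turn a separator-delimited multi-select column into 0/1 features.

    Returns the sorted vocabulary and one 0/1 row per response.
    """
    # Collect every distinct item, skipping empty answers.
    vocab = sorted(
        {item for row in rows for item in row[column].split(sep) if item}
    )
    matrix = []
    for row in rows:
        chosen = set(row[column].split(sep))
        matrix.append([1 if item in chosen else 0 for item in vocab])
    return vocab, matrix


# Hypothetical responses; real survey rows have many more columns.
responses = [
    {"skills": "Python;SQL"},
    {"skills": "Java;SQL"},
    {"skills": ""},  # respondent skipped the question
]

vocab, features = multi_hot(responses, "skills")
print(vocab)     # ['Java', 'Python', 'SQL']
print(features)  # [[0, 1, 1], [1, 0, 1], [0, 0, 0]]
```

Each response becomes a fixed-length vector, which is the shape that tree-based classifiers expect.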

Technologies and methodology:

Model: XGBoost classifier
Machine learning lifecycle organization: MLflow
API development: Flask
Web app development: Plotly Dash

Analytics at a glance

Response distribution across countries

Job frequency per country

Skill frequency in US responses (nearly the same in the other countries)

Heatmap showing the specificity of each skill per job type

Correlation between job types

Feature Engineering

The full-stack and back-end developer classes made up the majority of the responses, and each can contain sub-profiles (e.g. a back-end developer can be a Java developer, a C++ developer, etc.). To make the job profiles more specific, I clustered these two classes and extracted useful sub-profiles from them. I used DBSCAN for this task because it can handle clusters of arbitrary shape.
I merged the scientist and researcher classes, as they represent more or less the same job profile.
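The clustering step described above can be illustrated with a small from-scratch DBSCAN. This is only a sketch of the algorithm, not the project's code, which presumably relied on a library implementation. The Jaccard distance over skill sets, the `eps` and `min_samples` values, and the sample profiles are all assumptions made for the example.

```python
NOISE = -1


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of skills."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(points, eps, min_samples, distance):
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the usual definition.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:  # core point: keep expanding
                queue.extend(j_nbrs)
    return labels


# Hypothetical back-end developer profiles.
profiles = [
    {"Java", "Spring", "SQL"},
    {"Java", "Spring", "Kotlin"},
    {"Java", "Spring", "SQL", "Kotlin"},
    {"C++", "CMake", "Linux"},
    {"C++", "Linux", "Qt"},
    {"C++", "CMake", "Linux", "Qt"},
    {"COBOL"},
]

labels = dbscan(profiles, eps=0.6, min_samples=2, distance=jaccard_distance)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy data the Java-oriented and C++-oriented respondents form two sub-profiles, and the isolated profile is marked as noise rather than being forced into a cluster.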

Model development

I first trained a linear regression model as a baseline against which to compare the other models.

Model               Recall score
Linear regression   37%
Random forest       82%
XGBoost             82%

Why recall? Because correctly detecting the right job mattered more to me than avoiding false positives.

XGBoost and Random Forest performed almost equally, but I chose XGBoost because its saved model is much smaller than the Random Forest model, which was around 9 GB.
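Recall for a class is the share of its true examples that the model labels correctly. The sketch below computes it per class and then averages over classes (macro averaging). The averaging scheme, the single-label setup, and the sample labels are assumptions for illustration; the source does not say how the reported scores were aggregated.

```python
from collections import Counter


def recall_per_class(y_true, y_pred):
    """Recall for each class: correct predictions / true examples."""
    true_pos = Counter()
    support = Counter()
    for truth, pred in zip(y_true, y_pred):
        support[truth] += 1
        if truth == pred:
            true_pos[truth] += 1
    return {label: true_pos[label] / support[label] for label in support}


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    per_class = recall_per_class(y_true, y_pred)
    return sum(per_class.values()) / len(per_class)


# Hypothetical labels, not project data.
y_true = ["backend", "backend", "backend", "backend",
          "data_scientist", "data_scientist"]
y_pred = ["backend", "backend", "backend", "frontend",
          "data_scientist", "backend"]

print(recall_per_class(y_true, y_pred))
# {'backend': 0.75, 'data_scientist': 0.5}
print(macro_recall(y_true, y_pred))  # 0.625
```

The wrong "frontend" prediction lowers the recall of "backend" only. False positives for a class do not affect that class's recall, which is why the metric matches the priority stated above.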

Limitations and what can be improved

Hyperparameter tuning with grid search or random search.
More feature engineering and statistical analysis.

Explore the notebook

The full analysis is available in the `notebooks` folder.

App Demo

https://youtu.be/up6f-HBg1H0?si=zrbtpbfAM3XoDJLm
