Resume Filtering Using Machine Learning

Resume filtering on the basis of Job Descriptions(JDs). It was a summer internship project with Skybits Technologies Pvt. Ltd.

Introduction

The main feature of the current project is that it searches the entire resume database to select and display the resumes which fit the best for the provided job description(JD). This is, in its current form, achieved by assigning a score to each CV by intelligently comparing them against the corresponding Job Description. This reduces the window to a fraction of an original size of applicants. Resumes in the final window can be manually checked for further analysis. The project uses techniques in Machine Learning and Natural Language Processing to automate the process.

Directory Structure

.
├── Data
│   ├── CVs
│   ├── collectCV.py
│   └── jd.csv
├── Model
│   ├── Model_Training.ipynb
│   ├── Sentence_Extraction.ipynb
│   ├── paragraph_extraction_from_posts.ipynb
│   ├── sample_bitcoin.stackexchange_paras.txt
│   ├── sample_bitcoin.stackexchange_sentences.txt
├── Scoring
│   ├── CV_ranking.ipynb
│   ├── Using Spacy Model.ipynb
│   ├── With Word2Vec.ipynb
│   ├── context.jpg
│   └── prc_data.csv
└── Section Extraction
    ├── Section_Extraction.ipynb
    ├── convertDocxToText.py
    ├── convertPDFToText.py
    ├── extract.py
    └── get_jd.ipynb

Directory Details

Data

CVs : Contains 250 extracted resumes in text format from indeed.com
collectCV.py : Python script to automate the process of extracting CVs from indeed.com. While this program is running, every new text copied to clipboard is saved as a CV in CVs/ directory in text format.
jd.csv : CSV file containing cleaned job descriptions from Kaggle. Dataset can be found here

Model

Model_Training.ipynb : Notebook to train the word2vec model using gensim. The model was saved in ./model/ subdirectory (locally).
Sentence_Extraction.ipynb : Notebook for extracting cleaned sentences from extracted paragraphs.
paragraph_extraction_from_posts.ipynb : Notebook for extracting paragraphs from Posts.xml
sample_bitcoin.stackexchange_sentences.txt : It is the sentences.txt (pure sentences) file for bitcoin.stackexchange.com subdirectory of the dataset. It was generated from the corresponding paras.txt generated earlier using the code in sentence_extraction_from_paras.txt.ipynb. The process took around 12.5 hours to complete.
sample_bitcoin.stackexchange_paras.txt : It is the paras.txt (paragraph in html tags) file for bitcoin.stackexchange.com subdirectory of the dataset. It was generated from the Posts.xml using the code in paragraph_extraction_from_Posts.xml.ipynb

Scoring

CV_ranking.ipynb : Notebook for ranking the CVs according to JDs(Job Description)
Using Spacy Model.ipynb : Demonstrates the need for a custom Word2Vector model rather than a general model trained otherwise. The similarity values generated by en_core_web_md spaCy model trained on Google News articles, do not reflect the technological sharpness required for the project.
With Word2Vec.ipynb : Demonstrates how to use word2vec to get similar words by words and similar words by vector. It also implements sent2vec() function. This function takes a sentence as a argument and returns a average vector for the sentence. Root Mean Square is used to average the vectors. The advantage of this function is to use it to find similar words for phrases which makes more sense while searching for roles etc.

For example:

'web engineer' will give 'engineer' as a similar word

context.jpg : Pie Chart showing the top three most frequent titles of Job Descriptions
prc_data.csv : CSV file storing processed sections of different resumes.

Section Extraction

Section_Extraction.ipynb : Notebook for extracting sections from different resumes.
convertDocxToText.py : Python script for converting a .docx file to `.txt'.
convertPDFToText.py : Python scrtpt for converting a .pdf file to .txt.
extract.py : Python script for extracting compressed files in 7z.
get_jd.ipynb : Notebook for cleaning and extracting relevant portions of original jd.csv file from Kaggle.

Author

 Prateek Gupta

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
Data		Data
Model		Model
Scoring		Scoring
Section Extraction		Section Extraction
LICENSE		LICENSE
README.md		README.md
Report.md		Report.md
SkyBits-logo-small.png		SkyBits-logo-small.png
requirements.txt		requirements.txt
stackexchange tree.png		stackexchange tree.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Resume Filtering Using Machine Learning

Introduction

Directory Structure

Directory Details

Data

Model

Scoring

Section Extraction

Author

About

Releases

Packages

Contributors 2

Languages

License

prateekguptaiiitk/Resume_Filtering

Folders and files

Latest commit

History

Repository files navigation

Resume Filtering Using Machine Learning

Introduction

Directory Structure

Directory Details

Data

Model

Scoring

Section Extraction

Author

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages