prateekguptaiiitk/Resume_Filtering

Resume Filtering Using Machine Learning

Resume filtering on the basis of job descriptions (JDs). This was a summer internship project with Skybits Technologies Pvt. Ltd.

Introduction

The project searches the entire resume database and displays the resumes that best fit a given job description (JD). In its current form, it does this by assigning each CV a score based on a comparison with the corresponding job description. This narrows the applicant pool to a fraction of its original size, and the resumes that remain can then be reviewed manually. The project uses machine learning and natural language processing techniques to automate the process.
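The scoring notebooks themselves are not reproduced in this README. As a rough illustration of the idea of scoring every CV against a JD and keeping only the best matches, here is a minimal sketch. It assumes cosine similarity over plain term counts as a stand-in for the project's word2vec-based comparison; the function names (`tokenize`, `cosine`, `rank_cvs`) and the sample texts are hypothetical and not taken from the repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_cvs(jd_text, cvs, top_k=3):
    """Score every CV against the JD and return the top_k (name, score) pairs."""
    jd_vec = Counter(tokenize(jd_text))
    scored = [
        (name, cosine(jd_vec, Counter(tokenize(text))))
        for name, text in cvs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Invented sample data, for illustration only.
jd = "Python developer with machine learning and NLP experience"
cvs = {
    "cv_a.txt": "Built NLP pipelines in Python using machine learning",
    "cv_b.txt": "Accountant experienced in auditing and tax filing",
    "cv_c.txt": "Java developer with web experience",
}
ranking = rank_cvs(jd, cvs, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Term counts only reward exact word overlap, which is presumably why the project trains its own word embeddings: an embedding-based score can also match related terms that do not appear verbatim in both documents.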

Directory Structure

.
├── Data
│   ├── CVs
│   ├── collectCV.py
│   └── jd.csv
├── Model
│   ├── Model_Training.ipynb
│   ├── Sentence_Extraction.ipynb
│   ├── paragraph_extraction_from_posts.ipynb
│   ├── sample_bitcoin.stackexchange_paras.txt
│   └── sample_bitcoin.stackexchange_sentences.txt
├── Scoring
│   ├── CV_ranking.ipynb
│   ├── Using Spacy Model.ipynb
│   ├── With Word2Vec.ipynb
│   ├── context.jpg
│   └── prc_data.csv
└── Section Extraction
    ├── Section_Extraction.ipynb
    ├── convertDocxToText.py
    ├── convertPDFToText.py
    ├── extract.py
    └── get_jd.ipynb
    

Directory Details

  • CVs : Contains 250 resumes extracted from indeed.com, in text format.
  • collectCV.py : Python script that automates the extraction of CVs from indeed.com. While the program is running, any new text copied to the clipboard is saved as a CV in the CVs/ directory in text format.
  • jd.csv : CSV file containing cleaned job descriptions from a Kaggle dataset.
  • Model_Training.ipynb : Notebook to train the word2vec model using gensim. The model was saved locally in the ./model/ subdirectory.
  • Sentence_Extraction.ipynb : Notebook for extracting cleaned sentences from extracted paragraphs.
  • paragraph_extraction_from_posts.ipynb : Notebook for extracting paragraphs from Posts.xml.
  • sample_bitcoin.stackexchange_sentences.txt : The sentences.txt (pure sentences) file for the bitcoin.stackexchange.com subdirectory of the dataset. It was generated from the corresponding paras.txt file using the code in sentence_extraction_from_paras.txt.ipynb. The process took around 12.5 hours to complete.
  • sample_bitcoin.stackexchange_paras.txt : The paras.txt (paragraphs in HTML tags) file for the bitcoin.stackexchange.com subdirectory of the dataset. It was generated from Posts.xml using the code in paragraph_extraction_from_Posts.xml.ipynb.
  • CV_ranking.ipynb : Notebook for ranking the CVs against job descriptions (JDs).
  • Using Spacy Model.ipynb : Demonstrates the need for a custom word2vec model rather than a general-purpose one. The similarity values produced by the en_core_web_md spaCy model, trained on Google News articles, do not capture the technical vocabulary the project requires.
  • With Word2Vec.ipynb : Demonstrates how to use word2vec to find similar words by word and by vector. It also implements the sent2vec() function, which takes a sentence as an argument and returns an average vector for that sentence. Root mean square is used to average the vectors. The function makes it possible to find similar words for whole phrases, which is more meaningful when searching for roles and similar multi-word terms.

    For example, 'web engineer' will give 'engineer' as a similar word.
  • context.jpg : Pie Chart showing the top three most frequent titles of Job Descriptions
  • prc_data.csv : CSV file storing processed sections of different resumes.
  • Section_Extraction.ipynb : Notebook for extracting sections from different resumes.
  • convertDocxToText.py : Python script for converting a .docx file to .txt.
  • convertPDFToText.py : Python script for converting a .pdf file to .txt.
  • extract.py : Python script for extracting 7z-compressed archives.
  • get_jd.ipynb : Notebook for cleaning and extracting relevant portions of original jd.csv file from Kaggle.
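
The sent2vec() function described under With Word2Vec.ipynb is not shown in this README, so the following is only a sketch of one plausible reading. It assumes the word vectors of a sentence are summed and the sum is divided by the root of its summed squares (its L2 norm); the notebook's exact formula may differ. The toy embeddings, the `most_similar` helper, and its `exclude` parameter are hypothetical and exist only to keep the example self-contained.

```python
import math

# Toy 3-dimensional embeddings, invented purely for illustration.
TOY_VECTORS = {
    "web": [0.9, 0.1, 0.0],
    "engineer": [0.7, 0.6, 0.1],
    "developer": [0.8, 0.5, 0.1],
    "accountant": [0.0, 0.2, 0.9],
}


def sent2vec(sentence, vectors):
    """Sum the vectors of the known words and scale the sum to unit length."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return None  # no known words, so no vector can be built
    dim = len(next(iter(vectors.values())))
    total = [sum(vectors[w][i] for w in words) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total


def most_similar(query_vec, vectors, exclude=()):
    """Return the vocabulary word with the highest cosine similarity to query_vec."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query_vec, vectors[w]))


phrase_vec = sent2vec("web engineer", TOY_VECTORS)
nearest = most_similar(phrase_vec, TOY_VECTORS, exclude={"web", "engineer"})
print(nearest)
```

With a trained gensim model, a phrase vector built this way could be passed to the model's `similar_by_vector` method instead of the toy `most_similar` helper.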

Author

 Prateek Gupta