Resume Retrieval and Question-Answering System

Project Description

The Resume Retrieval and Question-Answering System is an advanced application designed to streamline the hiring process through efficient resume retrieval and intelligent question answering. The system leverages the following key technologies:

Vector Database: Milvus:
- Used for storing and retrieving resume embeddings
- Supports both sparse and dense vector representations
- Enables hybrid search capabilities
Embedding Generation: BGEM3EmbeddingFunction:
- Generates both sparse and dense embeddings for resume content
- Supports CPU and GPU acceleration
Large Language Model: Mistral AI's Mixtral-8x7B-Instruct-v0.1:
- Utilized for generating human-like responses to queries based on retrieved resume contexts
Natural Language Processing Libraries:
- PyMilvus for interacting with the Milvus database
- Hugging Face's InferenceClient for interfacing with the large language model
Development Environment:
- Python-based implementation
- Supports both CPU and GPU processing

This tech stack combines the power of vector databases for efficient semantic search with advanced language models for intelligent query processing, creating a robust system for resume analysis and candidate matching.

Key Components

Data Preprocessing:
- The project begins with a preprocessing script (preprocess.py) that extracts and structures resume data from PDF files. This includes identifying key sections such as professional summaries, skills, work history, and education.
- The extracted text is tokenized and normalized, ensuring consistency and accuracy for subsequent analysis.
- Dense and sparse embeddings are generated using machine learning models, facilitating effective semantic search capabilities.
Vector Database Integration:
- The vector_db.py module is responsible for setting up and managing a Milvus vector database, which is optimized for storing and retrieving high-dimensional data like embeddings.
- It creates collections, defines schemas for the resume data (including fields for dense and sparse vectors), and inserts processed resume data into the database.
- This architecture allows for hybrid search capabilities, combining both dense and sparse vectors to enhance retrieval accuracy.
- Employs BGEM3EmbeddingFunction for generating embeddings.
- Utilizes Milvus for efficient vector storage and retrieval.
Search and Retrieval Pipeline:
- The pipeline.py file serves as the core of the retrieval mechanism. It includes methods for performing dense, sparse, and hybrid searches against the Milvus database.
- The system intelligently ranks the results based on semantic similarity, ensuring that users receive the most relevant resumes in response to their queries.
- It also includes functionality to generate embeddings for user queries, allowing for natural language searches that yield meaningful results.
- Utilizes a large language model (Mixtral-8x7B) for generating answers based on retrieved contexts.
- Uses both sparse and dense vector search capabilities of Milvus.
User Interface:
- The application is presented through a web interface built with Gradio (app.py), offering a chat-like experience for users to input queries and view results.
- Users can ask about specific skills, job titles, or other qualifications, and the system retrieves relevant resumes while providing concise, context-aware answers using a language model.
- The interface displays retrieved documents and allows for follow-up questions, enabling an interactive exploration of candidate profiles.

Overall Functionality

The Resume Retrieval and Question-Answering System enhances the recruitment process by:

Streamlining Candidate Search: Quickly identifies and retrieves relevant resumes based on specific criteria.
Improving Candidate Insights: Provides detailed answers about candidates’ qualifications and experiences, aiding decision-making for hiring managers.
Facilitating Interactive Exploration: Allows users to ask follow-up questions, ensuring a comprehensive understanding of the candidate pool.

Use Cases

HR professionals and recruiters can leverage the system to efficiently sift through large volumes of resumes, saving time and improving the quality of candidate selection.
Organizations can use the system to better match candidates to job requirements based on detailed insights derived from resume content.

Step 1: Clone the Repository

First, clone the repository containing the project files to your local machine:

git clone https://github.com/rahulsharmavishwakarma/resume_bot
cd resume_bot

Step 2: Set Up the Python Environment

Create a Virtual Environment (optional but recommended):
```
python -m venv venv
```
Activate the Virtual Environment:
- On Windows:
```
venv\Scripts\activate
```
- On macOS/Linux:
```
source venv/bin/activate
```

Step 3: Install Required Packages

Install the required packages listed in requirements.txt:

pip install -r requirements.txt

Step 4: Prepare the Resume Data

Organize Your PDF Resumes:

Place your PDF resumes into a directory structure, categorized by profession. For example:

/data/
    /software_engineer/
        resume1.pdf
        resume2.pdf
    /data_scientist/
        resume1.pdf
        resume2.pdf

Run the Preprocessing Script:
- Execute the preprocess.py script to extract and structure resume data:
```
python preprocess.py
```
- This will create a data.json file containing the structured resume data.

Step 5: Populate the Vector Database

Run the Vector Database Script:
- Execute the vector_db.py script to set up the Milvus vector database and insert the processed resume data:
```
python vector_db.py
```
- Ensure that the Milvus service is running before executing this step.

Step 6: Set Up the Retrieval Pipeline

Run the Pipeline Script (optional):
- If needed, test the pipeline functionality using pipeline.py to ensure the search mechanisms work correctly:
```
python pipeline.py
```

Step 7: Launch the Web Application

Run the Gradio App:
- Finally, execute the app.py script to start the web interface:
```
python app.py
```
Access the Application:
- Open a web browser and navigate to the provided Gradio URL (usually http://127.0.0.1:7860) to interact with the application.

Using the Application

Enter your search queries related to candidates' skills, experiences, or qualifications.
Review the retrieved documents and answers provided by the system.
Ask follow-up questions to explore the candidate profiles further.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Resume Retrieval and Question-Answering System

Project Description

Key Components

Overall Functionality

Use Cases

Step 1: Clone the Repository

Step 2: Set Up the Python Environment

Step 3: Install Required Packages

Step 4: Prepare the Resume Data

Step 5: Populate the Vector Database

Step 6: Set Up the Retrieval Pipeline

Step 7: Launch the Web Application

Using the Application

Files

README.md

Latest commit

History

README.md

File metadata and controls

Resume Retrieval and Question-Answering System

Project Description

Key Components

Overall Functionality

Use Cases

Step 1: Clone the Repository

Step 2: Set Up the Python Environment

Step 3: Install Required Packages

Step 4: Prepare the Resume Data

Step 5: Populate the Vector Database

Step 6: Set Up the Retrieval Pipeline

Step 7: Launch the Web Application

Using the Application