This project uses text embeddings and Milvus, a high-performance vector search engine, to detect duplicate job postings. It generates embeddings from job descriptions and searches them with Milvus to find likely duplicates efficiently.
- Introduction
- Project Structure
- Requirements
- Installation
- Usage
- Results and Evaluation
- Docker Integration
- Video Demo
- Contributing
- License
- Author
The project focuses on the following key tasks:
- Data Preprocessing: Explore and clean job postings data, handling missing values and anomalies.
- Generating Embeddings: Utilize a pre-trained model (Sentence Transformers) to generate embeddings for job descriptions.
- Milvus for Duplicate Detection: Set up a Milvus instance, insert embeddings, and implement a method to search for potential duplicates.
- Docker/Docker Compose Integration: Containerize the project for easy reproducibility.
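As a rough illustration of the preprocessing task, the sketch below reads postings from CSV text, strips HTML-like tags, collapses whitespace, and skips rows with a missing description. The column name `description` and the sample rows are assumptions made for illustration; the real schema of data/job_postings.csv may differ, and this is not necessarily how the project's own scripts clean the data.

```python
import csv
import io
import re


def clean_description(text):
    """Strip HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text or "")
    return re.sub(r"\s+", " ", text).strip()


def load_postings(handle, text_column="description"):
    """Yield (row_index, cleaned_text) pairs, skipping rows with no description.

    `text_column` is a hypothetical column name; adjust it to the real CSV header.
    """
    for index, row in enumerate(csv.DictReader(handle)):
        cleaned = clean_description(row.get(text_column))
        if cleaned:
            yield index, cleaned


# Made-up sample standing in for data/job_postings.csv.
sample = io.StringIO(
    "title,description\n"
    "Engineer,<p>Build   APIs</p>\n"
    "Analyst,\n"
    "Engineer,build apis\n"
)
postings = list(load_postings(sample))
```

For the real file, the same function can be given an open file handle, for example `open("data/job_postings.csv", newline="", encoding="utf-8")`.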
/job-posting-duplicate-detection
|-- data/
| |-- job_postings.csv
|-- embeddings/
| |-- generate_embeddings.py
|-- milvus/
| |-- milvus_setup.py
| |-- duplicate_detection.py
|-- Dockerfile
|-- docker-compose.yml
|-- video_demo/
| |-- demo_video.mp4
|-- README.md
- Python 3.x
- PyTorch
- Sentence Transformers
- pymilvus
Install dependencies using:
pip install -r requirements.txt
- Clone the repository:
  git clone https://github.com/arasgungore/job-posting-duplicate-detection.git
- Navigate to the project directory:
  cd job-posting-duplicate-detection
- Install dependencies:
  pip install -r requirements.txt
- Data Preprocessing: Explore and clean the data in the data/job_postings.csv file.
- Generating Embeddings: Run the following command to generate embeddings:
  python embeddings/generate_embeddings.py
- Milvus for Duplicate Detection:
  - Set up the Milvus instance:
    python milvus/milvus_setup.py
  - Run duplicate detection:
    python milvus/duplicate_detection.py
- Docker/Docker Compose Integration: Build and run the Docker image:
  docker build -t job-posting-duplicate-detection .
  docker-compose up
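The embedding and Milvus steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the contents of the project's scripts: the model name `all-MiniLM-L6-v2`, the 0.9 threshold, and the helper names are all illustrative choices. It assumes a recent pymilvus that provides `MilvusClient`, whose quick-setup `create_collection` creates an integer `id` field and a `vector` field.

```python
def encode_descriptions(texts, model_name="all-MiniLM-L6-v2"):
    """Return one L2-normalized embedding per text as a list of float lists."""
    # Imported lazily so the helpers below can be used without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # model name is an assumption
    return model.encode(texts, normalize_embeddings=True).tolist()


def index_embeddings(client, collection, vectors):
    """Create a cosine-similarity collection and insert one row per vector.

    Row ids are the positions of the vectors in the input list.
    """
    client.create_collection(
        collection_name=collection,
        dimension=len(vectors[0]),
        metric_type="COSINE",
    )
    rows = [{"id": i, "vector": vector} for i, vector in enumerate(vectors)]
    client.insert(collection_name=collection, data=rows)


def duplicate_pairs(client, collection, vectors, threshold=0.9, limit=5):
    """Search every vector against the collection and keep close, distinct matches.

    With the COSINE metric a larger distance value means more similar.
    Assumes ids equal list positions, as set up by index_embeddings.
    """
    results = client.search(collection_name=collection, data=vectors, limit=limit)
    pairs = set()
    for query_id, hits in enumerate(results):
        for hit in hits:
            if hit["id"] != query_id and hit["distance"] >= threshold:
                pairs.add((min(query_id, hit["id"]), max(query_id, hit["id"])))
    return sorted(pairs)
```

With a running Milvus instance, `client` would be something like `MilvusClient(uri="http://localhost:19530")` from pymilvus; the URI here is an assumption based on the default Milvus port. Freshly inserted rows may take a moment to become searchable depending on the collection's consistency level, and the threshold usually needs tuning on real data.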
Results and evaluation metrics are provided in the code comments of milvus/duplicate_detection.py. The effectiveness of the duplicate detection method can be assessed by measuring precision and recall at a given similarity threshold.
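To make the relationship between threshold, precision, and recall concrete, here is a small sketch that scores predicted duplicate pairs against labelled pairs at several thresholds. The similarity scores and labels are invented for illustration; the project does not state that a labelled set exists, and the function names are hypothetical.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Compute precision and recall over unordered pairs of posting ids."""
    predicted = {tuple(sorted(pair)) for pair in predicted_pairs}
    actual = {tuple(sorted(pair)) for pair in true_pairs}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def sweep_thresholds(scored_pairs, true_pairs, thresholds):
    """Map each threshold to the (precision, recall) obtained at that cutoff.

    `scored_pairs` maps a pair of ids to its similarity score.
    """
    report = {}
    for threshold in thresholds:
        predicted = [pair for pair, score in scored_pairs.items() if score >= threshold]
        report[threshold] = precision_recall(predicted, true_pairs)
    return report


# Invented similarity scores and hand-labelled duplicates, for illustration only.
scored = {(0, 2): 0.97, (1, 3): 0.91, (0, 1): 0.62}
labels = [(0, 2), (1, 3), (4, 5)]
report = sweep_thresholds(scored, labels, [0.6, 0.9, 0.95])
```

In this toy data, raising the threshold from 0.6 to 0.9 removes the one false positive, while raising it further to 0.95 drops a true duplicate and lowers recall. That trade-off is what choosing a threshold has to balance.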
The project includes Docker and Docker Compose files (Dockerfile and docker-compose.yml) for containerization. This ensures a reproducible and isolated environment.
To build and run the Docker image, follow the instructions in the Usage section.
Watch the demo video (video_demo/demo_video.mp4) for a quick overview of the project.
Contributions are welcome! Feel free to open issues or pull requests for any improvements or new features.
This project is licensed under the MIT License.
👤 Aras Güngöre
- LinkedIn: @arasgungore
- GitHub: @arasgungore