## Reason for Choice of Models:

### `sshleifer/distilbart-cnn-12-6` (for CV Summarization)
#### Pros:
1. **Efficiency**: `sshleifer/distilbart-cnn-12-6` is a distilled version of BART, which makes it smaller and faster than the original BART model (`facebook/bart-large-cnn`). This can lead to quicker inference times, especially beneficial when processing multiple CVs.

2. **Good Performance**: Despite being a distilled model, `distilbart-cnn-12-6` retains much of the performance of its larger counterpart. It can generate coherent and relevant summaries, which is crucial for summarizing CVs effectively.

3. **Reduced Resource Consumption**: The model requires less memory and computational power, making it easier to deploy in environments with limited resources or when scaling the application to handle many users simultaneously.

4. **Compatibility with Hugging Face's Pipeline**: The model integrates directly with Hugging Face's `pipeline` API, allowing for straightforward implementation and fine-tuning if necessary (a minimal usage sketch follows this list).
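
A minimal usage sketch, assuming the `transformers` package is installed (the sample CV text and generation parameters are illustrative, not values from this project):

```python
from transformers import pipeline

# Load the distilled BART summarizer; weights are fetched from the Hugging Face Hub.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

cv_text = "Experienced data engineer with five years of ETL pipeline work ..."  # illustrative input

# max_length/min_length bound the generated summary length in tokens.
summary = summarizer(cv_text, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```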

#### Cons:
1. **Limited Capacity**: As a smaller model, `distilbart-cnn-12-6` may not capture as much context or nuance as the full-sized BART model. This might lead to less detailed or accurate summaries, especially with complex or long CVs.

2. **Potential for Truncation**: The model's input token limit (1024 tokens) might still be an issue for longer CVs, leading to truncation and potentially missing important information in the summaries. This is a trade-off between efficiency and coverage; a chunking mitigation is sketched after this list.

3. **Reduced Flexibility in Summarization**: While the model is generally good at summarization, it might struggle with highly technical or domain-specific language present in CVs, which could result in less relevant summaries compared to a larger model fine-tuned specifically for this task.
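
One common mitigation for the 1024-token cap is to summarize a long CV chunk by chunk and join the partial summaries; a rough sketch (the chunk size is an illustrative choice, not a value from this project):

```python
from transformers import AutoTokenizer, pipeline

MODEL = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
summarizer = pipeline("summarization", model=MODEL, tokenizer=tokenizer)

def summarize_long_cv(text: str, chunk_tokens: int = 900) -> str:
    """Summarize text longer than the model's 1024-token window by chunking."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    partials = [
        summarizer(tokenizer.decode(chunk), max_length=120, min_length=20)[0]["summary_text"]
        for chunk in chunks
    ]
    # Naive join; as noted above, chunk-local summaries can lose cross-chunk context.
    return " ".join(partials)
```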

### `sentence-transformers/msmarco-distilbert-base-v4` (for CV Ranking)
#### Pros:
1. **Efficiency**: DistilBERT-based models like `msmarco-distilbert-base-v4` are lightweight and faster compared to larger models, making them suitable for applications where speed is critical.
2. **Pre-trained on Relevant Data**: The `msmarco-distilbert-base-v4` model is fine-tuned on the MS MARCO dataset, which is designed for question-answering and information retrieval tasks. This makes it effective at ranking CVs by similarity to a job description (a ranking sketch follows this list).
3. **Good Accuracy**: Despite being lightweight, this model provides a good balance between accuracy and computational efficiency for ranking tasks.
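
A minimal ranking sketch with the `sentence-transformers` package (the job description and CV strings are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v4")

job_description = "Backend engineer with Python and PostgreSQL experience."  # illustrative
cvs = [
    "Python developer, four years of Django and Postgres work.",  # illustrative
    "Graphic designer skilled in Figma and branding.",            # illustrative
]

# Embed once, then rank CVs by cosine similarity to the job description.
job_emb = model.encode(job_description, convert_to_tensor=True)
cv_embs = model.encode(cvs, convert_to_tensor=True)
scores = util.cos_sim(job_emb, cv_embs)[0]

for score, cv in sorted(zip(scores.tolist(), cvs), reverse=True):
    print(f"{score:.3f}  {cv}")
```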

#### Cons:
1. **Limited Expressiveness**: As a distilled model, it may lack some of the nuanced understanding that larger, more complex models can provide, especially in cases requiring deep semantic comprehension.
2. **Potential for Missing Context**: While good for short queries, the model might struggle with capturing the full context of longer job descriptions or CVs.

## Other Options Considered and Why They Were Not Chosen:
### **1. Ranking Alternative: `sentence-transformers/all-mpnet-base-v2`**
#### **Cons:**
- It is a larger model than the `distilbert`-based alternatives, which means higher memory usage and slightly longer processing times.

### **2. Summarization Alternative: `facebook/bart-large-cnn`**
#### **Cons:**
- `facebook/bart-large-cnn` has a maximum token limit (usually 1024 tokens), which can be a significant limitation when dealing with long CVs. This requires chunking the text, which might lead to loss of context and less accurate summaries.
- BART is a large model, which makes it computationally expensive to run, especially if processing many CVs simultaneously. This can lead to longer processing times and higher resource usage.
- The model might produce incoherent or incomplete summaries, especially when summarizing text chunks independently without considering the full document context.

### **3. Ranking and Summarization Alternative: `longformer` (for Long Documents)**
#### **Cons:**
- `longformer` is more complex to fine-tune and may require more extensive training or adaptation to specific tasks compared to more standard models.
- Despite its efficiency with long texts, it still requires significant computational resources, especially when dealing with very long documents.
- There might be fewer pre-trained versions fine-tuned specifically for tasks like summarization or ranking compared to models like BART, requiring more customization.


## Reason for Selecting These Evaluation Metrics

### **1. Mean Reciprocal Rank (MRR)**
#### **Pros:**
- **Intuitive Interpretation**: MRR is simple to understand and calculate, providing a clear indication of how early in the ranked list the relevant items appear.
- **Useful for Single Relevant Item**: MRR is particularly effective when there is only one relevant item per query, as it emphasizes the position of the first relevant item.
- **Efficient Calculation**: MRR is computationally efficient, making it suitable for quick evaluations in applications where speed is essential.

#### **Cons:**
- **Single Relevant Item Focus**: MRR does not account for multiple relevant items within the ranked list; it only considers the rank of the first relevant item.
- **Position Sensitivity**: MRR is highly sensitive to the position of the first relevant item but ignores the ranks of subsequent relevant items, potentially leading to a skewed evaluation if multiple relevant items exist.
- **Binary Relevance Assumption**: MRR assumes a binary relevance (i.e., relevant or not), which might not fully capture the nuances of relevance in some contexts, such as ranking CVs by how well they match a job description.
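
For reference, MRR averages the reciprocal rank of the first relevant item across queries; a minimal sketch (the relevance lists are illustrative):

```python
def mean_reciprocal_rank(ranked_relevance: list[list[bool]]) -> float:
    """MRR = (1/|Q|) * sum over queries of 1 / rank of the first relevant item (0 if none)."""
    total = 0.0
    for relevance in ranked_relevance:
        for rank, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

# Two queries: first relevant CV at rank 2, then at rank 1 -> (1/2 + 1) / 2 = 0.75
print(mean_reciprocal_rank([[False, True, False], [True, False, True]]))
```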

### **2. Normalized Discounted Cumulative Gain (NDCG) [Preferred]**
#### **Pros:**
- **Rank Position Sensitivity**: NDCG takes into account the position of relevant items in the ranked list, giving higher importance to items that appear earlier.
- **Handles Multiple Relevant Items**: NDCG effectively handles cases where multiple relevant items are present, making it more versatile than MRR.
- **Normalization**: The normalization aspect of NDCG allows for comparison across different queries or datasets, making it more robust in varied scenarios.

#### **Cons:**
- **Complexity**: NDCG is more complex to calculate compared to MRR, especially in cases with large datasets, which might require additional computational resources.
- **Interpretability**: While NDCG is powerful, its interpretation is less intuitive than MRR, especially for non-experts, as it involves logarithmic discounting and normalization.
- **Dependent on Relevance Scores**: NDCG relies on relevance scores for each item in the list, which means the effectiveness of NDCG can be impacted by the accuracy and reliability of these scores.
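
A minimal NDCG@k sketch built from graded relevance scores (the scores are illustrative):

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    # Logarithmic discount: relevant items at lower ranks contribute less.
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances: list[float], k: int) -> float:
    """DCG of the predicted order divided by DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of CVs in predicted rank order (3 = strong match, 0 = no match).
print(ndcg([2, 3, 0, 1], k=4))  # ~0.91
```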

### **3. BERTScore**
#### **Pros:**
- **Contextual Similarity**: BERTScore uses contextual embeddings from BERT, making it capable of capturing semantic similarity at a deeper level than traditional metrics like ROUGE or BLEU.
- **Precision, Recall, F1**: BERTScore provides a detailed evaluation by offering precision, recall, and F1 scores, which gives a more comprehensive view of the generated text's quality.
- **Robustness**: It is robust to variations in wording and synonyms, which makes it suitable for evaluating the quality of generated summaries, especially in contexts where exact word matches are not necessary.

#### **Cons:**
- **Computationally Intensive**: BERTScore is computationally heavy, as it requires running BERT or a similar model for each pair of sentences, making it slower and more resource-intensive than traditional metrics.
- **Dependency on Pre-trained Models**: The quality of BERTScore depends heavily on the pre-trained model used, and its performance may vary depending on the model’s training data and domain.
- **Less Interpretability**: While BERTScore is powerful, its results are less interpretable for non-experts, as they are based on complex embeddings rather than straightforward word overlaps.
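
A minimal sketch using the `bert-score` package (the candidate and reference summaries are illustrative):

```python
from bert_score import score

candidates = ["Data engineer with five years of ETL experience."]            # generated summary (illustrative)
references = ["Seasoned data engineer; five years building ETL pipelines."]  # reference summary (illustrative)

# Returns per-sentence precision, recall, and F1 as tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.mean().item():.3f}  R={R.mean().item():.3f}  F1={F1.mean().item():.3f}")
```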

## Future Improvements

- Implement additional evaluation metrics.
- Improve the robustness of the text extraction and cleaning processes.
- Implement a proper relational database schema with PostgreSQL.
- Implement a vector database with Milvus or Weaviate.
