From 1b793eee3c68b7b72a26ed715a37fb7da10086ef Mon Sep 17 00:00:00 2001
From: Claret Ibeawuchi <ibeawuchiclaret@gmail.com>
Date: Wed, 28 Aug 2024 14:11:39 -0700
Subject: [PATCH 1/2] Update README.md

---
 README.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b51d38a..8962dea 100644
--- a/README.md
+++ b/README.md
@@ -111,8 +111,88 @@ flowchart TD
     N --> O[End]
 ```
 
+## Reason for Choice of Models:
+### `sentence-transformers/msmarco-distilbert-base-v4` (for CV Ranking)
+#### Pros:
+1. **Efficiency**: DistilBERT-based models like `msmarco-distilbert-base-v4` are lightweight and faster compared to larger models, making them suitable for applications where speed is critical.
+2. **Pre-trained on Relevant Data**: The `msmarco-distilbert-base-v4` model is fine-tuned on the MS MARCO dataset, which is designed for question-answering and information retrieval tasks. This makes it effective in ranking CVs based on similarity to a job description.
+3. **Good Accuracy**: Despite being lightweight, this model provides a good balance between accuracy and computational efficiency for ranking tasks.
+
+#### Cons:
+1. **Limited Expressiveness**: As a distilled model, it may lack some of the nuanced understanding that larger, more complex models can provide, especially in cases requiring deep semantic comprehension.
+2. **Potential for Missing Context**: While good for short queries, the model might struggle with capturing the full context of longer job descriptions or CVs.
+
+### `facebook/bart-large-cnn` (for CV Summarization)
+#### Pros:
+1. **Strong Summarization Capabilities**: BART is a powerful model for text generation tasks, including summarization, and `bart-large-cnn` is fine-tuned specifically for this purpose. It can produce coherent and contextually relevant summaries.
+2. **Ability to Handle Complex Text**: The model is designed to manage and summarize complex documents, making it suitable for condensing detailed CVs into concise summaries.
+3. **Pre-trained on Large Datasets**: The model benefits from extensive pre-training on diverse datasets, enabling it to generalize well across different types of content.
+
+#### Cons:
+1. **Input Token Limit**: `facebook/bart-large-cnn` has a maximum token limit (usually 1024 tokens), which can be a significant limitation when dealing with long CVs. This requires chunking the text, which might lead to loss of context and less accurate summaries.
+2. **Computationally Expensive**: BART is a large model, which makes it computationally expensive to run, especially if processing many CVs simultaneously. This can lead to longer processing times and higher resource usage.
+3. **Inconsistencies in Summaries**: The model might produce incoherent or incomplete summaries, especially when summarizing text chunks independently without considering the full document context.
+
+## Other Options Considered and Reason for not going with them:
+### **1. Ranking Alternative: `sentence-transformers/all-mpnet-base-v2`**
+#### **Cons:**
+- It requires more resources than `distilbert` models.
+- The model is larger in size, which might lead to slightly longer processing times and higher memory usage compared to `distilbert`.
+
+### **2. Summarization Alternative: `t5-large`**
+#### **Cons:**
+- T5-large is resource-intensive, requiring significant computational power for inference, especially when dealing with large volumes of text.
+- Due to its size, T5-large can be slower in generating summaries compared to smaller models like BART.
+- The model's size also means higher memory usage, which could be a limiting factor on resource-constrained systems.
+
+### **3. Ranking and Summarization Alternative: `longformer` (for Long Documents)**
+#### **Cons:**
+- `longformer` is more complex to fine-tune and may require more extensive training or adaptation to specific tasks compared to more standard models.
+-: Despite its efficiency with long texts, it still requires significant computational resources, especially when dealing with very long documents.
+- There might be fewer pre-trained versions fine-tuned specifically for tasks like summarization or ranking compared to models like BART, requiring more customization.
+
+
+## Reason for selecting these evaluation metrics
+
+### **1. Mean Reciprocal Rank (MRR)**
+#### **Pros:**
+- **Intuitive Interpretation**: MRR is simple to understand and calculate, providing a clear indication of how early in the ranked list the relevant items appear.
+- **Useful for Single Relevant Item**: MRR is particularly effective when there is only one relevant item per query, as it emphasizes the position of the first relevant item.
+- **Efficient Calculation**: MRR is computationally efficient, making it suitable for quick evaluations in applications where speed is essential.
+
+#### **Cons:**
+- **Single Relevant Item Focus**: MRR does not account for multiple relevant items within the ranked list; it only considers the rank of the first relevant item.
+- **Position Sensitivity**: MRR is highly sensitive to the position of the first relevant item but ignores the ranks of subsequent relevant items, potentially leading to a skewed evaluation if multiple relevant items exist.
+- **Binary Relevance Assumption**: MRR assumes a binary relevance (i.e., relevant or not), which might not fully capture the nuances of relevance in some contexts, such as ranking CVs by how well they match a job description.
+
+### **2. Normalized Discounted Cumulative Gain (NDCG) [Preferred]**
+#### **Pros:**
+- **Rank Position Sensitivity**: NDCG takes into account the position of relevant items in the ranked list, giving higher importance to items that appear earlier.
+- **Handles Multiple Relevant Items**: NDCG effectively handles cases where multiple relevant items are present, making it more versatile than MRR.
+- **Normalization**: The normalization aspect of NDCG allows for comparison across different queries or datasets, making it more robust in varied scenarios.
+
+#### **Cons:**
+- **Complexity**: NDCG is more complex to calculate compared to MRR, especially in cases with large datasets, which might require additional computational resources.
+- **Interpretability**: While NDCG is powerful, its interpretation is less intuitive than MRR, especially for non-experts, as it involves logarithmic discounting and normalization.
+- **Dependent on Relevance Scores**: NDCG relies on relevance scores for each item in the list, which means the effectiveness of NDCG can be impacted by the accuracy and reliability of these scores.
+
+### **3. BERTScore**
+#### **Pros:**
+- **Contextual Similarity**: BERTScore uses contextual embeddings from BERT, making it capable of capturing semantic similarity at a deeper level than traditional metrics like ROUGE or BLEU.
+- **Precision, Recall, F1**: BERTScore provides a detailed evaluation by offering precision, recall, and F1 scores, which gives a more comprehensive view of the generated text's quality.
+- **Robustness**: It is robust to variations in wording and synonyms, which makes it suitable for evaluating the quality of generated summaries, especially in contexts where exact word matches are not necessary.
+
+#### **Cons:**
+- **Computationally Intensive**: BERTScore is computationally heavy, as it requires running BERT or a similar model for each pair of sentences, making it slower and more resource-intensive than traditional metrics.
+- **Dependency on Pre-trained Models**: The quality of BERTScore depends heavily on the pre-trained model used, and its performance may vary depending on the model’s training data and domain.
+- **Less Interpretability**: While BERTScore is powerful, its results are less interpretable for non-experts, as they are based on complex embeddings rather than straightforward word overlaps.
+
+
+
+
+
 ## Future Improvements
 
 - Implement additional evaluation metrics.
 - Improve the robustness of the text extraction and cleaning processes.
-- Implement a good database structure with postgres.
+- Implement a good vector database structure with milvus or weaviate.

From 7ef53b946795d651db7d6357f74f0f2740f3a2e2 Mon Sep 17 00:00:00 2001
From: Claret Ibeawuchi <ibeawuchiclaret@gmail.com>
Date: Wed, 28 Aug 2024 19:58:45 -0700
Subject: [PATCH 2/2] Update README.md

Changed summarization model
---
 README.md | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 8962dea..c624a12 100644
--- a/README.md
+++ b/README.md
@@ -112,6 +112,24 @@ flowchart TD
 ```
 
 ## Reason for Choice of Models:
+
+###  `sshleifer/distilbart-cnn-12-6` (for CV Summarization)
+### Pros:
+1. **Efficiency**: `sshleifer/distilbart-cnn-12-6` is a distilled version of BART, which makes it smaller and faster than the original BART model (`facebook/bart-large-cnn`). This can lead to quicker inference times, especially beneficial when processing multiple CVs.
+
+2. **Good Performance**: Despite being a distilled model, `distilbart-cnn-12-6` retains much of the performance of its larger counterpart. It can generate coherent and relevant summaries, which is crucial for summarizing CVs effectively.
+
+3. **Reduced Resource Consumption**: The model requires less memory and computational power, making it easier to deploy in environments with limited resources or when scaling the application to handle many users simultaneously.
+
+4. **Compatibility with Hugging Face's Pipeline**: The model is easily integrable with Hugging Face's pipeline, allowing for straightforward implementation and fine-tuning if necessary.
+
+### Cons:
+1. **Limited Capacity**: As a smaller model, `distilbart-cnn-12-6` may not capture as much context or nuance as the full-sized BART model. This might lead to less detailed or accurate summaries, especially with complex or long CVs.
+
+2. **Potential for Truncation**: The model's input token limit (1024 tokens) might still be an issue for longer CVs, leading to truncation and potentially missing important information in the summaries. This is a trade-off between efficiency and coverage.
+
+3. **Reduced Flexibility in Summarization**: While the model is generally good at summarization, it might struggle with highly technical or domain-specific language present in CVs, which could result in less relevant summaries compared to a larger model fine-tuned specifically for this task.
+
 ### `sentence-transformers/msmarco-distilbert-base-v4` (for CV Ranking)
 #### Pros:
 1. **Efficiency**: DistilBERT-based models like `msmarco-distilbert-base-v4` are lightweight and faster compared to larger models, making them suitable for applications where speed is critical.
@@ -122,28 +140,17 @@ flowchart TD
 1. **Limited Expressiveness**: As a distilled model, it may lack some of the nuanced understanding that larger, more complex models can provide, especially in cases requiring deep semantic comprehension.
 2. **Potential for Missing Context**: While good for short queries, the model might struggle with capturing the full context of longer job descriptions or CVs.
 
-### `facebook/bart-large-cnn` (for CV Summarization)
-#### Pros:
-1. **Strong Summarization Capabilities**: BART is a powerful model for text generation tasks, including summarization, and `bart-large-cnn` is fine-tuned specifically for this purpose. It can produce coherent and contextually relevant summaries.
-2. **Ability to Handle Complex Text**: The model is designed to manage and summarize complex documents, making it suitable for condensing detailed CVs into concise summaries.
-3. **Pre-trained on Large Datasets**: The model benefits from extensive pre-training on diverse datasets, enabling it to generalize well across different types of content.
-
-#### Cons:
-1. **Input Token Limit**: `facebook/bart-large-cnn` has a maximum token limit (usually 1024 tokens), which can be a significant limitation when dealing with long CVs. This requires chunking the text, which might lead to loss of context and less accurate summaries.
-2. **Computationally Expensive**: BART is a large model, which makes it computationally expensive to run, especially if processing many CVs simultaneously. This can lead to longer processing times and higher resource usage.
-3. **Inconsistencies in Summaries**: The model might produce incoherent or incomplete summaries, especially when summarizing text chunks independently without considering the full document context.
-
 ## Other Options Considered and Reason for not going with them:
 ### **1. Ranking Alternative: `sentence-transformers/all-mpnet-base-v2`**
 #### **Cons:**
 - It requires more resources than `distilbert` models.
 - The model is larger in size, which might lead to slightly longer processing times and higher memory usage compared to `distilbert`.
 
-### **2. Summarization Alternative: `t5-large`**
+### **2. Summarization Alternative: `facebook/bart-large-cnn`**
 #### **Cons:**
-- T5-large is resource-intensive, requiring significant computational power for inference, especially when dealing with large volumes of text.
-- Due to its size, T5-large can be slower in generating summaries compared to smaller models like BART.
-- The model's size also means higher memory usage, which could be a limiting factor on resource-constrained systems.
+- `facebook/bart-large-cnn` has a maximum token limit (usually 1024 tokens), which can be a significant limitation when dealing with long CVs. This requires chunking the text, which might lead to loss of context and less accurate summaries.
+- BART is a large model, which makes it computationally expensive to run, especially if processing many CVs simultaneously. This can lead to longer processing times and higher resource usage.
+- The model might produce incoherent or incomplete summaries, especially when summarizing text chunks independently without considering the full document context.
 
 ### **3. Ranking and Summarization Alternative: `longformer` (for Long Documents)**
 #### **Cons:**