
14th of June Updates #3

Open
alebjanes opened this issue Jun 14, 2024 · 2 comments
@alebjanes
Contributor

alebjanes commented Jun 14, 2024

Evaluation metrics

  1. Embeddings

1.1 Question ID = 1 (6,231 questions)

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 1 | 333 | 5.3 |
| Llama3 | 1 | 398 | 6.4 |
| all-mpnet-base-v2 | 1 | 4425 | 71.0 |
| multi-qa-MiniLM-L6-cos-v1 | 1 | 4260 | 68.4 |
| multi-qa-mpnet-base-cos-v1 | 1 | 4858 | 78.0 |
| all-MiniLM-L12-v2 | 1 | 4027 | 64.6 |

1.2 Question ID = 2 (20,000 out of 46,872 questions)

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 2 | | ~1.5 |
| Llama3 | 2 | | ~6.2 |
| all-mpnet-base-v2 | 2 | 19734 | 98.7 |
| multi-qa-MiniLM-L6-cos-v1 | 2 | 19639 | 98.2 |
| multi-qa-mpnet-base-cos-v1 | 2 | 19782 | 98.9 |
| all-MiniLM-L12-v2 | 2 | 19700 | 98.5 |

1.3 Question ID = 3 (15,000 out of 34,955 questions)

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 3 | | ~5.2 |
| Llama3 | 3 | | ~7.5 |
| all-mpnet-base-v2 | 3 | 13499 | 90.0 |
| multi-qa-MiniLM-L6-cos-v1 | 3 | 13862 | 92.4 |
| multi-qa-mpnet-base-cos-v1 | 3 | 14435 | 96.2 |
| all-MiniLM-L12-v2 | 3 | 12860 | 85.7 |
  2. Fine-tuned Model

  3. Wrapper + ApiBuilder

  4. ApiBuilder
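The accuracy column in the embedding tables above is just correct top-1 matches over total questions asked; a minimal sketch recomputing a few of the reported figures (model names and counts copied from the tables):

```python
def accuracy(correct: int, total: int) -> float:
    """Percent of questions whose top-1 retrieved match was correct."""
    return round(100 * correct / total, 1)

# (model, question_id): (correct_matches, total_questions), from the tables above
results = {
    ("multi-qa-mpnet-base-cos-v1", 1): (4858, 6231),
    ("all-mpnet-base-v2", 2): (19734, 20000),
    ("all-MiniLM-L12-v2", 3): (12860, 15000),
}
for (model, qid), (correct, total) in results.items():
    print(f"Q{qid} {model}: {accuracy(correct, total)}%")  # 78.0, 98.7, 85.7
```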

@alebjanes alebjanes changed the title Evaluation Metrics 14th of June Updates Jun 14, 2024
@pippo-sci
Contributor

pippo-sci commented Jun 14, 2024

Fine-tuning model:

Trained two models on the small_corpus dataset using Unsloth in Google Colab:

  • First model: 50k question-and-answer pairs, trained for one epoch on TinyLlama 1.1B (a small Llama2 version)
  • Second model: trained for 5 epochs on the same data with the same model

Manual evaluation on a subset of 50 questions randomly selected from the training set:

  • Model 1: followed the question-answering structure but repeated itself; no numbers were accurate.
  • Model 2: answer structure improved, but the numbers are still off.
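The 50-question sampling step can be sketched in a few lines (the dataset name and contents here are placeholders, not the actual small_corpus data):

```python
import random

random.seed(42)  # fixed seed so the evaluation subset is reproducible

# placeholder stand-in for the 50k Q&A pairs used for training
training_questions = [f"question_{i}" for i in range(50_000)]

# draw 50 distinct questions at random for manual evaluation
eval_subset = random.sample(training_questions, k=50)
print(len(eval_subset))  # → 50
```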

Next steps:
Since increasing the data size only increases the task complexity, increasing the number of epochs seems to be the way to go, at least for this testing stage.
Also, train a bigger model (Llama2 7B or Mistral 7B).
Move training to Caleuche (free Colab resources are exhausted) and train 2 versions:

  • TinyLlama, 10 epochs
  • Llama2/Mistral 7B, 5 epochs

Evaluate them using the RAG framework to compare metrics.
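The metric underlying the RAG-framework comparison is top-1 retrieval accuracy; a dependency-free sketch with toy vectors (real embeddings would come from the sentence-transformers models listed in the tables above):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top1_accuracy(query_vecs, target_vecs):
    """Fraction of queries whose most similar target is the one at the same index."""
    correct = 0
    for i, q in enumerate(query_vecs):
        best = max(range(len(target_vecs)), key=lambda j: cosine(q, target_vecs[j]))
        correct += best == i
    return correct / len(query_vecs)

# Toy check: every vector's nearest neighbour is itself.
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top1_accuracy(vecs, vecs))  # → 1.0
```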

@alexandersimoes

Agreed next steps:

@alebjanes will work on finalizing the evaluation with the additional questions. If there are too many combinations of questions, she will do random sampling. She will post results next Wednesday, as she will be out for a week afterwards.

@pippo-sci's next steps are written above ☝️; once we evaluate those results with the RAG framework, we will determine next steps.
