
14th of June Updates #3

Open
alebjanes opened this issue Jun 14, 2024 · 2 comments
@alebjanes
Contributor

alebjanes commented Jun 14, 2024

Evaluation metrics

  1. Embeddings

1.1 Question ID = 1 (6,231 questions)

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 1 | 333 | 5.3 |
| Llama3 | 1 | 398 | 6.4 |
| all-mpnet-base-v2 | 1 | 4425 | 71.0 |
| multi-qa-MiniLM-L6-cos-v1 | 1 | 4260 | 68.4 |
| multi-qa-mpnet-base-cos-v1 | 1 | 4858 | 78.0 |
| all-MiniLM-L12-v2 | 1 | 4027 | 64.6 |

1.2 Question ID = 2 (20,000 out of 46,872 questions)

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 2 | | ~1.5 |
| Llama3 | 2 | | ~6.2 |
| all-mpnet-base-v2 | 2 | 19734 | 98.7 |
| multi-qa-MiniLM-L6-cos-v1 | 2 | 19639 | 98.2 |
| multi-qa-mpnet-base-cos-v1 | 2 | 19782 | 98.9 |
| all-MiniLM-L12-v2 | 2 | 19700 | 98.5 |

1.3 Question ID = 3 (15,000 out of 34,955 questions)

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 3 | | ~5.2 |
| Llama3 | 3 | | ~7.5 |
| all-mpnet-base-v2 | 3 | 13499 | 90.0 |
| multi-qa-MiniLM-L6-cos-v1 | 3 | 13862 | 92.4 |
| multi-qa-mpnet-base-cos-v1 | 3 | 14435 | 96.2 |
| all-MiniLM-L12-v2 | 3 | 12860 | 85.7 |
  2. Fine-tuned Model

  3. Wrapper + ApiBuilder

  4. ApiBuilder
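The accuracy column in the embedding tables above is just correct top-1 matches over total questions asked; a minimal sketch recomputing a few of the reported figures (model names and counts copied from the tables):

```python
def accuracy(correct: int, total: int) -> float:
    """Percent of questions whose top-1 retrieved match was correct."""
    return round(100 * correct / total, 1)

# (model, question_id): (correct_matches, total_questions), from the tables above
results = {
    ("multi-qa-mpnet-base-cos-v1", 1): (4858, 6231),
    ("all-mpnet-base-v2", 2): (19734, 20000),
    ("all-MiniLM-L12-v2", 3): (12860, 15000),
}
for (model, qid), (correct, total) in results.items():
    print(f"Q{qid} {model}: {accuracy(correct, total)}%")  # 78.0, 98.7, 85.7
```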

@alebjanes alebjanes changed the title Evaluation Metrics 14th of June Updates Jun 14, 2024
@pippo-sci
Contributor

pippo-sci commented Jun 14, 2024

Fine-tuning model:

Trained two models on the small_corpus dataset using Unsloth in Google Colab:

  • First model: 50k question-and-answer pairs, trained for one epoch on TinyLlama 1.1B (a small Llama2 version)
  • Second model: trained for 5 epochs on the same data with the same model

Manual evaluation on a subset of 50 questions randomly selected from the training set:

  • Model 1: followed the question-answering structure but repeated itself; no numbers were accurate.
  • Model 2: answer structure improved, but the numbers are still off.
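The 50-question sampling step can be sketched in a few lines (the dataset name and contents here are placeholders, not the actual small_corpus data):

```python
import random

random.seed(42)  # fixed seed so the evaluation subset is reproducible

# placeholder stand-in for the 50k Q&A pairs used for training
training_questions = [f"question_{i}" for i in range(50_000)]

# draw 50 distinct questions at random for manual evaluation
eval_subset = random.sample(training_questions, k=50)
print(len(eval_subset))  # → 50
```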

Next steps:
Since increasing the data size only increases the task complexity, increasing the number of epochs seems to be the way to go, at least for this testing stage.
Also, train a bigger model (Llama2 7B or Mistral 7B).
Move training to Caleuche (free Colab resources are exhausted) and train 2 versions:

  • TinyLlama, 10 epochs
  • Llama2/Mistral 7B, 5 epochs

Evaluate them using the RAG framework to compare metrics.
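The metric underlying the RAG-framework comparison is top-1 retrieval accuracy; a dependency-free sketch with toy vectors (real embeddings would come from the sentence-transformers models listed in the tables above):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top1_accuracy(query_vecs, target_vecs):
    """Fraction of queries whose most similar target is the one at the same index."""
    correct = 0
    for i, q in enumerate(query_vecs):
        best = max(range(len(target_vecs)), key=lambda j: cosine(q, target_vecs[j]))
        correct += best == i
    return correct / len(query_vecs)

# Toy check: every vector's nearest neighbour is itself.
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top1_accuracy(vecs, vecs))  # → 1.0
```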

@alexandersimoes

Agreed next steps:

@alebjanes will work on finalizing the evaluation with the additional questions. If there are too many combinations of questions, she will do random sampling. She will post results next Wednesday, as she will be out for a week afterwards.

@pippo-sci's next steps are written above ☝️; once we evaluate those results with the RAG framework, we will determine next steps.
