Which version to use #2322
Hi! I think the leaderboard tasks were only added in 0.4.4? Also I believe the leaderboard uses this fork to run their evaluations: https://github.com/huggingface/lm-evaluation-harness
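One way to check whether the leaderboard task group is available in a given install is to list the registered tasks. A minimal sketch, assuming the 0.4.x `TaskManager` API and that the group name starts with `leaderboard`:

```python
# List registered task names and check for the "leaderboard" group, which is
# only present in newer harness releases. TaskManager / all_tasks are assumed
# to match the 0.4.x API.
from lm_eval.tasks import TaskManager

tm = TaskManager()
leaderboard_tasks = [t for t in tm.all_tasks if t.startswith("leaderboard")]
print(leaderboard_tasks or "no leaderboard tasks registered in this install")
```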
Thank you. I checked the leaderboard; they use 0.4.2 as of today. The link you shared seems to be from version 0.4.3.
Thank you for your response. I have actually reviewed their documentation. The results in the first table reflect the outcomes based on the installation guide provided in their docs. The second table corresponds to the repository link they mention in their documentation, which is the same one you recommended to me. The third table shows results for version 0.4.4. It seems that the leaderboard typically uses version 0.4.2, while the repository you recommended corresponds to version 0.4.3. Results from 0.4.3 are generally closer to those from 0.4.2, but depending on the task, performance can still vary significantly. As for version 0.4.4, which is the latest on this repo, the results differ considerably from the previous versions. That's why I've provided the results for each version, to highlight these differences. In some cases, models perform better in the newer version, but in others, they perform worse.

My main problem is that I hit an issue when trying to use version 0.4.2 from code, where what I pass to the harness is not a model-name string, e.g. `lm_eval_model = HFLM(model, device=device, ...)`.
I get an error when doing that. If I can fix that problem, I think I will keep using version 0.4.2 so that I can benchmark my model against the leaderboard models. (A rough sketch of the usage I mean is below.)
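For context, here is a minimal sketch of that kind of non-string-model usage, based on the 0.4.x Python API; the model id, device, and task name are placeholders, and 0.4.2 may reject a pre-loaded model here, which is exactly the error being discussed:

```python
# Sketch: wrap an already-instantiated transformers model in HFLM instead of
# passing a model-name string. Model id, device, and task are placeholders.
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Recent 0.4.x releases accept a model object for `pretrained`;
# 0.4.2 may fail here, which is the problem described in this thread.
lm_eval_model = HFLM(pretrained=model, tokenizer=tokenizer, device="cuda", batch_size=1)

results = lm_eval.simple_evaluate(model=lm_eval_model, tasks=["gsm8k"])  # placeholder task
print(results["results"])
```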
The experiments above were conducted using the command-line interface.
I'll look into this. For the error in 0.4.2, I think if you remove this condition and just return |
Thank you for your help. |
Hi! I made some runs on `main` and on the HF fork.
Aha, thanks. Here is my command.
Could you explain why the performance of the same model changes significantly depending on the version of lm_eval?
For example, with llama3-1-8b-instruct and batch_size=1:
- v0.4.2 (the leaderboard version)
- v0.4.3
- v0.4.4
I'm unsure which version to use, as my model performs well on version 0.4.4 but is outperformed by the base LLaMA 3.1-8B on version 0.4.2.
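For what it's worth, one way to make the cross-version comparison as controlled as possible is to pin each harness version in its own environment and run the identical evaluation, recording the installed version next to the scores. A sketch under those assumptions (the model id and task group are placeholders, and the `leaderboard` group only exists in newer releases):

```python
# Sketch: run the same evaluation under a pinned harness version (e.g. install
# lm_eval==0.4.2 or ==0.4.4 in separate virtualenvs) and record the version
# alongside the scores. Model id and task group are placeholders.
from importlib.metadata import version

import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained="meta-llama/Meta-Llama-3.1-8B-Instruct", batch_size=1)
out = lm_eval.simple_evaluate(model=lm, tasks=["leaderboard"])

print("harness version:", version("lm_eval"))
print(out["results"])
```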