Hi folks,

I am trying to run the Open LLM Leaderboard (v2) evals locally, and according to the blog https://huggingface.co/spaces/open-llm-leaderboard/blog the scores are normalized, with the random-prediction baseline accuracy subtracted for all tasks where this is applicable.
Is this by any chance supported in lm-evaluation-harness, and if not, do you have any pointers on where to look in order to add this behaviour?
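For concreteness, this is roughly what I understand the blog to describe, as a minimal Python sketch. The clipping at the baseline and the 0-100 rescaling are my reading of the blog post, and the example baseline is an illustrative assumption rather than a value taken from the leaderboard code:

```python
# Minimal sketch of the normalization as I understand it from the blog:
# subtract the random-guessing baseline and rescale to 0-100, flooring at 0.
# The baseline used in the example is an illustrative assumption.

def normalize_score(raw: float, random_baseline: float, max_score: float = 1.0) -> float:
    """Map a raw accuracy in [0, max_score] onto a 0-100 scale where the
    random baseline corresponds to 0 and max_score corresponds to 100."""
    if raw <= random_baseline:
        return 0.0
    return 100.0 * (raw - random_baseline) / (max_score - random_baseline)

# e.g. a 4-way multiple-choice task (random baseline 0.25):
print(normalize_score(0.62, random_baseline=0.25))  # ~49.3
```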
I have already tried, but the reported scores do not match the ones on https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, and their branch also crashes with --apply_chat_template and the vllm backend.
At the same time, the same setup works just fine on main here, which is why I was hoping to add this normalization/aggregation step so that the leaderboard is fully reproducible locally with lm-evaluation-harness.
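To show what I mean by "reproducible locally", this is roughly the workflow I have in mind on main. It is only a sketch under several assumptions on my side: that the leaderboard task group is available in the harness version I am on, that simple_evaluate accepts apply_chat_template / fewshot_as_multiturn keyword arguments there, and that per-task metrics show up under keys like "acc_norm,none" in the returned results; the model name is just an example.

```python
# Rough sketch only, under the assumptions listed above: run the leaderboard
# task group via the harness, then post-process the raw accuracies with a
# normalization like the one sketched earlier in this issue.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["leaderboard"],
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)

for task, metrics in results["results"].items():
    # Prefer length-normalized accuracy when present; fall back to raw accuracy.
    raw = metrics.get("acc_norm,none", metrics.get("acc,none"))
    if raw is not None:
        print(task, raw)
```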