Locally reproducible HF-Leaderboard evals #2338

Open · eldarkurtic opened this issue Sep 24, 2024 · 3 comments
Labels: asking questions (for asking for clarification / support on library usage)

@eldarkurtic (Contributor):

Hi folks,

I am trying to run the HF Leaderboard (v2) evals locally. According to the blog (https://huggingface.co/spaces/open-llm-leaderboard/blog), scores are normalized: the random-prediction baseline accuracy is subtracted for all tasks where this is applicable.
Is this supported in lm-evaluation-harness, and if not, do you have any pointers on where to look in order to add this behaviour?
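
For concreteness, this is my understanding of the normalization from the blog, as a rough Python sketch; the function name and the example baseline below are illustrative, not taken from the harness:

```python
# Rough sketch of the normalization described in the blog: rescale scores so
# the random baseline maps to 0 and a perfect score maps to 100. Names and
# the example baseline are illustrative, not from lm-evaluation-harness.
def normalize_within_range(score: float, lower_bound: float, higher_bound: float = 1.0) -> float:
    if score < lower_bound:
        return 0.0  # below-random accuracy is clamped to 0
    return (score - lower_bound) / (higher_bound - lower_bound) * 100.0

# Example: a 4-way multiple-choice task has a random baseline of 0.25,
# so a raw accuracy of 0.62 normalizes to ~49.3.
print(normalize_within_range(0.62, lower_bound=0.25))
```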

@baberabb (Contributor):

Hi! Their documentation links to this branch.

@baberabb added the asking questions label on Sep 24, 2024.
@eldarkurtic (Contributor, Author):

I have already tried that branch, but the reported scores do not match the ones on https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, and the branch also crashes with --apply_chat_template on the vLLM backend.
At the same time, this works just fine on main here, which is why I was hoping to add the score normalization/aggregation so the leaderboard is fully reproducible locally with lm-evaluation-harness.
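
For reference, this is roughly what I am running on main. A sketch only: the model path is a placeholder, and `simple_evaluate` with these keyword arguments exists in recent lm-eval releases, but the exact signature may differ in your installed version:

```python
# Sketch of the main-branch run; the model path is a placeholder and the
# keyword arguments should be checked against the installed lm-eval version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=auto",
    tasks=["leaderboard"],        # the Open LLM Leaderboard v2 task group
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(results["results"])
```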

@maziyarpanahi:

I also failed to use their repo/branch with the vLLM backend. The results are way off!
