Locally reproducible HF-Leaderboard evals #2338

Open · eldarkurtic opened this issue Sep 24, 2024 · 3 comments
Labels: asking questions (for asking for clarification / support on library usage)

@eldarkurtic (Contributor):

Hi folks,

I am trying to run the HF Leaderboard (v2) evals locally. According to the blog (https://huggingface.co/spaces/open-llm-leaderboard/blog), scores are normalized: the random-prediction baseline accuracy is subtracted for all tasks where this is applicable.
Is this supported in lm-evaluation-harness, and if not, do you have any pointers on where to look in order to add this behaviour?
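
For concreteness, this is my understanding of the normalization from the blog, as a rough Python sketch; the function name and the example baseline below are illustrative, not taken from the harness:

```python
# Rough sketch of the normalization described in the blog: rescale scores so
# the random baseline maps to 0 and a perfect score maps to 100. Names and
# the example baseline are illustrative, not from lm-evaluation-harness.
def normalize_within_range(score: float, lower_bound: float, higher_bound: float = 1.0) -> float:
    if score < lower_bound:
        return 0.0  # below-random accuracy is clamped to 0
    return (score - lower_bound) / (higher_bound - lower_bound) * 100.0

# Example: a 4-way multiple-choice task has a random baseline of 0.25,
# so a raw accuracy of 0.62 normalizes to ~49.3.
print(normalize_within_range(0.62, lower_bound=0.25))
```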

@baberabb (Contributor):

Hi! Their documentation links to this branch.

@baberabb added the asking questions label on Sep 24, 2024.
@eldarkurtic (Contributor, Author):

I have already tried that branch, but the reported scores do not match the ones on https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, and the branch also crashes with --apply_chat_template on the vLLM backend.
At the same time, this works just fine on main here, which is why I was hoping to add the score normalization/aggregation so the leaderboard is fully reproducible locally with lm-evaluation-harness.
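
For reference, this is roughly what I am running on main. A sketch only: the model path is a placeholder, and `simple_evaluate` with these keyword arguments exists in recent lm-eval releases, but the exact signature may differ in your installed version:

```python
# Sketch of the main-branch run; the model path is a placeholder and the
# keyword arguments should be checked against the installed lm-eval version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=auto",
    tasks=["leaderboard"],        # the Open LLM Leaderboard v2 task group
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(results["results"])
```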

@maziyarpanahi:

I also failed to use their repo/branch with the vLLM backend. The results are way off!
