This PR fixes #2346
As described in the issue above, this "unexpected" space causes a complete collapse for some evals (e.g. MMLU-Pro), and for others it blows up the scores, so the `--tasks=leaderboard` evaluation produces very bad results compared to the ones on the Open LLM Leaderboard (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).

With this fix, the Leaderboard-v2 results are fully reproducible and their hashes (`doc_hash`, `prompt_hash`, `target_hash`) match those in the Leaderboard-v2 log files, which are available at: