Many benchmarks are becoming saturated by new models. LMSYS has crowd-sourced a variety of hard prompts from the community, and model scores on these prompts correlate strongly with Elo scores.
How used is it in the community?
Recent papers are starting to report ArenaHard as a core metric for measuring improvements from new post-training methods. It is also becoming an alternative to MT-Bench due to its difficulty and its real-world source of prompts.
Evaluation short description
Evaluation metadata
Provide all available