[EVAL] Add ArenaHardAuto #325

lewtun · 2024-09-23T09:40:49Z

Evaluation short description

Why is this evaluation interesting?

Many benchmarks are becoming saturated by new models. LMSYS has crowd-sourced a variety of hard prompts from the community, and model performance on these prompts correlates strongly with Elo scores.

How used is it in the community?

Recent papers are starting to report ArenaHard as a core metric for measuring the improvements from new post-training methods. It is also becoming an alternative to MT-Bench due to its difficulty and real-world source of prompts.

Evaluation metadata

Provide all available

Paper url: https://arxiv.org/abs/2406.11939
Github url: https://github.com/lm-sys/arena-hard-auto
Dataset url: https://github.com/lm-sys/arena-hard-auto/blob/main/data/arena-hard-v0.1/question.jsonl

lewtun added the new task label Sep 23, 2024

clefourrier added the prio label Sep 23, 2024
