# MMLU

### Paper

Title: `Measuring Massive Multitask Language Understanding`

Abstract: https://arxiv.org/abs/2009.03300

The test covers 57 tasks, including elementary mathematics, US history, computer science, law, and more.

Homepage: https://github.com/hendrycks/test

Note: The `Flan` variants are derived from [this repository](https://github.com/jasonwei20/flan-2), as described in Appendix D.1 of [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416).

### Citation

```
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
```

### Groups, Tags, and Tasks

#### Groups

* `mmlu`: `Original multiple-choice MMLU benchmark`
* `mmlu_continuation`: `MMLU but with continuation prompts`
* `mmlu_generation`: `Generative MMLU variant`

`mmlu` is the original benchmark as implemented by Hendrycks et al., with the answer choices in context and the answer letters (e.g., `A`, `B`, `C`, `D`) as the continuation.
`mmlu_continuation` is a cloze-style variant: the answer choices are omitted from the context, and the full text of the correct choice is the continuation.
`mmlu_generation` is a generative variant, similar to the original, in which the model is asked to generate the correct answer letter.
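
For reference, here is a minimal sketch of running one of these groups through the harness's Python API. It assumes lm-eval v0.4+, where `lm_eval.simple_evaluate` accepts a model backend name, model arguments, and a task list; the checkpoint name is purely an example.

```python
# Minimal sketch, assuming lm-eval v0.4+; the checkpoint is an arbitrary example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-1b",  # example model; swap in your own
    tasks=["mmlu"],       # or "mmlu_continuation" / "mmlu_generation"
    num_fewshot=5,        # MMLU is conventionally reported 5-shot
    batch_size=8,
)

# Per-task and aggregated group scores are reported under "results".
print(results["results"]["mmlu"])
```

The same call works for any of the group or subgroup names listed in this README.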


#### Subgroups

* `mmlu_stem`
* `mmlu_humanities`
* `mmlu_social_sciences`
* `mmlu_other`

Subgroup variants are prefixed with the subgroup name, e.g., `mmlu_stem_continuation`; a sketch for listing the registered names follows below.
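
A quick way to enumerate these names is to query the task registry. This is a sketch assuming the `TaskManager` helper and its `all_tasks` listing from lm-eval v0.4+:

```python
# Sketch, assuming lm-eval v0.4+ exposes the task registry via TaskManager.
from lm_eval.tasks import TaskManager

task_manager = TaskManager()
mmlu_names = [t for t in task_manager.all_tasks if t.startswith("mmlu")]
print("\n".join(sorted(mmlu_names)))  # mmlu, mmlu_stem_continuation, ...
```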

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

### Changelog

* ver 1 (PR #497): switch to the original implementation
* ver 2 (PR #2116): add a missing newline in the description
