
Language Model Evaluation Harness


Overview

This project provides a unified framework to test autoregressive language models (GPT-2, GPT-3, GPT-Neo, etc.) on a large number of different evaluation tasks.

Features:

  • 200+ tasks implemented
  • Support for GPT-2, GPT-3, GPT-Neo, GPT-NeoX, and GPT-J, with flexible tokenization-agnostic interface
  • Task versioning to ensure reproducibility

Install

pip install lm-eval

Basic Usage

To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), run the following command. When reporting results from the eval harness, please include the task versions (shown in results["versions"]) for reproducibility. This allows bugs in tasks to be fixed while keeping previously reported scores reproducible. See the Task Versioning section for more info.

python main.py \
	--model gpt2 \
	--device 0 \
	--tasks lambada,hellaswag

(This uses the 117M-parameter gpt2 checkpoint by default, per the Hugging Face default; use --model_args to specify other GPT-2 sizes, as shown below.)
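For example, a run against the larger gpt2-medium checkpoint (a standard model id on the Hugging Face Hub) would look roughly like this, passing it through --model_args just as for any other pretrained model:

python main.py \
    --model gpt2 \
    --model_args pretrained=gpt2-medium \
    --device 0 \
    --tasks lambada,hellaswag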

Additional arguments can be provided to the model constructor using the --model_args flag. Most importantly, the gpt2 model can be used to load an arbitrary HuggingFace model. For example, to run GPT-Neo, use the following:

python main.py \
	--model gpt2 \
	--model_args pretrained=EleutherAI/gpt-neo-2.7B \
	--device 0 \
	--tasks lambada,hellaswag

If you have access to the OpenAI API, you can also evaluate GPT-3:

export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
	--model gpt3 \
	--model_args engine=davinci \
	--tasks lambada,hellaswag

And if you want to verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the --check_integrity flag:

python main.py \
	--model gpt3 \
	--model_args engine=davinci \
	--tasks lambada,hellaswag \
	--check_integrity

To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through this script.

Implementing new tasks

To implement a new task in eval harness, see this guide.

Cite as

@software{eval-harness,
  author       = {Gao, Leo and
                  Tow, Jonathan and
                  Biderman, Stella and
                  Black, Sid and
                  DiPofi, Anthony and
                  Foster, Charles and
                  Golding, Laurence and
                  Hsu, Jeffrey and
                  McDonell, Kyle and
                  Muennighoff, Niklas and
                  Phang, Jason and
                  Reynolds, Laria and
                  Tang, Eric and
                  Thite, Anish and
                  Wang, Ben and
                  Wang, Kevin and
                  Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628}
}

Full Task List

Task Name Train Val Test Val/Test Docs Metrics
cola ✓ ✓ 1043 mcc
mnli ✓ ✓ 9815 acc
mnli_mismatched ✓ ✓ 9832 acc
mrpc ✓ ✓ 408 acc, f1
rte ✓ ✓ 277 acc
qnli ✓ ✓ 5463 acc
qqp ✓ ✓ 40430 acc, f1
sst ✓ ✓ 872 acc
wnli ✓ ✓ 71 acc
boolq ✓ ✓ 3270 acc
cb ✓ ✓ 56 acc, f1
copa ✓ ✓ 100 acc
multirc ✓ ✓ 4848 acc
record ✓ ✓ 10000 f1, em
wic ✓ ✓ 638 acc
wsc ✓ ✓ 104 acc
coqa ✓ ✓ 500 f1, em
drop ✓ ✓ 9536 em, f1
lambada ✓ 5153 ppl, acc
lambada_cloze ✓ 5153 ppl, acc
lambada_mt_en ✓ 5153 ppl, acc
lambada_mt_fr ✓ 5153 ppl, acc
lambada_mt_de ✓ 5153 ppl, acc
lambada_mt_it ✓ 5153 ppl, acc
lambada_mt_es ✓ 5153 ppl, acc
wikitext ✓ ✓ 62 word_perplexity, byte_perplexity, bits_per_byte
piqa ✓ ✓ 1838 acc, acc_norm
prost ✓ 18736 acc, acc_norm
mc_taco ✓ ✓ 9442 f1, em
pubmedqa ✓ 1000 acc
sciq ✓ ✓ ✓ 1000 acc, acc_norm
qa4mre_2011 ✓ 120 acc, acc_norm
qa4mre_2012 ✓ 160 acc, acc_norm
qa4mre_2013 ✓ 284 acc, acc_norm
triviaqa ✓ ✓ 11313 acc
arc_easy ✓ ✓ ✓ 2376 acc, acc_norm
arc_challenge ✓ ✓ ✓ 1172 acc, acc_norm
logiqa ✓ ✓ ✓ 651 acc, acc_norm
hellaswag ✓ ✓ 10042 acc, acc_norm
openbookqa ✓ ✓ ✓ 500 acc, acc_norm
squad2 ✓ ✓ 11873 exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1
race ✓ ✓ ✓ 1045 acc
headqa ✓ ✓ ✓ 2742 acc, acc_norm
headqa_es ✓ ✓ ✓ 2742 acc, acc_norm
headqa_en ✓ ✓ ✓ 2742 acc, acc_norm
mathqa ✓ ✓ ✓ 2985 acc, acc_norm
webqs ✓ ✓ 2032 acc
wsc273 ✓ 273 acc
winogrande ✓ ✓ 1267 acc
anli_r1 ✓ ✓ ✓ 1000 acc
anli_r2 ✓ ✓ ✓ 1000 acc
anli_r3 ✓ ✓ ✓ 1200 acc
ethics_cm ✓ ✓ 3885 acc
ethics_deontology ✓ ✓ 3596 acc, em
ethics_justice ✓ ✓ 2704 acc, em
ethics_utilitarianism_original ✓ 4808 acc
ethics_utilitarianism ✓ ✓ 4808 acc
ethics_virtue ✓ ✓ 4975 acc, em
truthfulqa_mc ✓ 817 mc1, mc2
truthfulqa_gen ✓ 817 bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff
mutual ✓ ✓ 886 r@1, r@2, mrr
mutual_plus ✓ ✓ 886 r@1, r@2, mrr
math_algebra ✓ ✓ 1187 acc
math_counting_and_prob ✓ ✓ 474 acc
math_geometry ✓ ✓ 479 acc
math_intermediate_algebra ✓ ✓ 903 acc
math_num_theory ✓ ✓ 540 acc
math_prealgebra ✓ ✓ 871 acc
math_precalc ✓ ✓ 546 acc
math_asdiv ✓ 2305 acc
arithmetic_2da ✓ 2000 acc
arithmetic_2ds ✓ 2000 acc
arithmetic_3da ✓ 2000 acc
arithmetic_3ds ✓ 2000 acc
arithmetic_4da ✓ 2000 acc
arithmetic_4ds ✓ 2000 acc
arithmetic_5da ✓ 2000 acc
arithmetic_5ds ✓ 2000 acc
arithmetic_2dm ✓ 2000 acc
arithmetic_1dc ✓ 2000 acc
hendrycksTest-abstract_algebra ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-anatomy ✓ ✓ ✓ 135 acc, acc_norm
hendrycksTest-astronomy ✓ ✓ ✓ 152 acc, acc_norm
hendrycksTest-business_ethics ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-clinical_knowledge ✓ ✓ ✓ 265 acc, acc_norm
hendrycksTest-college_biology ✓ ✓ ✓ 144 acc, acc_norm
hendrycksTest-college_chemistry ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-college_computer_science ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-college_mathematics ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-college_medicine ✓ ✓ ✓ 173 acc, acc_norm
hendrycksTest-college_physics ✓ ✓ ✓ 102 acc, acc_norm
hendrycksTest-computer_security ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-conceptual_physics ✓ ✓ ✓ 235 acc, acc_norm
hendrycksTest-econometrics ✓ ✓ ✓ 114 acc, acc_norm
hendrycksTest-electrical_engineering ✓ ✓ ✓ 145 acc, acc_norm
hendrycksTest-elementary_mathematics ✓ ✓ ✓ 378 acc, acc_norm
hendrycksTest-formal_logic ✓ ✓ ✓ 126 acc, acc_norm
hendrycksTest-global_facts ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-high_school_biology ✓ ✓ ✓ 310 acc, acc_norm
hendrycksTest-high_school_chemistry ✓ ✓ ✓ 203 acc, acc_norm
hendrycksTest-high_school_computer_science ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-high_school_european_history ✓ ✓ ✓ 165 acc, acc_norm
hendrycksTest-high_school_geography ✓ ✓ ✓ 198 acc, acc_norm
hendrycksTest-high_school_government_and_politics ✓ ✓ ✓ 193 acc, acc_norm
hendrycksTest-high_school_macroeconomics ✓ ✓ ✓ 390 acc, acc_norm
hendrycksTest-high_school_mathematics ✓ ✓ ✓ 270 acc, acc_norm
hendrycksTest-high_school_microeconomics ✓ ✓ ✓ 238 acc, acc_norm
hendrycksTest-high_school_physics ✓ ✓ ✓ 151 acc, acc_norm
hendrycksTest-high_school_psychology ✓ ✓ ✓ 545 acc, acc_norm
hendrycksTest-high_school_statistics ✓ ✓ ✓ 216 acc, acc_norm
hendrycksTest-high_school_us_history ✓ ✓ ✓ 204 acc, acc_norm
hendrycksTest-high_school_world_history ✓ ✓ ✓ 237 acc, acc_norm
hendrycksTest-human_aging ✓ ✓ ✓ 223 acc, acc_norm
hendrycksTest-human_sexuality ✓ ✓ ✓ 131 acc, acc_norm
hendrycksTest-international_law ✓ ✓ ✓ 121 acc, acc_norm
hendrycksTest-jurisprudence ✓ ✓ ✓ 108 acc, acc_norm
hendrycksTest-logical_fallacies ✓ ✓ ✓ 163 acc, acc_norm
hendrycksTest-machine_learning ✓ ✓ ✓ 112 acc, acc_norm
hendrycksTest-management ✓ ✓ ✓ 103 acc, acc_norm
hendrycksTest-marketing ✓ ✓ ✓ 234 acc, acc_norm
hendrycksTest-medical_genetics ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-miscellaneous ✓ ✓ ✓ 783 acc, acc_norm
hendrycksTest-moral_disputes ✓ ✓ ✓ 346 acc, acc_norm
hendrycksTest-moral_scenarios ✓ ✓ ✓ 895 acc, acc_norm
hendrycksTest-nutrition ✓ ✓ ✓ 306 acc, acc_norm
hendrycksTest-philosophy ✓ ✓ ✓ 311 acc, acc_norm
hendrycksTest-prehistory ✓ ✓ ✓ 324 acc, acc_norm
hendrycksTest-professional_accounting ✓ ✓ ✓ 282 acc, acc_norm
hendrycksTest-professional_law ✓ ✓ ✓ 1534 acc, acc_norm
hendrycksTest-professional_medicine ✓ ✓ ✓ 272 acc, acc_norm
hendrycksTest-professional_psychology ✓ ✓ ✓ 612 acc, acc_norm
hendrycksTest-public_relations ✓ ✓ ✓ 110 acc, acc_norm
hendrycksTest-security_studies ✓ ✓ ✓ 245 acc, acc_norm
hendrycksTest-sociology ✓ ✓ ✓ 201 acc, acc_norm
hendrycksTest-us_foreign_policy ✓ ✓ ✓ 100 acc, acc_norm
hendrycksTest-virology ✓ ✓ ✓ 166 acc, acc_norm
hendrycksTest-world_religions ✓ ✓ ✓ 171 acc, acc_norm
wmt14-en-fr ✓ 3003 bleu, chrf, ter
wmt14-fr-en ✓ 3003 bleu, chrf, ter
wmt16-en-ro ✓ 1999 bleu, chrf, ter
wmt16-ro-en ✓ 1999 bleu, chrf, ter
wmt16-de-en ✓ 2999 bleu, chrf, ter
wmt16-en-de ✓ 2999 bleu, chrf, ter
wmt20-cs-en ✓ 664 bleu, chrf, ter
wmt20-de-en ✓ 785 bleu, chrf, ter
wmt20-de-fr ✓ 1619 bleu, chrf, ter
wmt20-en-cs ✓ 1418 bleu, chrf, ter
wmt20-en-de ✓ 1418 bleu, chrf, ter
wmt20-en-iu ✓ 2971 bleu, chrf, ter
wmt20-en-ja ✓ 1000 bleu, chrf, ter
wmt20-en-km ✓ 2320 bleu, chrf, ter
wmt20-en-pl ✓ 1000 bleu, chrf, ter
wmt20-en-ps ✓ 2719 bleu, chrf, ter
wmt20-en-ru ✓ 2002 bleu, chrf, ter
wmt20-en-ta ✓ 1000 bleu, chrf, ter
wmt20-en-zh ✓ 1418 bleu, chrf, ter
wmt20-fr-de ✓ 1619 bleu, chrf, ter
wmt20-iu-en ✓ 2971 bleu, chrf, ter
wmt20-ja-en ✓ 993 bleu, chrf, ter
wmt20-km-en ✓ 2320 bleu, chrf, ter
wmt20-pl-en ✓ 1001 bleu, chrf, ter
wmt20-ps-en ✓ 2719 bleu, chrf, ter
wmt20-ru-en ✓ 991 bleu, chrf, ter
wmt20-ta-en ✓ 997 bleu, chrf, ter
wmt20-zh-en ✓ 2000 bleu, chrf, ter
iwslt17-en-ar ✓ 1460 bleu, chrf, ter
iwslt17-ar-en ✓ 1460 bleu, chrf, ter
anagrams1 ✓ 10000 acc
anagrams2 ✓ 10000 acc
cycle_letters ✓ 10000 acc
random_insertion ✓ 10000 acc
reversed_words ✓ 10000 acc
pile_arxiv ✓ ✓ 2407 word_perplexity, byte_perplexity, bits_per_byte
pile_books3 ✓ ✓ 269 word_perplexity, byte_perplexity, bits_per_byte
pile_bookcorpus2 ✓ ✓ 28 word_perplexity, byte_perplexity, bits_per_byte
pile_dm-mathematics ✓ ✓ 1922 word_perplexity, byte_perplexity, bits_per_byte
pile_enron ✓ ✓ 1010 word_perplexity, byte_perplexity, bits_per_byte
pile_europarl ✓ ✓ 157 word_perplexity, byte_perplexity, bits_per_byte
pile_freelaw ✓ ✓ 5101 word_perplexity, byte_perplexity, bits_per_byte
pile_github ✓ ✓ 18195 word_perplexity, byte_perplexity, bits_per_byte
pile_gutenberg ✓ ✓ 80 word_perplexity, byte_perplexity, bits_per_byte
pile_hackernews ✓ ✓ 1632 word_perplexity, byte_perplexity, bits_per_byte
pile_nih-exporter ✓ ✓ 1884 word_perplexity, byte_perplexity, bits_per_byte
pile_opensubtitles ✓ ✓ 642 word_perplexity, byte_perplexity, bits_per_byte
pile_openwebtext2 ✓ ✓ 32925 word_perplexity, byte_perplexity, bits_per_byte
pile_philpapers ✓ ✓ 68 word_perplexity, byte_perplexity, bits_per_byte
pile_pile-cc ✓ ✓ 52790 word_perplexity, byte_perplexity, bits_per_byte
pile_pubmed-abstracts ✓ ✓ 29895 word_perplexity, byte_perplexity, bits_per_byte
pile_pubmed-central ✓ ✓ 5911 word_perplexity, byte_perplexity, bits_per_byte
pile_stackexchange ✓ ✓ 30378 word_perplexity, byte_perplexity, bits_per_byte
pile_uspto ✓ ✓ 11415 word_perplexity, byte_perplexity, bits_per_byte
pile_ubuntu-irc ✓ ✓ 22 word_perplexity, byte_perplexity, bits_per_byte
pile_wikipedia ✓ ✓ 17511 word_perplexity, byte_perplexity, bits_per_byte
pile_youtubesubtitles ✓ ✓ 342 word_perplexity, byte_perplexity, bits_per_byte
blimp_adjunct_island ✓ 1000 acc
blimp_anaphor_gender_agreement ✓ 1000 acc
blimp_anaphor_number_agreement ✓ 1000 acc
blimp_animate_subject_passive ✓ 1000 acc
blimp_animate_subject_trans ✓ 1000 acc
blimp_causative ✓ 1000 acc
blimp_complex_NP_island ✓ 1000 acc
blimp_coordinate_structure_constraint_complex_left_branch ✓ 1000 acc
blimp_coordinate_structure_constraint_object_extraction ✓ 1000 acc
blimp_determiner_noun_agreement_1 ✓ 1000 acc
blimp_determiner_noun_agreement_2 ✓ 1000 acc
blimp_determiner_noun_agreement_irregular_1 ✓ 1000 acc
blimp_determiner_noun_agreement_irregular_2 ✓ 1000 acc
blimp_determiner_noun_agreement_with_adj_2 ✓ 1000 acc
blimp_determiner_noun_agreement_with_adj_irregular_1 ✓ 1000 acc
blimp_determiner_noun_agreement_with_adj_irregular_2 ✓ 1000 acc
blimp_determiner_noun_agreement_with_adjective_1 ✓ 1000 acc
blimp_distractor_agreement_relational_noun ✓ 1000 acc
blimp_distractor_agreement_relative_clause ✓ 1000 acc
blimp_drop_argument ✓ 1000 acc
blimp_ellipsis_n_bar_1 ✓ 1000 acc
blimp_ellipsis_n_bar_2 ✓ 1000 acc
blimp_existential_there_object_raising ✓ 1000 acc
blimp_existential_there_quantifiers_1 ✓ 1000 acc
blimp_existential_there_quantifiers_2 ✓ 1000 acc
blimp_existential_there_subject_raising ✓ 1000 acc
blimp_expletive_it_object_raising ✓ 1000 acc
blimp_inchoative ✓ 1000 acc
blimp_intransitive ✓ 1000 acc
blimp_irregular_past_participle_adjectives ✓ 1000 acc
blimp_irregular_past_participle_verbs ✓ 1000 acc
blimp_irregular_plural_subject_verb_agreement_1 ✓ 1000 acc
blimp_irregular_plural_subject_verb_agreement_2 ✓ 1000 acc
blimp_left_branch_island_echo_question ✓ 1000 acc
blimp_left_branch_island_simple_question ✓ 1000 acc
blimp_matrix_question_npi_licensor_present ✓ 1000 acc
blimp_npi_present_1 ✓ 1000 acc
blimp_npi_present_2 ✓ 1000 acc
blimp_only_npi_licensor_present ✓ 1000 acc
blimp_only_npi_scope ✓ 1000 acc
blimp_passive_1 ✓ 1000 acc
blimp_passive_2 ✓ 1000 acc
blimp_principle_A_c_command ✓ 1000 acc
blimp_principle_A_case_1 ✓ 1000 acc
blimp_principle_A_case_2 ✓ 1000 acc
blimp_principle_A_domain_1 ✓ 1000 acc
blimp_principle_A_domain_2 ✓ 1000 acc
blimp_principle_A_domain_3 ✓ 1000 acc
blimp_principle_A_reconstruction ✓ 1000 acc
blimp_regular_plural_subject_verb_agreement_1 ✓ 1000 acc
blimp_regular_plural_subject_verb_agreement_2 ✓ 1000 acc
blimp_sentential_negation_npi_licensor_present ✓ 1000 acc
blimp_sentential_negation_npi_scope ✓ 1000 acc
blimp_sentential_subject_island ✓ 1000 acc
blimp_superlative_quantifiers_1 ✓ 1000 acc
blimp_superlative_quantifiers_2 ✓ 1000 acc
blimp_tough_vs_raising_1 ✓ 1000 acc
blimp_tough_vs_raising_2 ✓ 1000 acc
blimp_transitive ✓ 1000 acc
blimp_wh_island ✓ 1000 acc
blimp_wh_questions_object_gap ✓ 1000 acc
blimp_wh_questions_subject_gap ✓ 1000 acc
blimp_wh_questions_subject_gap_long_distance ✓ 1000 acc
blimp_wh_vs_that_no_gap ✓ 1000 acc
blimp_wh_vs_that_no_gap_long_distance ✓ 1000 acc
blimp_wh_vs_that_with_gap ✓ 1000 acc
blimp_wh_vs_that_with_gap_long_distance ✓ 1000 acc

Usage

Evaluate a task

Additional arguments can be provided to the model constructor using the --model_args flag. Most importantly, the gpt2 model can be used to load an arbitrary HuggingFace model as follows:

python main.py \
	--model gpt2 \
	--model_args pretrained=EleutherAI/gpt-neo-1.3B \
	--device 0 \
	--tasks lambada,hellaswag \
	--num_fewshot 2

To inspect what the LM inputs look like, you can run the following command:

python write_out.py \
	--tasks all_tasks \
	--num_fewshot 5 \
	--num_examples 10 \
	--output_base_path /path/to/output/folder

This will write out one text file for each task.

Test Set Decontamination

For more details see the decontamination guide.

The directory provided via the --decontamination_ngrams_path argument should contain the n-gram files and info.json. The guide above covers n-gram generation for the Pile; the process can be adapted for other training sets.

python main.py \
    --model gpt2 \
    --device 0 \
    --tasks sciq \
    --decontamination_ngrams_path path/containing/training/set/ngrams

Code Structure

There are two major components of the library:

  1. LMs (language models), e.g. GPT-2, GPT-3
  2. Tasks, e.g. MNLI, RTE, SQuAD (coming soon)

Both LMs (lm_eval.models) and Tasks (lm_eval.tasks) are kept in a registry data structure, for easy CLI instantiation.

If you want to extend either models or tasks, simply add a new LM or Task subclass, and decorate with the registry decorator.
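As a rough illustration only (the method names below follow the upstream harness's Task interface, and the register_task decorator is a hypothetical stand-in for whatever registration mechanism this repo actually uses; consult the new-task guide for the authoritative API), a new task might look something like this:

from lm_eval.base import Task  # assumed location of the Task base class

# @register_task("my_new_task")  # hypothetical registry decorator, as described above
class MyNewTask(Task):
    VERSION = 0  # see Task Versioning below

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def doc_to_text(self, doc):
        # Format one document into the prompt shown to the LM.
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # The gold continuation the prompt is scored against.
        return " " + doc["answer"]

Once registered, the task becomes addressable by name via the --tasks flag.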

The GPT-3 Evaluations Project tracks our progress implementing new tasks. Right now, we are focused on getting all the datasets loaded so that we can dedupe against the training data. Implementing the actual evaluations is nice to have but not necessary at the moment.

Task Versioning

To help improve reproducibility, all tasks have a VERSION field. When run from the command line, this is reported in a column of the results table, and in the "versions" field of the evaluator return dict. The version exists so that, if a task definition changes (e.g. to fix a bug), we know exactly which scores were computed with the old, buggy implementation and can avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of each task remains the same as when it was first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one.

When reporting eval harness results, please also report the version of each task. This can be done either with a separate column in the table, or by appending the version to the task name, e.g. taskname-v0.
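For instance, a small helper along the following lines can print versioned task names for a report. It assumes the evaluator's return dict has been saved as JSON to results.json (an illustrative path); only the results["versions"] mapping mentioned above is relied on:

import json

# Load a saved copy of the evaluator's return dict (the path is illustrative).
with open("results.json") as f:
    results = json.load(f)

# results["versions"] maps each task name to its VERSION field.
for task, version in sorted(results["versions"].items()):
    print(f"{task}-v{version}")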

Description

1. LM Evaluation

Given an LM, we want to evaluate it on a wide range of NLU tasks. We should at least cover the set of tasks in the GPT-3 paper, and any other tasks/benchmarks that are relevant. We will follow the GPT-3 format of a) zero-shot, b) one-shot, c) few-shot evaluation.

To do this, we need 3 components:

  • Data downloader (shared with later sections, potentially needs to be directly linked to the latter 2 components)
  • Task formatter
  • Task evaluator

The data downloader should download data for the relevant tasks.

  • We should heavily rely on Hugging Face's NLP for this. They are already doing most of the work with handling data scripts/caching.
  • Optionally, we can rely directly on HF-NLP's caching, but that makes it awkward to handle non-HF-NLP datasets. Otherwise, we can just write them out to .jsonl (see the sketch after this list). My feeling is that NLU data storage will be a drop in the bucket compared to LM data.
  • Where we're not using HF-NLP, we can keep the data in the raw format (.jsonl, tsv, etc) and let the other components handle transforming it.
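A minimal sketch of that download-and-dump approach, using today's datasets library (the successor to the nlp package referenced above); the task and output path are illustrative:

import json
from datasets import load_dataset  # successor to the old `nlp` package

# Download and cache a task's validation split via Hugging Face.
dataset = load_dataset("glue", "cola", split="validation")

# Write the raw examples out to .jsonl so non-HF components can consume them.
with open("cola_validation.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")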

The task formatter formats the task input data into an LM-usable format.

  • We should potentially support multiple formats for a given task, e.g. some formats may be better or worse suited for LM evaluation. See also: prompt-engineering
  • The task formatter should also support zero/one/few-shot packing of training examples into an input (see the sketch after this list). This may require careful interaction with the tokenizer to stay within the model's max-token limit.
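A minimal sketch of that kind of packing, assuming a standard Hugging Face tokenizer; the helper name and the 1024-token budget are illustrative, not the harness's actual implementation:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def pack_fewshot(train_examples, eval_prompt, max_tokens=1024):
    # Prepend as many formatted training examples as fit in the context window.
    packed = ""
    for example in train_examples:
        candidate = packed + example + "\n\n"
        if len(tokenizer.encode(candidate + eval_prompt)) > max_tokens:
            break
        packed = candidate
    return packed + eval_prompt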

The task evaluator scores a task.

  • In essence, we want to generate output predictions for all our input examples and feed them into some function that pops out a score (or scores). An alternative approach is to collect the output logits and score them against the expected set of outputs (see the sketch after this list).
  • Some tasks have weird evaluation schemes, so we should make this as general as possible.
  • Will thus likely have to be closely tied with the formatter.
  • Likewise, we should take advantage of HF-NLP's metrics. We might as well provide a sufficiently general model API so that the OpenAI API can be supported as well; this can double as an effort to reproduce the OpenAI NLU results.
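For the logit-scoring alternative mentioned above, a bare-bones sketch (illustrative only, not the harness's evaluator) compares the total log-likelihood the model assigns to each candidate continuation:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def continuation_logprob(context, continuation):
    # Total log-probability the model assigns to `continuation` given `context`.
    ctx_ids = tokenizer.encode(context)
    cont_ids = tokenizer.encode(continuation)
    input_ids = torch.tensor([ctx_ids + cont_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # Each continuation token is predicted from the position just before it.
    log_probs = torch.log_softmax(logits[0, len(ctx_ids) - 1 : -1], dim=-1)
    return sum(log_probs[i, tok].item() for i, tok in enumerate(cont_ids))

# Multiple-choice scoring: pick the candidate with the highest log-likelihood.
candidates = [" Paris", " London"]
best = max(candidates, key=lambda c: continuation_logprob("The capital of France is", c))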

2. Removing val/test data from LM training set

With the data downloader in place, we simply need to (1) expose the val/test examples, and (2) remove them from the training set.

  • Arguably, (2) should be handled by LM preprocessing in a more general way. There are probably non-NLU-eval cases where we want to remove some specific data from training.
  • Depending on how exactly we do the val/test removal, we may want to format the same example multiple ways to ensure that they don't get leaked into the training set in a slightly tweaked format.
  • Thought experiment: SQuAD is based largely on Wikipedia. What exactly would we want to remove from the LM?
  • [GPT-3]: In GPT-3, they attempted to remove val/test from their LM set, but there was a bug that caused leakage. So they ended up doing the opposite: removing overlaps from the LM set from the val/test. Funky.
  • [GPT-3]: See page 30 and Appendix C for details. They do some funky n-gram based search and removal. We should think about whether we want to follow their protocol exactly (a toy sketch of the n-gram idea follows this list).
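A toy sketch of that n-gram idea; GPT-3 used 13-grams, but everything below, including the n-gram length and the tokenization, is illustrative rather than a reproduction of their pipeline:

def ngrams(text, n=13):
    # Whitespace tokenization of lowercased text is a simplification of the GPT-3 procedure.
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(train_doc, eval_doc, n=13):
    # True if the training document shares any n-gram with the eval document.
    return bool(ngrams(train_doc, n) & ngrams(eval_doc, n))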

3. Adding task training data to LM training set

This part is the easiest. I guess we just write out some text files containing the training data? We can let the usual LM preprocessing pipeline handle it from there.
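In the spirit of that suggestion, a minimal sketch (the helper name and path are illustrative):

def write_training_text(docs, path):
    # Dump a task's training documents as plain text, one blank-line-separated
    # record per example, for the usual LM preprocessing pipeline to pick up.
    with open(path, "w") as f:
        for doc in docs:
            f.write(doc.strip() + "\n\n")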
