Releases: EleutherAI/lm-evaluation-harness
v0.4.5
lm-eval v0.4.5 Release Notes
New Additions
Prototype Support for Vision Language Models (VLMs)
We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, via the model types `hf-multimodal` and `vllm-vlm`. This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (`mmmu_val`) task, and we welcome contributions and feedback from the community!
New VLM-Specific Arguments
VLM models can be configured with several new arguments within `--model_args` to support their specific requirements:
- `max_images` (int): Set the maximum number of images for each prompt.
- `interleave` (bool): Determines the positioning of image inputs. When `True` (default), images are interleaved with the text. When `False`, all images are placed at the front of the text. This is model dependent.

`hf-multimodal`-specific args:
- `image_token_id` (int) or `image_string` (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an `"<image>"` string to indicate the location of images in the input, while Qwen2-VL models expect an `"<|image_pad|>"` sentinel string instead. This will be inferred from model configuration files whenever possible, but we recommend confirming whether an override is needed when testing a new model family.
- `convert_img_format` (bool): Whether to convert the images to RGB format.
Example usage:
- `lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image> --tasks mmmu_val --apply_chat_template`
- `lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template`
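For programmatic use, a minimal Python sketch of the same evaluation (assuming the v0.4.5 `simple_evaluate` API accepts the model and chat-template options shown in the CLI examples above):

```python
import lm_eval

# Sketch only: hf-multimodal evaluation on mmmu_val, mirroring the first CLI example above.
results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args="pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True,image_string=<image>",
    tasks=["mmmu_val"],
    apply_chat_template=True,
)
print(results["results"])  # per-task (and group) metrics
```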
Important considerations
- Chat Template: Most VLMs require the `--apply_chat_template` flag to ensure proper input formatting according to the model's expected chat template.
- Some VLM models are limited to processing a single image per prompt. For these models, always set `max_images=1`. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring `interleave=False`.
- Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.
Tested VLM Models
So far, we have most notably tested the implementation with the following models:
- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-v1.6-mistral-7b-hf
- Qwen/Qwen2-VL-2B-Instruct
- HuggingFaceM4/idefics2 (requires the latest `transformers` installed from source)
New Tasks
Several new tasks have been contributed to the library for this version!
New tasks as of v0.4.5 include:
- Open Arabic LLM Leaderboard tasks, contributed by @shahrzads @Malikeh97 in #2232
- MMMU (validation set), by @haileyschoelkopf @baberabb @lintangsutawika in #2243
- TurkishMMLU by @ArdaYueksel in #2283
- PortugueseBench, SpanishBench, GalicianBench, BasqueBench, and CatalanBench aggregate multilingual tasks in #2153 #2154 #2155 #2156 #2157 by @zxcvuser and others
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Backwards Incompatibilities
Finalizing the `group` versus `tag` split
We've now fully deprecated the use of `group` keys directly within a task's configuration file. The appropriate key to use for these cases is now solely `tag`. See the v0.4.4 patch notes for more info on migration if you have a set of task YAMLs maintained outside the Eval Harness repository.
Handling of Causal vs. Seq2seq backend in HFLM
In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, and the vast majority of current LMs) models previously hinged on a check for `self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM`. Some users may want to use causal model behavior, but set `self.AUTO_MODEL_CLASS` to a different factory class, such as `transformers.AutoModelForVision2Seq`.
As a result, those users who subclass HFLM but do not call `HFLM.__init__()` may now also need to set the `self.backend` attribute to either `"causal"` or `"seq2seq"` themselves during initialization.
While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see #2353 for the full set of changes.
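As a rough illustration (hypothetical subclass, not code from the library):

```python
from lm_eval.models.huggingface import HFLM

class MyCustomVLM(HFLM):
    """Hypothetical subclass that performs its own setup instead of calling HFLM.__init__()."""

    def __init__(self, pretrained, **kwargs):
        # ... custom initialization that deliberately skips HFLM.__init__() ...
        # must now be set explicitly so input handling takes the intended code path
        self.backend = "causal"  # or "seq2seq" for encoder-decoder models
```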
Future Plans
We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!
Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)
What's Changed
- Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) by @Malikeh97 in #2232
- Multimodal prototyping by @lintangsutawika in #2243
- Update README.md by @SYusupov in #2297
- remove comma by @baberabb in #2315
- Update neuron backend by @dacorvo in #2314
- Fixed dummy model by @Am1n3e in #2339
- Add a note for missing dependencies by @eldarkurtic in #2336
- squad v2: load metric with `evaluate` by @baberabb in #2351
- fix writeout script by @baberabb in #2350
- Treat tags in python tasks the same as yaml tasks by @giuliolovisotto in #2288
- change group to tags in task `eus_exams` task configs by @baberabb in #2320
- change glianorex to test split by @baberabb in #2332
- mmlu-pro: add newlines to task descriptions (not leaderboard) by @baberabb in #2334
- Added TurkishMMLU to LM Evaluation Harness by @ArdaYueksel in #2283
- add mmlu readme by @baberabb in #2282
- openai: better error messages; fix greedy matching by @baberabb in #2327
- fix some bugs of mmlu by @eyuansu62 in #2299
- Add new benchmark: Portuguese bench by @zxcvuser in #2156
- Fix missing key in custom task loading. by @giuliolovisotto in #2304
- Add new benchmark: Spanish bench by @zxcvuser in #2157
- Add new benchmark: Galician bench by @zxcvuser in #2155
- Add new benchmark: Basque bench by @zxcvuser in #2153
- Add new benchmark: Catalan bench by @zxcvuser in #2154
- fix tests by @baberabb in #2380
- Hotfix! by @baberabb in #2383
- Solution for CSAT-QA tasks evaluation by @KyujinHan in #2385
- LingOly - Fixing scoring bugs for smaller models by @am-bean in #2376
- Fix float limit override by @cjluo-omniml in #2325
- [API] tokenizer: add trust-remote-code by @baberabb in #2372
- HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS` by @baberabb in #2353
- max_images are passed on to vllm's `limit_mm_per_prompt` by @baberabb in #2387
- Fix Llava-1.5-hf; Update to version 0.4.5 by @haileyschoelkopf in #2388
- Bump version to v0.4.5 by @haileyschoelkopf in #2389
New Contributors
- @Malikeh97 made their first contribution in #2232
- @SYusupov made their first contribution in #2297
- @dacorvo made their first contribution in #2314
- @eldarkurtic made their first contribution in #2336
- @giuliolovisotto made their first contribution in #2288
- @ArdaYueksel made their first contribution in #2283
- @zxcvuser made their first contribution in #2156
- @KyujinHan made their first contribution in #2385
- @cjluo-omniml made their first contribution in #2325
Full Changelog: https://github.com/Eleu...
v0.4.4
lm-eval v0.4.4 Release Notes
New Additions
- This release includes the Open LLM Leaderboard 2 official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here.
- API support is overhauled! Now: support for concurrent requests, chat templates, tokenization, batching and improved customization. This makes API support both more generalizable to new providers and should dramatically speed up API model inference.
  - The URL can be specified by passing `base_url` to `--model_args`, for example, `base_url=http://localhost:8000/v1/completions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`.
  - Other arguments (such as top_p, top_k, etc.) can be passed to the API using `--gen_kwargs` as usual.
  - Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`).
    - They can also be used with `local-chat-completions` (e.g. with an OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g. multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits on prompt/input tokens, preventing easy measurement of multi-token continuations' log probabilities.
  - Example with the OpenAI completions API (using `vllm serve`): `lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`
  - Example with a chat API: `lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`
  - We recommend evaluating Llama-3.1-405B models by serving them with vllm and then running under `local-completions`! (A programmatic sketch of this API usage appears just after this list.)
- We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks. See the Backwards Incompatibilities section below for more information on changes and migration instructions.
- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for "naive" pipeline parallelism) inference using `--model hf` is now supported, thank you to @NathanHB and team!
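For programmatic (non-CLI) use of the overhauled API support, a rough sketch along the lines of the CLI examples above (this assumes an OpenAI-compatible server, such as vllm, is already running at the given `base_url`, and that `simple_evaluate` accepts the same options as the CLI flags):

```python
import lm_eval

# Sketch only: evaluate an already-served OpenAI-compatible endpoint via local-completions.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=meta-llama/Meta-Llama-3.1-8B-Instruct,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface"
    ),
    tasks=["gsm8k"],
    apply_chat_template=True,
)
print(results["results"])
```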
Other new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!
New Tasks
A number of new tasks have been contributed to the library.
As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.
New tasks as of v0.4.4 include:
- Open LLM Leaderboard 2 tasks--see above!
- Inverse Scaling tasks, contributed by @h-albert-lee in #1589
- Unitxt tasks reworked by @elronbandel in #1933
- MMLU-SR, contributed by @SkySuperCat in #2032
- IrokoBench, contributed by @JessicaOjo @IsraelAbebe in #2042
- MedConceptQA, contributed by @Ofir408 in #2010
- MMLU Pro, contributed by @ysjprojects in #1961
- GSM-Plus, contributed by @ysjprojects in #2103
- Lingoly, contributed by @am-bean in #2198
- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by @Cameron7195 in #2215 #2236
- TMLU, contributed by @adamlin120 in #2093
- Mela, contributed by @Geralt-Targaryen in #1970
Backwards Incompatibilities
`tag`s versus `group`s, and how to migrate
Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for "parent" tasks like `mmlu` to aggregate and report a unified score across a set of component "subtasks".
There were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:
```yaml
# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...
```
or 2) to define a "group config file" and specify a group along with its constituent subtasks:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
  # ...
```
These would both have the same effect of reporting an averaged metric for group_name1 when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.
We've now separated out these two use-cases ("shorthand" groupings and hierarchical subtask collections) into separate `tag` and `group` properties!
To register a shorthand (now called a `tag`), simply change the `group` field name within your task's config to `tag` (`group_alias` keys will no longer be supported in task configs):
```yaml
# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...
```
Group config files may remain as is if aggregation is not desired. To opt-in to reporting aggregated scores across a group's subtasks, add the following to your group config file:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
  # ...

### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True # whether to report *micro*- or *macro*-averaged scores across subtasks. Defaults to `True`.
```
Please see our documentation here for more information. We apologize for any headaches this migration may create--however, we believe separating out these two functionalities will make it less likely for users to encounter confusion or errors related to mistaken undesired aggregation.
Future Plans
We're planning to make more planning documents public and standardize on (likely) 1 new PyPI release per month! Stay tuned.
Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)
What's Changed
- fix wandb logger module import in example by @ToluClassics in #2041
- Fix strip whitespace filter by @NathanHB in #2048
- Gemma-2 also needs default `add_bos_token=True` by @haileyschoelkopf in #2049
- Update `trust_remote_code` for Hellaswag by @haileyschoelkopf in #2029
- Adds Open LLM Leaderboard Tasks by @NathanHB in #2047
- #1442 inverse scaling tasks implementation by @h-albert-lee in #1589
- Fix TypeError in samplers.py by converting int to str by @uni2237 in #2074
- Group agg rework by @lintangsutawika in #1741
- Fix printout tests (N/A expected for stderrs) by @haileyschoelkopf in #2080
- Easier unitxt tasks loading and removal of unitxt library dependency by @elronbandel in #1933
- Allow gating EvaluationTracker HF Hub results; customizability by @NathanHB in #2051
- Minor doc fix: leaderboard README.md missing mmlu-pro group and task by @pankajarm in #2075
- Revert missing utf-8 encoding for logged sample files (#2027) by @haileyschoelkopf in #2082
- Update utils.py by @lintangsutawika in #2085
- batch_size may be str if 'auto' is specified by @meg-huggingface in #2084
- Prettify lm_eval --tasks list by @anthony-dipofi in #1929
- Suppress noisy RougeScorer logs in `truthfulqa_gen` by @haileyschoelkopf in #2090
- Update default.yaml by @waneon in #2092
- Add new dataset MMLU-SR tasks by @SkySuperCat in #2032
- Irokobench: Benchmark Dataset for African languages by @JessicaOjo in #2042
- docs: remove trailing sentence from contribution doc by @nathan-weinberg in #2098
- Added MedConceptsQA Benchmark by @Ofir408 in #2010
- Also force BOS for `"recurrent_gemma"` and other Gemma model types by @haileyschoelkopf in #2105
- formatting by @lintangsutawika in #2104
- docs: align local test command to match CI by @nathan-weinberg in https://gith...
v0.4.3
lm-eval v0.4.3 Release Notes
We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.
New Additions
The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors!
You can now run using a chat template with `--apply_chat_template` and a system prompt of your choosing using `--system_instruction "my sysprompt here"`. The `--fewshot_as_multiturn` flag can control whether each few-shot example in context is a new conversational turn or not.
This feature is currently only supported for the model types `hf` and `vllm`, but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs.
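For Python API users, a hedged sketch of the same feature (assuming the `simple_evaluate` keyword arguments mirror the CLI flags; the model here is just an illustrative instruct-tuned checkpoint):

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained="Qwen/Qwen2-0.5B-Instruct")  # any HF model with a chat template
results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["gsm8k"],
    num_fewshot=5,
    apply_chat_template=True,    # format requests with the model's chat template
    fewshot_as_multiturn=True,   # each few-shot example becomes its own conversational turn
    system_instruction="You are a helpful assistant.",
)
```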
There's a lot more to check out, including:
- Logging results to the HF Hub if desired using `--hf_hub_log_args`, by @KonradSzafer and team!
- NeMo model support by @sergiopperez !
- Anthropic Chat API support by @tryuman !
- DeepSparse and SparseML model types by @mgoin !
- Handling of delta-weights in HF models, by @KonradSzafer !
- LoRA support for VLLM, by @bcicc !
- Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld !
- Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong !
- The use of custom `Sampler` subclasses in tasks, by @LSinev !
- The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier !
- Support for Ascend NPUs (`--device npu`) by @statelesshz, @zhabuye, @jiaqiw09 and others!
- Logging of `higher_is_better` in results tables for clearer understanding of eval metrics by @zafstojano !
- Extra info logged about models, including info about tokenizers, chat templating, and more, by @artemorloff @djstrong and others!
- Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
New Tasks
We had a number of new tasks contributed. A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`. We hope this makes it easier to locate the definitions of relevant tasks: first visit that page, then consult the README.md within the appropriate `lm_eval/tasks` subfolder for further info on each task it contains. Thank you to @anthonydipofi @Harryalways317 @nairbv @sepiatone and others for working on this and giving feedback!
Without further ado, the tasks:
- ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
- BasqueGlue and EusExams, two Basque-language tasks by @juletx
- TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
- XNLIeu, a Basque version of XNLI, by @juletx
- Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
- FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
- Added back the `hendrycks_math` task: the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper rather than Minerva's prompt and parsing
- COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
- tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
- Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
- New FLD (formal logic) task variants by @MorishT
- Improved translations of Lambada Multilingual tasks, added by @zafstojano
- NoticIA, a Spanish summarization dataset by @ikergarcia1996
- The Paloma perplexity benchmark, added by @zafstojano
- We've removed the AMMLU dataset due to concerns about auto-translation quality.
- Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !
- BertaQA, a Basque cultural knowledge benchmark, by @juletx
- New machine-translated ARC-C datasets by @jonabur !
- CommonsenseQA, in a prompt format following Llama, by @murphybrendan
- ...
Backwards Incompatibilities
The save format for logged results has now changed.
Output files will now be written to `{output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json` if `--output_path` is set, and `{output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl` for each task's samples if `--log_samples` is set.
e.g. `outputs/gpt2/results_2024-06-28T00-00-00.00001.json` and `outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl`.
See #1926 for utilities which may help to work with these new filenames.
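For scripting against the new layout, a small hypothetical helper (not one of the utilities from #1926) might look like:

```python
from pathlib import Path

def latest_results_file(output_path: str, sanitized_model_name: str) -> Path:
    """Return the newest results_*.json for a model; the timestamp format sorts lexicographically."""
    candidates = sorted(Path(output_path, sanitized_model_name).glob("results_*.json"))
    if not candidates:
        raise FileNotFoundError(f"no results files under {output_path}/{sanitized_model_name}")
    return candidates[-1]

# e.g. latest_results_file("outputs", "gpt2") -> outputs/gpt2/results_2024-06-28T00-00-00.00001.json
```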
Future Plans
In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!
- The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch and subsequently in `v0.4.4` on PyPI!
- The fact that `group`s of tasks by default attempt to report an aggregated score across constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between `group`s of tasks that do report aggregate scores (think `mmlu`) versus `tag`s, which are simply a convenient shortcut to call a bunch of tasks one might want to run at once (think the `pythia` grouping, which merely represents a collection of tasks one might want to gather results on all at once but where averaging doesn't make sense).
- We'd also like to improve the API model support in the Eval Harness from its current state.
- More to come!
Thank you to everyone who's contributed to or used the library!
Thanks, @haileyschoelkopf @lintangsutawika
What's Changed
- use BOS token in loglikelihood by @djstrong in #1588
- Revert "Patch for Seq2Seq Model predictions" by @haileyschoelkopf in #1601
- fix gen_kwargs arg reading by @artemorloff in #1607
- fix until arg processing by @artemorloff in #1608
- Fixes to Loglikelihood prefix token / VLLM by @haileyschoelkopf in #1611
- Add ACLUE task by @haonan-li in #1614
- OpenAI Completions -- fix passing of unexpected 'until' arg by @haileyschoelkopf in #1612
- add logging of model args by @baberabb in #1619
- Add vLLM FAQs to README (#1625) by @haileyschoelkopf in #1633
- peft Version Assertion by @LameloBally in #1635
- Seq2seq fix by @lintangsutawika in #1604
- Integration of NeMo models into LM Evaluation Harness library by @sergiopperez in #1598
- Fix conditional import for Nemo LM class by @haileyschoelkopf in #1641
- Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring by @orsharir in #1647
- Add Latxa paper evaluation tasks for Basque by @juletx in #1654
- Fix CLI --batch_size arg for openai-completions/local-completions by @mgoin in #1656
- Patch QQP prompt (#1648 ) by @haileyschoelkopf in #1661
- TMMLU+ implementation by @ZoneTwelve in #1394
- Anthropic Chat API by @tryumanshow in #1594
- correction bug #1664 by @nicho2 in #1670
- Signpost potential bugs / unsupported ops in MPS backend by @haileyschoelkopf in #1680
- Add delta weights model loading by @KonradSzafer in #1712
- Add `neuralmagic` models for `sparseml` and `deepsparse` by @mgoin in #1674
- Improvements to run NVIDIA NeMo models on LM Evaluation Harness by @sergiopperez in #1699
- Adding retries and rate limit to toxicity tasks by @sator-labs in #1620
- reference `--tasks list` in README by @nairbv in #1726
- Add XNLIeu: a dataset for cross-lingual NLI in Basque by @juletx in #1694
- Fix Parameter Propagation for Tasks that have `include` by @lintangsutawika in #1749
- Support individual scrolls datasets by @giorgossideris in #1740
- Add filter registry decorator by @lozhn in #1750
- remove duplicated `num_fewshot: 0` by @chujiezheng in #1769
- Pile 10k new task by @mukobi in #1758
- Fix m_arc choices by @jordane95 in #1760
- upload new tasks by @simran-arora in https://github.com/EleutherAI/lm-eva...
v0.4.2
lm-eval v0.4.2 Release Notes
We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), enabling controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!
New Additions
- Request Caching by @inf3rnus - speedups on startup via caching the construction of documents/requests’ contexts
- Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!
- New Tasks
- KMMLU, a localized - not (auto) translated! - dataset for testing Korean knowledge by @h-albert-lee @guijinSON
- GPQA by @uanu2002
- French Bench by @ManuelFay
- EQ-Bench by @pbevan1 and @sqrkl
- HAERAE-Bench, readded by @h-albert-lee
- Updates to answer parsing on many generative tasks (GSM8k, MGSM, BBH zeroshot) by @thinknbtfly!
- Okapi (translated) Open LLM Leaderboard tasks by @uanu2002 and @giux78
- Arabic MMLU and aEXAMS by @khalil-Hennara
- And more!
- Re-introduction of `TemplateLM` base class for lower-code new LM class implementations by @anjor
- Run the library with the metrics/scoring stage skipped via `--predict_only` by @baberabb
- Many more miscellaneous improvements by a lot of great contributors!
Backwards Incompatibilities
There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:
`TaskManager` API
Previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs.
Old usage:
```python
import lm_eval

lm_eval.tasks.initialize_tasks()
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])
```
New intended usage:
```python
import lm_eval
from lm_eval.tasks import TaskManager

# optional--only need to instantiate separately if you want to pass a custom path!
task_manager = TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)
```
`get_task_dict()` now also optionally takes a TaskManager object, for when you want to load custom tasks.
This should allow for much faster library startup times due to lazily loading requested tasks or groups.
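For example, a minimal sketch of loading tasks lazily via the new interface (the custom path and second task name here are placeholders):

```python
from lm_eval.tasks import TaskManager, get_task_dict

task_manager = TaskManager(include_path="/path/to/my/custom/tasks")  # placeholder path
# Only the requested tasks/groups are loaded, which keeps startup fast.
task_dict = get_task_dict(["arc_easy", "my_custom_task"], task_manager=task_manager)  # "my_custom_task" is a placeholder
```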
Updated Stderr Aggregation
Previous versions of the library reported erroneously large `stderr` scores for groups of tasks such as MMLU.
We've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see #1390 #1427 for more information.
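For reference, the updated aggregation follows the standard pooled-variance approach (see #1390 and #1427 for the exact implementation used). With $k$ subtasks of sizes $n_i$, reported subtask standard errors $\mathrm{SE}_i$ (so sample variances $s_i^2 = n_i\,\mathrm{SE}_i^2$), and $N = \sum_i n_i$:

$$
s_{\mathrm{pooled}}^2 = \frac{\sum_{i=1}^{k} (n_i - 1)\, s_i^2}{N - k},
\qquad
\mathrm{SE}_{\mathrm{group}} \approx \sqrt{\frac{s_{\mathrm{pooled}}^2}{N}}.
$$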
As always, please feel free to give us feedback or request new features! We're grateful for the community's support.
What's Changed
- Add support for RWKV models with World tokenizer by @PicoCreator in #1374
- add bypass metric by @baberabb in #1156
- Expand docs, update CITATION.bib by @haileyschoelkopf in #1227
- Hf: minor egde cases by @baberabb in #1380
- Enable override of printed `n-shot` in table by @haileyschoelkopf in #1379
- Faster Task and Group Loading, Allow Recursive Groups by @lintangsutawika in #1321
- Fix for #1383 by @pminervini in #1384
- fix on --task list by @lintangsutawika in #1387
- Support for Inf2 optimum class [WIP] by @michaelfeil in #1364
- Update README.md by @mycoalchen in #1398
- Fix confusing `write_out.py` instructions in README by @haileyschoelkopf in #1371
- Use Pooled rather than Combined Variance for calculating stderr of task groupings by @haileyschoelkopf in #1390
- adding hf_transfer by @michaelfeil in #1400
- `batch_size` with `auto` defaults to 1 if `No executable batch size found` is raised by @pminervini in #1405
- Fix printing bug in #1390 by @haileyschoelkopf in #1414
- Fixes #1416 by @pminervini in #1418
- Fix watchdog timeout by @JeevanBhoot in #1404
- Evaluate by @baberabb in #1385
- Add multilingual ARC task by @uanu2002 in #1419
- Add multilingual TruthfulQA task by @uanu2002 in #1420
- [m_mmul] added multilingual evaluation from alexandrainst/m_mmlu by @giux78 in #1358
- Added seeds to `evaluator.simple_evaluate` signature by @Am1n3e in #1412
- Fix: task weighting by subtask size; update Pooled Stderr formula slightly by @haileyschoelkopf in #1427
- Refactor utilities into a separate model utils file. by @baberabb in #1429
- Nit fix: Updated OpenBookQA Readme by @adavidho in #1430
- improve hf_transfer activation by @michaelfeil in #1438
- Correct typo in task name in ARC documentation by @larekrow in #1443
- update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) by @thnkinbtfly in #1356
- Add a new task HaeRae-Bench by @h-albert-lee in #1445
- Group reqs by context by @baberabb in #1425
- Add a new task GPQA (the part without CoT) by @uanu2002 in #1434
- Added KMMLU evaluation method and changed ReadMe by @h-albert-lee in #1447
- Add TemplateLM boilerplate LM class by @anjor in #1279
- Log which subtasks were called with which groups by @haileyschoelkopf in #1456
- PR fixing the issue #1391 (wrong contexts in the mgsm task) by @leocnj in #1440
- feat: Add Weights and Biases support by @ayulockin in #1339
- Fixed generation args issue affection OpenAI completion model by @Am1n3e in #1458
- update parsing logic of mgsm following gsm8k (mgsm en 0 -> 50%) by @thnkinbtfly in #1462
- Adding documentation for Weights and Biases CLI interface by @veekaybee in #1466
- Add environment and transformers version logging in results dump by @LSinev in #1464
- Apply code autoformatting with Ruff to tasks/*.py an *init.py by @LSinev in #1469
- Setting trust_remote_code to `True` for HuggingFace datasets compatibility by @veekaybee in #1467
- add arabic mmlu by @khalil-Hennara in #1402
- Add Gemma support (Add flag to control BOS token usage) by @haileyschoelkopf in #1465
- Revert "Setting trust_remote_code to
True
for HuggingFace datasets compatibility" by @haileyschoelkopf in #1474 - Create a means for caching task registration and request building. Ad… by @inf3rnus in #1372
- Cont metrics by @lintangsutawika in #1475
- Refactor `evaluater.evaluate` by @baberabb in #1441
- add multilingual mmlu eval by @jordane95 in #1484
- Update TruthfulQA val split name by @haileyschoelkopf in #1488
- Fix AttributeError in huggingface.py When 'model_type' is Missing by @richwardle in #1489
- Fix duplicated kwargs in some model init by @lchu-ibm in #1495
- Add multilingual truthfulqa targets by @jordane95 in #1499
- Always include EOS token as stop sequence by @haileyschoelkopf in...
v0.4.1
Release Notes
This release contains all changes so far since the release of v0.4.0, and is partially a test of our release automation, provided by @anjor.
At a high level, some of the changes include:
- Data-parallel inference using vLLM (contributed by @baberabb )
- A major fix to Huggingface model generation--previously, in v0.4.0, due to a bug with stop sequence handling, generations were sometimes cut off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to @veekaybee @mgoin @anjor and others on this)!
- Integration with tools for visualization of results, such as with Zeno, and WandB coming soon!
More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!
We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.
In the next version release, we hope to include
- Chat Templating + System Prompt support, for locally-run models
- Improved Answer Extraction for many generative tasks, making them more easily run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times / faster non-inference processing steps especially when num_fewshot is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, for easier registration of many tasks and configuration of new groups of tasks
What's Changed
- Announce v0.4.0 in README by @haileyschoelkopf in #1061
- remove commented planned samplers in `lm_eval/api/samplers.py` by @haileyschoelkopf in #1062
- Confirming links in docs work (WIP) by @haileyschoelkopf in #1065
- Set actual version to v0.4.0 by @haileyschoelkopf in #1064
- Updating docs hyperlinks by @haileyschoelkopf in #1066
- Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in #1063
- Update _cot_fewshot_template_yaml by @lintangsutawika in #1074
- Patch scrolls by @lintangsutawika in #1077
- Update template of qqp dataset by @shiweijiezero in #1097
- Change the sub-task name from sst to sst2 in glue by @shiweijiezero in #1099
- Add kmmlu evaluation to tasks by @h-albert-lee in #1089
- Fix stderr by @lintangsutawika in #1106
- Simplified `evaluator.py` by @lintangsutawika in #1104
- [Refactor] vllm data parallel by @baberabb in #1035
- Unpack group in `write_out` by @baberabb in #1113
- Revert "Simplified `evaluator.py`" by @lintangsutawika in #1116
- `qqp`, `mnli_mismatch`: remove unlabeled test sets by @baberabb in #1114
- fix: bug of BBH_cot_fewshot by @Momo-Tori in #1118
- Bump BBH version by @haileyschoelkopf in #1120
- Refactor `hf` modeling code by @haileyschoelkopf in #1096
- Additional process for doc_to_choice by @lintangsutawika in #1093
- doc_to_decontamination_query can use function by @lintangsutawika in #1082
- Fix vllm `batch_size` type by @xTayEx in #1128
- fix: passing max_length to vllm engine args by @NanoCode012 in #1124
- Fix Loading Local Dataset by @lintangsutawika in #1127
- place model onto `mps` by @baberabb in #1133
- Add benchmark FLD by @MorishT in #1122
- fix typo in README.md by @lennijusten in #1136
- add correct openai api key to README.md by @lennijusten in #1138
- Update Linter CI Job by @haileyschoelkopf in #1130
- add utils.clear_torch_cache() to model_comparator by @baberabb in #1142
- Enabling OpenAI completions via gooseai by @veekaybee in #1141
- vllm clean up tqdm by @baberabb in #1144
- openai nits by @baberabb in #1139
- Add IFEval / Instruction-Following Eval by @wiskojo in #1087
- set `--gen_kwargs` arg to None by @baberabb in #1145
- Add shorthand flags by @baberabb in #1149
- fld bugfix by @baberabb in #1150
- Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in #1154
- Add docs on adding a multiple choice metric by @polm-stability in #1147
- Simplify evaluator by @lintangsutawika in #1126
- Generalize Qwen tokenizer fix by @haileyschoelkopf in #1146
- self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in #1172
- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in #1171
- feat: add option to upload results to Zeno by @Sparkier in #990
- Switch Linting to `ruff` by @baberabb in #1166
- Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in #1178
- Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in #1174
- Update README.md by @anjor in #1184
- Update README.md by @anjor in #1183
- Add tokenizer backend by @anjor in #1186
- Correctly Print Task Versioning by @haileyschoelkopf in #1173
- update Zeno example and reference in README by @Sparkier in #1190
- Remove tokenizer for openai chat completions by @anjor in #1191
- Update README.md by @anjor in #1181
- disable `mypy` by @baberabb in #1193
- Generic decorator for handling rate limit errors by @zachschillaci27 in #1109
- Refer in README to main branch by @BramVanroy in #1200
- Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in #1189
- Upstream Mamba Support (`mamba_ssm`) by @haileyschoelkopf in #1110
- Update cuda handling by @anjor in #1180
- Fix documentation in API table by @haileyschoelkopf in #1203
- Consolidate batching by @baberabb in #1197
- Add remove_whitespace to FLD benchmark by @MorishT in #1206
- Fix the argument order in `utils.divide` doc by @xTayEx in #1208
- [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in #1212
- fix unbounded local variable by @onnoo in #1218
- nits + fix siqa by @baberabb in #1216
- add length of strings and answer options to Zeno met...
v0.4.0
What's Changed
- Replace stale `triviaqa` dataset link by @jon-tow in #364
- Update `actions/setup-python` in CI workflows by @jon-tow in #365
- Bump `triviaqa` version by @jon-tow in #366
- Update `lambada_openai` multilingual data source by @jon-tow in #370
- Update Pile Test/Val Download URLs by @fattorib in #373
- Added ToxiGen task by @Thartvigsen in #377
- Added CrowSPairs by @aflah02 in #379
- Add accuracy metric to crows-pairs by @haileyschoelkopf in #380
- hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in #384
- Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by @sxjscience in #390
- Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in #381
- Hosting arithmetic dataset on HuggingFace by @fattorib in #391
- Hosting wikitext on HuggingFace by @fattorib in #396
- Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in #403
- Update README installation instructions by @haileyschoelkopf in #407
- feat: evaluation using peft models with CLM by @zanussbaum in #414
- Update setup.py dependencies by @ret2libc in #416
- fix: add seq2seq peft by @zanussbaum in #418
- Add support for load_in_8bit and trust_remote_code model params by @philwee in #422
- Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in #427
- Continuing work on refactor [WIP] by @haileyschoelkopf in #425
- Document task name wildcard support in README by @haileyschoelkopf in #435
- Add non-programmatic BIG-bench-hard tasks by @yurodiviy in #406
- Updated handling for device in lm_eval/models/gpt2.py by @nikhilpinnaparaju in #447
- [WIP, Refactor] Staging more changes by @haileyschoelkopf in #465
- [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in #467
- Configurable-Tasks by @lintangsutawika in #438
- single GPU automatic batching logic by @fattorib in #394
- Fix bugs introduced in #394 #406 and max length bug by @juletx in #472
- Sort task names to keep the same order always by @juletx in #474
- Set PAD token to EOS token by @nikhilpinnaparaju in #448
- [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in #486
- fix adaptive batch crash when there are no new requests by @jquesnelle in #490
- Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in #426
- Create output path directory if necessary by @janEbert in #483
- Add results of various models in json and md format by @juletx in #477
- Update config by @lintangsutawika in #501
- P3 prompt task by @lintangsutawika in #493
- Evaluation Against Portion of Benchmark Data by @kenhktsui in #480
- Add option to dump prompts and completions to a JSON file by @juletx in #492
- Add perplexity task on arbitrary JSON data by @janEbert in #481
- Update config by @lintangsutawika in #520
- Data Parallelism by @fattorib in #488
- Fix mgpt fewshot by @lintangsutawika in #522
- Extend `dtype` command line flag to `HFLM` by @haileyschoelkopf in #523
- Add support for loading GPTQ models via AutoGPTQ by @gakada in #519
- Change type signature of `quantized` and its default value for python < 3.11 compatibility by @passaglia in #532
- Fix LLaMA tokenization issue by @gakada in #531
- [Refactor] Make promptsource an extra / not required for installation by @haileyschoelkopf in #542
- Move spaces from context to continuation by @gakada in #546
- Use max_length in AutoSeq2SeqLM by @gakada in #551
- Fix typo by @kwikiel in #557
- Add load_in_4bit and fix peft loading by @gakada in #556
- Update task_guide.md by @haileyschoelkopf in #564
- [Refactor] Non-greedy generation ; WIP GSM8k yaml by @haileyschoelkopf in #559
- Dataset metric log [WIP] by @lintangsutawika in #560
- Add Anthropic support by @zphang in #562
- Add MultipleChoiceExactTask by @gakada in #537
- Revert "Add MultipleChoiceExactTask" by @StellaAthena in #568
- [Refactor] [WIP] New YAML advanced docs by @haileyschoelkopf in #567
- Remove the registration of "GPT2" as a model type by @StellaAthena in #574
- [Refactor] Docs update by @haileyschoelkopf in #577
- Better docs by @lintangsutawika in #576
- Update evaluator.py cache_db argument str if model is not str by @poedator in #575
- Add --max_batch_size and --batch_size auto:N by @gakada in #572
- [Refactor] ALL_TASKS now maintained (not static) by @haileyschoelkopf in #581
- Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by @haileyschoelkopf in #582
- Fix non-callable attributes in CachingLM by @gakada in #584
- Add error handling for calling `.to(device)` by @haileyschoelkopf in #585
- fixes some minor issues on tasks. by @lintangsutawika in #580
- Add - 4bit-related args by @SONG-WONHO in #579
- Fix triviaqa task by @seopbo in #525
- [Refactor] Addressing Feedback on new docs pages by @haileyschoelkopf in #578
- Logging Samples by @farzanehnakhaee70 in #563
- Merge master into big-refactor by @gakada in #590
- [Refactor] Package YAMLs alongside pip installations of lm-eval by @haileyschoelkopf in #596
- fixes for multiple_choice by @lintangsutawika in #598
- add openbookqa config by @farzanehnakhaee70 in #600
- [Refactor] Model guide docs by @haileyschoelkopf in #606
- [Refactor] More MCQA fixes by @haileyschoelkopf in #599
- [Refactor] Hellaswag by @nopperl in #608
- [Refactor] Seq2Seq Models with Multi-Device Support ...
v0.3.0
HuggingFace Datasets Integration
This release integrates HuggingFace `datasets` as the core dataset management interface, removing previous custom downloaders.
What's Changed
- Refactor `Task` downloading to use `HuggingFace.datasets` by @jon-tow in #300
- Add templates and update docs by @jon-tow in #308
- Add dataset features to `TriviaQA` by @jon-tow in #305
- Add `SWAG` by @jon-tow in #306
- Fixes for using lm_eval as a library by @dirkgr in #309
- Researcher2 by @researcher2 in #261
- Suggested updates for the task guide by @StephenHogg in #301
- Add pre-commit by @Mistobaan in #317
- Decontam import fix by @jon-tow in #321
- Add bootstrap_iters kwarg by @Muennighoff in #322
- Update decontamination.md by @researcher2 in #331
- Fix key access in squad evaluation metrics by @konstantinschulz in #333
- Fix make_disjoint_window for tail case by @richhankins in #336
- Manually concat tokenizer revision with subfolder by @jon-tow in #343
- [deps] Use minimum versioning for `numexpr` by @jon-tow in #352
- Remove custom datasets that are in HF by @jon-tow in #330
- Add `TextSynth` API by @jon-tow in #299
- Add the original `LAMBADA` dataset by @jon-tow in #357
New Contributors
- @dirkgr made their first contribution in #309
- @Mistobaan made their first contribution in #317
- @konstantinschulz made their first contribution in #333
- @richhankins made their first contribution in #336
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Major changes since 0.1.0:
- added blimp (#237)
- added qasper (#264)
- added asdiv (#244)
- added truthfulqa (#219)
- added gsm (#260)
- implemented description dict and deprecated provide_description (#226)
- new `--check_integrity` flag to run integrity unit tests at eval time (#290)
- positional arguments to `evaluate` and `simple_evaluate` are now deprecated
- `_CITATION` attribute on task modules (#292)
- lots of bug fixes and task fixes (always remember to report task versions for comparability!)