Merge branch 'EleutherAI:main' into basque_bench
zxcvuser authored Sep 27, 2024
2 parents fd56883 + 1bc6c93 commit f1b7573
Showing 831 changed files with 14,499 additions and 1,509 deletions.
2 changes: 1 addition & 1 deletion CODEOWNERS
@@ -1 +1 @@
-* @haileyschoelkopf @lintangsutawika
+* @haileyschoelkopf @lintangsutawika @baberabb
33 changes: 23 additions & 10 deletions README.md
@@ -6,6 +6,7 @@

*Latest News 📣*

+- [2024/09] We are prototyping support for text+image multimodal inputs with text outputs in LM Evaluation Harness, and have just added the `hf-multimodal` and `vllm-vlm` model types and the `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it, and suggest they also check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forked from the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**
- [2024/07] New Open LLM Leaderboard tasks have been added! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.
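
As a concrete sketch of the Llama 405B recommendation in the [2024/07] note above (model name, endpoint, parallelism degree, and concurrency are illustrative, not prescriptive):

```bash
# Illustrative only: serve the model behind vLLM's OpenAI-compatible API...
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --port 8000

# ...then evaluate it through the local-completions model type
lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=meta-llama/Llama-3.1-405B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=8
```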

@@ -53,7 +54,7 @@ The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's pop
To install the `lm-eval` package from the github repository, run:

```bash
-git clone https://github.com/EleutherAI/lm-evaluation-harness
+git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```
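
Once installed, a quick smoke test is to list the registered tasks (a sketch; `--tasks list` prints every available task name and exits):

```bash
lm_eval --tasks list
```
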
@@ -106,7 +107,7 @@ lm_eval --model hf \
#### Multi-GPU Evaluation with Hugging Face `accelerate`

-We support two main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.
+We support three main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.

To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:
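
A representative invocation is sketched below (the task list and batch size are illustrative):

```bash
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```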

@@ -140,7 +141,19 @@ For more advanced users or even larger models, we allow for the following argume
- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.

-These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
+The third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.
+
+```
+accelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \
+    -m lm_eval --model hf \
+    --tasks lambada_openai,arc_easy \
+    --model_args parallelize=True \
+    --batch_size 16
+```
+
+To learn more about model parallelism and how to use it with the `accelerate` library, see the [accelerate documentation](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).

-**Warning: We do not natively support multi-node evaluation using the `hf` model type! Please reference [our GPT-NeoX library integration](https://github.com/EleutherAI/gpt-neox/blob/main/eval.py) for an example of code in which a custom multi-machine evaluation script is written.**
+**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**

@@ -228,9 +241,9 @@ lm_eval --model openai-completions \
We also support using your own local inference server with servers that mirror the OpenAI Completions and ChatCompletions APIs.

```bash
-lm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1,num_concurrent=1,max_retries=3,tokenized_requests=False
+lm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16
```
-Note that for externally hosted models, configs such as `--device` and `--batch_size` should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.
+Note that for externally hosted models, configs such as `--device`, which relate to where to place a local model, should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.

| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|---------------------------------------------------------------------------------------------------------------------------|---------------------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|
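
For chat-style endpoints, the analogous `local-chat-completions` model type follows the same pattern; a hedged sketch, assuming your server also exposes `/v1/chat/completions` (model and URL are illustrative, and chat endpoints need a chat template applied):

```bash
lm_eval --model local-chat-completions \
    --tasks gsm8k \
    --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=1 \
    --apply_chat_template
```
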
@@ -475,11 +488,11 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {A framework for few-shot language model evaluation},
-month = 12,
-year = 2023,
+month = 07,
+year = 2024,
publisher = {Zenodo},
-version = {v0.4.0},
-doi = {10.5281/zenodo.10256836},
-url = {https://zenodo.org/records/10256836}
+version = {v0.4.3},
+doi = {10.5281/zenodo.12608602},
+url = {https://zenodo.org/records/12608602}
}
```
6 changes: 5 additions & 1 deletion docs/interface.md
@@ -46,7 +46,11 @@ This mode supports a number of command-line arguments, the details of which can

- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.

-- `--apply_chat_template` : If this flag is on, a chat template will be applied to the prompt. For Hugging Face models, the chat template is taken from the tokenizer, if the tokenizer does not have a chat template, a default one will be applied. For other models, chat templating is not currently implemented.
+- `--apply_chat_template` : This flag specifies whether to apply a chat template to the prompt. It can be used in the following ways:
+    - `--apply_chat_template` : When used without an argument, applies the only available chat template to the prompt. For Hugging Face models, if no dedicated chat template exists, the default chat template will be applied.
+    - `--apply_chat_template template_name` : If the model has multiple chat templates, apply the specified template to the prompt.
+
+    For Hugging Face models, the default chat template can be found in the [`default_chat_template`](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912) property of the Transformers Tokenizer.

- `--fewshot_as_multiturn` : If this flag is on, the fewshot examples are treated as a multi-turn conversation. Questions are provided as user content and answers are provided as assistant responses. Requires `--num_fewshot` to be greater than 0, and `--apply_chat_template` to be on.
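
For example, to run GSM8K with four fewshot examples rendered as a multi-turn chat (the model choice is illustrative; any model with a chat template works):

```bash
lm_eval --model hf \
    --model_args pretrained=HuggingFaceH4/zephyr-7b-beta \
    --tasks gsm8k \
    --num_fewshot 4 \
    --apply_chat_template \
    --fewshot_as_multiturn
```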

46 changes: 37 additions & 9 deletions docs/model_guide.md
@@ -118,17 +118,45 @@ class MyCustomLM(LM):
    #...
    @property
    def tokenizer_name(self) -> str:
-        # should return a string denoting the name of the model's tokenizer and/or the accompanying chat template.
-
-    @property
-    def chat_template(self) -> str:
-        # should return a chat template formatting string that is used to build prompt from a user/assistant chat history.
-        # this will be saved in the evaluation results for reproducibility.
+        """
+        Return the name of the model's tokenizer and/or the accompanying chat template.
+        The returned string is used to cache requests.
+
+        Returns:
+            str: The name of the model's tokenizer and/or chat template.
+        """
+
+    def chat_template(self, chat_template: Union[bool, str] = False) -> str:
+        """
+        Get the appropriate chat template for the model based on the `chat_template` argument.
+
+        This method returns the chat template string to build the prompt from a chat history.
+        The chat template is saved in the evaluation results for reproducibility.
+        Boolean arguments should be used with models that have only one chat template,
+        while string arguments are used with models that have multiple chat templates.
+        For the reference implementation, see HFLM class in `lm_eval.models.huggingface`.
+
+        Args:
+            chat_template (Union[bool, str]): Specifies whether to apply a chat template:
+                - If False: Do not apply any chat template.
+                - If True: Apply the default chat template.
+                - If str: Apply the specified chat template by name.
+
+        Returns:
+            str: The selected chat template in Jinja format.
+        """

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
-        # responsible for taking as input a chat history that would be fed into the model, and
-        # rendering it as a string that can be then tokenized and input into the model.
-        #...
+        """
+        Process a chat history to create a string that can be tokenized and input into the model.
+
+        Args:
+            chat_history (List[Dict[str, str]]): A list of dictionaries representing the chat history,
+                where each dictionary has "role" and "content" keys.
+
+        Returns:
+            str: A string representing the chat history that can be tokenized and fed into the model.
+        """
```

- `apply_chat_template`
2 changes: 1 addition & 1 deletion examples/lm-eval-overview.ipynb
@@ -210,7 +210,7 @@
],
"source": [
"# Install LM-Eval\n",
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor"
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
]
},
{
20 changes: 10 additions & 10 deletions lm_eval/__main__.py
@@ -170,9 +170,16 @@ def setup_parser() -> argparse.ArgumentParser:
    )
    parser.add_argument(
        "--apply_chat_template",
-        action="store_true",
+        type=str,
+        nargs="?",
+        const=True,
        default=False,
-        help="If True, applies the chat template to the prompt",
+        help=(
+            "If True, apply chat template to the prompt. "
+            "Providing `--apply_chat_template` without an argument will apply the default chat template to the prompt. "
+            "To apply a specific template from the available list of templates, provide the template name as an argument. "
+            "E.g. `--apply_chat_template template_name`"
+        ),
    )
    parser.add_argument(
        "--fewshot_as_multiturn",
@@ -289,14 +296,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:

    if args.fewshot_as_multiturn and args.apply_chat_template is False:
        raise ValueError(
-            "If fewshot_as_multiturn is set, apply_chat_template must be set to True."
-        )
-
-    if (
-        args.num_fewshot is None or args.num_fewshot == 0
-    ) and args.fewshot_as_multiturn:
-        raise ValueError(
-            "If fewshot_as_multiturn is set, num_fewshot must be greater than 0."
+            "When `fewshot_as_multiturn` is selected, `apply_chat_template` must be set (either to `True` or to the chosen template name)."
        )

    if args.include_path is not None:
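
The reworked `--apply_chat_template` flag above accepts three call patterns. A sketch of each (models and tasks are illustrative; the named template assumes a tokenizer that ships several templates, as some models do):

```bash
# 1. Flag absent: no chat template is applied
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks lambada_openai

# 2. Bare flag: apply the default (or only) chat template
lm_eval --model hf --model_args pretrained=HuggingFaceH4/zephyr-7b-beta --tasks gsm8k --apply_chat_template

# 3. Named template: apply a specific template from the tokenizer's set
lm_eval --model hf --model_args pretrained=CohereForAI/c4ai-command-r-v01 --tasks gsm8k --apply_chat_template rag
```
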
4 changes: 2 additions & 2 deletions lm_eval/api/group.py
@@ -13,9 +13,9 @@ class AggMetricConfig(dict):
    filter_list: Optional[Union[str, list]] = "none"

    def __post_init__(self):
-        if self.aggregation != "mean":
+        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(
-                f"Currently, only 'mean' is supported for automatically aggregating scores across groups' subtasks. Got '{self.aggregation}'."
+                f"Currently, 'mean' is the only pre-defined aggregation across groups' subtasks. Got '{self.aggregation}'."
            )

        if isinstance(self.filter_list, str):
10 changes: 6 additions & 4 deletions lm_eval/api/metrics.py
@@ -8,7 +8,6 @@

import numpy as np
import sacrebleu
-import sklearn.metrics

from lm_eval.api.registry import register_aggregation, register_metric

@@ -51,21 +50,24 @@ def bits_per_byte(items):

@register_aggregation("f1")
def f1_score(items):
from sklearn.metrics import f1_score

unzipped_list = list(zip(*items))
golds = unzipped_list[0]
preds = unzipped_list[1]
fscore = sklearn.metrics.f1_score(golds, preds)
fscore = f1_score(golds, preds)

return np.max(fscore)


@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
from sklearn.metrics import matthews_corrcoef

unzipped_list = list(zip(*items))
golds = unzipped_list[0]
preds = unzipped_list[1]
# print(preds)
return sklearn.metrics.matthews_corrcoef(golds, preds)
return matthews_corrcoef(golds, preds)


@register_aggregation("bleu")