理解、处理长文本,是大模型迈向更深层次理解与交互阶段必备的能力。现已有大模型声称可以处理100k+的长序列,但是对应的标准评测集却是空缺的。为此,我们构建了一个面向 100k+ 的评测集,InfiniteBench。该评测集针对大模型在长文本方面的五项能力而设计:检索、数学、代码、问答、和摘要。
- 长上下文: InfiniteBench 测试数据的平均上下文长度为195k,远超现有评测数据。
- 多领域多语言: InfiniteBench 评测集包含12个任务,包括中英双语,涵盖了检索、数学、代码、问答、和摘要等5个领域。
- 前瞻性挑战性: InfiniteBench 测试任务,对标当前最强的模型如 GPT-4, Claude 2 等。
- 真实场景与合成场景: InfiniteBench 既包含真实场景数据,探测大模型在处理实际问题的能力;也包含合成数据,为测试数据拓展上下文窗口提供了便捷。
Task Name | Context | # Examples | Avg Input Tokens | Avg Output Tokens | Description |
---|---|---|---|---|---|
En.Sum | Fake Book | 103 | 171.5k | 1.1k | Summarization of a fake book created with core entity substitution. |
En.QA | Fake Book | 351 | 192.6k | 4.8 | Free-form question answering based on the fake book. |
En.MC | Fake Book | 229 | 184.4k | 5.3 | Multiple choice questions derived from the fake book. |
En.Dia | Script | 200 | 103.6k | 3.4 | Identification of talkers in partially anonymized scripts. |
Zh.QA | New Book | 175 | 2068.6k | 6.3 | Question answering on a set of newly collected books. |
Code.Debug | Code Document | 394 | 114.7k | 4.8 | Finding which function in a code repo contains an crashing error (in multiple choice form). |
Code.Run | Synthetic | 400 | 75.2k | 1.3 | Simulating execution of multiple simple, synthetic functions. |
Math.Calc | Synthetic | 50 | 43.9k | 43.9k | Calculations involving super-long arithmetic equations. |
Math.Find | Synthetic | 350 | 87.9k | 1.3 | Finding special integers in a lengthy list. |
Retrieve.PassKey1 | Synthetic | 590 | 122.4k | 2.0 | Retrieving hidden keys in a noisy long context. |
Retrieve.Number | Synthetic | 590 | 122.4k | 4.0 | Locating repeated hidden numbers in a noisy long context. |
Retrieve.KV2 | Synthetic | 500 | 89.9k | 22.7 | Finding the corresponding value from a dictionary and a key. |
我们在 SOTA 模型上评测了 InfiniteBench 结果如下:
Task Name | GPT-4 | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K |
---|---|---|---|---|---|---|---|
Retrieve.PassKey | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% |
Retrieve.Number | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% |
Retrieve.KV | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% |
En.Sum | 14.73% | 9.09% | 17.96% | 14.50% | < 5% | < 5% | < 5% |
En.QA | 22.44% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% | < 5% |
En.MC | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% | 38.43% | 10.48% |
En.Dia | 8.50% | 7.50% | 11.50% | 46.50% | < 5% | < 5% | < 5% |
Zh.QA | 25.96% | 16.98% | 17.93% | 9.64% | 15.07% | 13.61% | < 5% |
Code.Debug | 37.06% | < 5% | 17.77% | < 5% | 9.14% | 13.96% | 7.36% |
Code.Run | 23.25% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
Math.Calc | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
Math.Find | 60.00% | 17.14% | 12.57% | 32.29% | < 5% | 25.71% | 7.71% |
注:
- YaRN-Mistral-7B 实现代码已开源在仓库,请大家批评指正;Kimi-Chat 和 Claude 2 使用用户界面评测,GPT-4 使用 API 评测,均使用官方默认配置。
从 https://huggingface.co/datasets/xinrongzhang2022/InfiniteBench 下载数据集到 infinitebench/data
路径下(我们将评测数据集放在 InfiniteBench 目录下),得到文件如下:
InfiniteBench
├── data
│ ├── code_debug.jsonl
│ ├── code_run.jsonl
│ ├── kv_retrieval.jsonl
│ ├── longbook_choice_eng.jsonl
│ ├── longbook_qa_chn.jsonl
│ ├── longbook_qa_eng.jsonl
│ ├── longbook_sum_eng.jsonl
│ ├── longdialogue_qa_eng.jsonl
│ ├── math_calc.jsonl
│ ├── math_find.jsonl
│ ├── number_string.jsonl
│ ├── passkey.jsonl
│ └── construct_synthetic_dataset.py
...
或者使用 Datasets 下载:
from datasets import load_dataset, Value, Sequence
ft = Features({"id": Value("int64"), "context": Value("string"), "input": Value("string"), "answer": Sequence(Value("string")), "options": Sequence(Value("string"))})
dataset = load_dataset("xinrongzhang2022/InfiniteBench", features=ft)
pip install -r requiremnets.txt
比如,评测 GPT-4 在 Retrieve.PassKey 任务上的表现:
cd src
python eval_gpt4.py --task passkey
可以选择的 --task
有:
passkey
number_string
kv_retrieval
longbook_sum_eng
longbook_qa_eng
longbook_qa_chn
longbook_choice_eng
longdialogue_qa_eng
math_calc
math_find
code_debug
code_run
python compute_scores.py
This will be updated when our preprint paper is released.
@inproceedings{zhang-etal-2024-bench,
title = "$\infty${B}ench: Extending Long Context Evaluation Beyond 100{K} Tokens",
author = "Zhang, Xinrong and
Chen, Yingfa and
Hu, Shengding and
Xu, Zihang and
Chen, Junhao and
Hao, Moo and
Han, Xu and
Thai, Zhen and
Wang, Shuo and
Liu, Zhiyuan and
Sun, Maosong",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.814",
pages = "15262--15277",
abstract = "Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently a lack of a standardized benchmark to evaluate this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs in processing longer contexts. In this paper, we propose , the first LLM benchmark featuring an average data length surpassing 100K tokens. comprises synthetic and realistic tasks spanning diverse domains in English and Chinese. The tasks in are designed to require an understanding of long dependencies in contexts and make simply retrieving a limited number of passages from contexts not sufficient for these tasks. Based on , we evaluate several state-of-the-art LLMs tailored for processing long contexts. The experimental results indicate that existing long-context LLMs still require significant advancements to process 100K+ contexts effectively. Furthermore, we present three intriguing analyses regarding the behavior of LLMs processing long context. Our code and data is released.",
}