This repository holds public-facing LLM benchmark tasks for use with EleutherAI's lm-evaluation-harness. A leaderboard for these tasks is available on huggingface here.
Tasks included:
- aiera_speaker_assign: Assignments of speakers to event transcript segments and identification of speaker changes. Dataset available on huggingface.
- aiera_ect_sum: Abstractive summarizations of earnings call transcripts. Dataset available on huggingface.
- finqa: Calculation-based Q&A over financial text. Dataset available on huggingface.
- aiera_transcript_sentiment: Event transcript segments with labels indicating the financial sentiment. Dataset available on huggingface.
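If you want to inspect the underlying data directly, each dataset can be loaded with the Hugging Face datasets library. The dataset ID below is a placeholder; substitute the ID from the corresponding dataset card linked above:

```python
from datasets import load_dataset

# "org/dataset-name" is a placeholder -- use the ID from the dataset card linked above
ds = load_dataset("org/dataset-name")
print(ds)  # shows the available splits and features
```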
The evaluation criteria were designed to be extremely permissive: verbose chat-model output is accounted for by stripping extraneous text in post-processing functions defined in each task's utils.py. You may find this too lenient for your use case and prefer to rewrite those functions so that the evaluation also tests the model's ability to follow the requested output format.
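For illustration, a permissive post-processing hook of this kind might look like the sketch below. This is a hypothetical example (the cleaning rules and the `answer` field name are assumptions), not the actual code shipped in the tasks' utils.py files:

```python
import re

def strip_extraneous_text(response: str) -> str:
    """Hypothetical cleanup: reduce a verbose chat response to the bare answer."""
    response = response.strip()
    # Drop common chat preambles such as "Sure, the answer is ..."
    response = re.sub(r"^(sure|certainly|here is|the answer is)[,:\s]*", "",
                      response, flags=re.IGNORECASE)
    # Remove surrounding quotes, backticks, and stray whitespace
    return response.strip("`\"' \n")

def process_results(doc: dict, results: list) -> dict:
    # Signature used by lm-evaluation-harness process_results hooks:
    # score the cleaned model output against the gold answer (assumed key "answer").
    prediction = strip_extraneous_text(results[0])
    return {"exact_match": int(prediction == doc["answer"])}
```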
Set up the environment with conda:
conda env create -f environment.yml
conda activate aiera-benchmarking-tasks
Next, set up the lm-evaluation-harness submodule:
git submodule update --init
pip install -e lm-evaluation-harness
pip install -e "lm-evaluation-harness[api]"
Now you can run individual tasks using the standard lm_eval command line:
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo-2024-04-09 \
--tasks aiera_ect_sum,aiera_speaker_assign,aiera_transcript_sentiment,finqa \
--include_path tasks
Or run the tasks programmatically with Python:
from lm_eval import tasks, simple_evaluate
from lm_eval.models.openai_completions import OpenaiChatCompletionsLM

model = OpenaiChatCompletionsLM("gpt-4-turbo-2024-04-09")

# Only load task definitions from the local tasks/ directory
task_manager = tasks.TaskManager(include_path="tasks", include_defaults=False)

results = simple_evaluate(
    model=model,
    tasks=["aiera_ect_sum", "aiera_speaker_assign", "aiera_transcript_sentiment", "finqa"],
    num_fewshot=0,
    task_manager=task_manager,
    write_out=True,
)
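The returned dictionary keeps per-task metric scores under its "results" key, which you can inspect directly, for example (a minimal sketch):

```python
import json

# simple_evaluate returns a dict; per-task metric values live under "results"
print(json.dumps(results["results"], indent=2, default=str))
```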
Alternatively, you can run all of the tasks at once:
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo-2024-04-09 \
--tasks aiera_benchmark \
--include_path tasks