This repository holds public-facing LLM benchmark tasks for use with EleutherAI's lm-evaluation-harness. A leaderboard for these tasks is available on huggingface here.
Tasks included:
- aiera_speaker_assign: Assignments of speakers to event transcript segments and identification of speaker changes. Dataset available on huggingface.
- aiera_ect_sum: Abstractive summarizations of earnings call transcripts. Dataset available on huggingface.
- finqa: Calculation-based Q&A over financial text. Dataset available on huggingface.
- aiera_transcript_sentiment: Event transcript segments with labels indicating the financial sentiment. Dataset available on huggingface.
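If you want to inspect the underlying data directly, each dataset can be loaded with the Hugging Face datasets library. The dataset ID below is a placeholder; substitute the ID from the corresponding dataset card linked above:

```python
from datasets import load_dataset

# "org/dataset-name" is a placeholder -- use the ID from the dataset card linked above
ds = load_dataset("org/dataset-name")
print(ds)  # shows the available splits and features
```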
The evaluation criteria were designed to be extremely permissive: verbose chat-model output is accounted for by stripping extraneous text in post-processing functions defined in each task's utils.py. You may find this too lenient for your use case and prefer to rewrite those functions so that the evaluation also tests the model's ability to follow the requested output format.
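For illustration, a permissive post-processing hook of this kind might look like the sketch below. This is a hypothetical example (the cleaning rules and the `answer` field name are assumptions), not the actual code shipped in the tasks' utils.py files:

```python
import re

def strip_extraneous_text(response: str) -> str:
    """Hypothetical cleanup: reduce a verbose chat response to the bare answer."""
    response = response.strip()
    # Drop common chat preambles such as "Sure, the answer is ..."
    response = re.sub(r"^(sure|certainly|here is|the answer is)[,:\s]*", "",
                      response, flags=re.IGNORECASE)
    # Remove surrounding quotes, backticks, and stray whitespace
    return response.strip("`\"' \n")

def process_results(doc: dict, results: list) -> dict:
    # Signature used by lm-evaluation-harness process_results hooks:
    # score the cleaned model output against the gold answer (assumed key "answer").
    prediction = strip_extraneous_text(results[0])
    return {"exact_match": int(prediction == doc["answer"])}
```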
Set up the environment with conda:
conda env create -f environment.yml
conda activate aiera-benchmarking-tasks
Next, set up the lm-evaluation-harness submodule:
git submodule update --init
pip install -e lm-evaluation-harness
pip install -e "lm-evaluation-harness[api]"
Now you can run individual tasks using the standard lm_eval command line:
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo-2024-04-09 \
--tasks aiera_ect_sum,aiera_speaker_assign,aiera_transcript_sentiment,finqa \
--include_path tasks
Or run the tasks programmatically with Python:
from lm_eval import tasks, simple_evaluate
from lm_eval.models.openai_completions import OpenaiChatCompletionsLM

model = OpenaiChatCompletionsLM("gpt-4-turbo-2024-04-09")

# Only load task definitions from the local tasks/ directory
task_manager = tasks.TaskManager(include_path="tasks", include_defaults=False)

results = simple_evaluate(
    model=model,
    tasks=["aiera_ect_sum", "aiera_speaker_assign", "aiera_transcript_sentiment", "finqa"],
    num_fewshot=0,
    task_manager=task_manager,
    write_out=True,
)
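The returned dictionary keeps per-task metric scores under its "results" key, which you can inspect directly, for example (a minimal sketch):

```python
import json

# simple_evaluate returns a dict; per-task metric values live under "results"
print(json.dumps(results["results"], indent=2, default=str))
```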
Alternatively, you can run all of the tasks at once:
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo-2024-04-09 \
--tasks aiera_benchmark \
--include_path tasks