aiera-benchmark-tasks

This repository holds public-facing LLM benchmark tasks for use with EleutherAI's lm-evaluation-harness. A leaderboard for these tasks is available on Hugging Face.

Tasks included:

  • aiera_speaker_assign: Assignment of speakers to event transcript segments and identification of speaker changes. Dataset available on Hugging Face.
  • aiera_ect_sum: Abstractive summarization of earnings call transcripts. Dataset available on Hugging Face.
  • finqa: Calculation-based Q&A over financial text. Dataset available on Hugging Face.
  • aiera_transcript_sentiment: Event transcript segments labeled with financial sentiment. Dataset available on Hugging Face.

Note

The evaluation criteria were designed to be extremely permissive, accounting for the verbosity of chat-model output by stripping extraneous text in the post-processing functions defined in each task's utils.py. You may find that this is not adequate for your evaluation use case and want to rewrite these functions to evaluate the model's ability to follow formatting instructions.
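As a rough illustration, a permissive filter for the sentiment task might look like the sketch below; the function name and regex here are hypothetical, and the real logic lives in each task's utils.py.

import re

# Hypothetical sketch of a permissive post-processing filter: pull a sentiment
# label out of a verbose chat-model response, falling back to the raw text.
def extract_sentiment_label(model_output: str) -> str:
    match = re.search(r"\b(positive|negative|neutral)\b", model_output.lower())
    return match.group(1) if match else model_output.strip().lower()

# e.g. "Sure! The sentiment of this segment is Positive." -> "positive"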

How to use

Set up the environment with conda:

conda env create -f environment.yml
conda activate aiera-benchmarking-tasks

Next, set up the lm-evaluation-harness submodule:

git submodule update --init
pip install -e lm-evaluation-harness
pip install -e "lm-evaluation-harness[api]"
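The examples below use OpenAI's chat completions API, so make sure your API key is exported in your environment before running them:

export OPENAI_API_KEY=<your key>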

Now you can run individual tasks using the standard lm_eval command line:

lm_eval --model openai-chat-completions \
    --model_args model=gpt-4-turbo-2024-04-09 \
    --tasks aiera_ect_sum,aiera_speaker_assign,aiera_transcript_sentiment,finqa \
    --include_path tasks

Or programmatically from Python:

from lm_eval import tasks, simple_evaluate
from lm_eval.models.openai_completions import OpenaiChatCompletionsLM

# Wrap the OpenAI chat completions API as an lm-evaluation-harness model
model = OpenaiChatCompletionsLM(model="gpt-4-turbo-2024-04-09")

# Register only the task configs from this repo's tasks/ directory
task_manager = tasks.TaskManager(include_path="tasks", include_defaults=False)

results = simple_evaluate(
    model=model,
    tasks=["aiera_ect_sum", "aiera_speaker_assign", "aiera_transcript_sentiment", "finqa"],
    num_fewshot=0,
    task_manager=task_manager,
    write_out=True,
)
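The returned results dictionary holds per-task metrics. Assuming you are on a recent version of the harness, its make_table helper (the same one the CLI uses to print summaries) can render them for quick inspection:

from lm_eval.utils import make_table

# Print a per-task summary table of the evaluation results
print(make_table(results))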

Alternatively, you can run all tasks at once using the aiera_benchmark group:

lm_eval --model openai-chat-completions \
    --model_args model=gpt-4-turbo-2024-04-09 \
    --tasks aiera_benchmark \
    --include_path tasks
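To persist scores and per-sample model outputs, the harness's standard output flags should also work with these tasks, for example:

lm_eval --model openai-chat-completions \
    --model_args model=gpt-4-turbo-2024-04-09 \
    --tasks aiera_benchmark \
    --include_path tasks \
    --output_path results \
    --log_samples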
