Skip to content

wll199566/llm-judge-eval

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Introduction

This repository contains code for our paper Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates. [arXiv]

Evaluation Framework

In this work, we systematically evaluate LLM-as-a-Judge methodology on two LLM alignment datasets (i.e TL;DR Summerization and HH-RLHF-Helpful):

  • we define evaluation metrics with improved theoretical interpretability.
  • we develop a framework to evaluate, compare, and visualize the reliability and alignment of LLM judges.
  • we investigate the effect of diverse prompt templates on LLM-judge reliability.
  • our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.

Package installation

Run the following command to install the required Python packages.

# The python environment has been tested on python=3.8, 3.8.19, 3.9.6
pip install -r requirements.txt

Dataset Preprocessing

Use the following command to prepare a formatted dataset for the LLM judge evaluation process. The default dir to save the processed dataset ./datasets/formatted_datasets. The dataset_id identifies the formatted dataset, which is better kept consistent in the following steps.

python datasets/data_preprocessing.py \      
--data-path datasets/raw_datasets/ \         # directory to save downloaded dataset from the original data source
--output-dir datasets/formatted_datasets/ \  # directory to save the processed datasets
--dataset-id summarize                       # summarize, hhrlhf_helpful

Add OpenAI Key

Add your own OpenAI key to configs/openai_api_key.py in order to evaluate LLM judges.

Evaluate a Set of LLM Judges by Metric Computation and Visualization

Use the example below to evaluate a set of LLM judges using the example dataset dataset_id=summarize. The templates are specified in templates/dataset_id folders.

python eval/eval_llm_judges.py \
--processed_data_path ./datasets/formatted_datasets/summarize/data.summarize.xxxx-xx-xx.jsonl \  # data path to the preprocessed dataset
--dataset_id summarize \  # dataset task (summarize or hhrlhf-helpful)
--split_size 200 \        # number of samples in each split
--num_splits 5 \          # number of splits
--self_consist_id 0 \     # index of split used to compute self-consistency results
--num_runs 5 \            # number of repetition to run the split to compute the self-consistency results                                                                     
--num_eval -1 \           # number of evaluated samples in each split (-1 means all samples in the split)
--models "['gpt-4o-mini']" \  # list of LLM names
--templates "['chen-2023_summarize', 'guo-2024_summarize']" \  # list of templates
--extract_rule combine \  # rule to make binary output from the judging results ("combine", "chosen_reject" or "reject_chosen")
--temperature 0.1 \       # temperature parameter used for LLM inference
--num_workers 8 \         # number of processes to run judging results in parallel
--use_cache_samples \     # if use cached sampling results
--use_cache_results \     # if use cached computation and visualization results
--cache_dir ./outputs/    # directory to store the output results

Example Evaluation Results

Metric Report Tables

Metric report tables related to evaluating LLM judges (model:GPT-4o with different templates) on the TL;DR Summarization dataset.

Accuracy (Both)

Visualization Results

Visualization results related to evaluating LLM judges (models + different templates) on the TL;DR Summarization dataset.

References

@article{wei2024systematic,
  title={Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates},
  author={Wei, Hui and He, Shenghua and Xia, Tian and Wong, Andy and Lin, Jingyang and Han, Mei},
  journal={arXiv preprint arXiv:2408.13006},
  year={2024}
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 98.9%
  • Shell 1.1%