🖋 Authors: Ming Zhong*, Aston Zhang*, Xuewei Wang, Rui Hou, Wenhan Xiong, Chenguang Zhu, Zhengxing Chen, Liang Tan, Chloe Bi, Mike Lewis, Sravya Popuri, Sharan Narang, Melanie Kambadur, Dhruv Mahajan, Sergey Edunov, Jiawei Han, Laurens van der Maaten
In real-world scenarios, many tasks require the intersection of multiple distinct capabilities across different types of expertise, which we refer to as cross capabilities. To explore this concept in the context of Large Language Models (LLMs), we present CrossEval, a benchmark consisting of 1,400 expert-annotated prompts, 4,200 model-generated responses, and 8,400 human ratings with explanations. CrossEval is designed to evaluate the performance of LLMs across the following 14 capabilities:
- English
- Reasoning
- Coding
- Image Recognition
- Tool Use
- Long Context
- Spanish
- Coding & Reasoning
- Image Recognition & Reasoning
- Tool Use & Coding
- Tool Use & Reasoning
- Long Context & Coding
- Spanish & Reasoning
- Spanish & Image Recognition
To get started, follow these steps to set up your Python environment:
conda create --name crosseval python=3.10
conda activate crosseval
pip install -r requirements.txt
The CrossEval dataset is hosted on Hugging Face. You can load it as follows:
from datasets import load_dataset
dataset = load_dataset("MingZhong/crosseval", split="test")
Each instance in the dataset contains the following fields (a minimal inspection sketch follows this list):
- prompt_id: Unique identifier for the prompt across capabilities
- capability: One of the 14 capabilities involved in the user prompt
- difficulty: Difficulty level of the prompt, categorized as 10% easy, 30% medium, 60% hard
- l1_category: High-level category for the user prompt
- l2_category: Subcategory for the user prompt
- prompt: The user-provided prompt text
- attached_file: URL of any attached file (used in image, long context, or tool use tasks)
- response_i: Model-generated responses (where i=1,2,3 for multiple responses)
- response_i_human_j_rating: Human rating on a scale of 1-5 for each response (where j=1,2 for multiple annotators)
- response_i_human_j_explanation: Human-provided explanations for the given rating
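As a quick sanity check, here is a minimal sketch for inspecting a loaded instance; it assumes the field names appear exactly as documented above (e.g., response_1_human_1_rating):
import datasets
from datasets import load_dataset

# Load the CrossEval test split from Hugging Face
dataset = load_dataset("MingZhong/crosseval", split="test")

# Inspect the first instance using the fields documented above
example = dataset[0]
print(example["prompt_id"], example["capability"], example["difficulty"])
print(example["prompt"][:200])  # truncate long prompts for readability

# Human ratings for the first model response (j = 1, 2 annotators)
ratings = [example[f"response_1_human_{j}_rating"] for j in (1, 2)]
print("Human ratings for response 1:", ratings)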
To generate responses from different LLMs on the CrossEval dataset, follow the steps below.
First, set up your API keys for the specified LLMs, ensuring compliance with the respective third-party terms of use.
export OPENAI_KEY="your_openai_api_key_here" # GPT
export ANTHROPIC_API_KEY="your_claude_api_key_here" # Claude
export GOOGLE_API_KEY="your_google_api_key_here" # Gemini
For tool use prompts involving code execution, responses may include generated files (e.g., plots). In these cases, the files are uploaded to a Hugging Face repository, and the URLs are included in the responses. Therefore, if you intend to run these prompts, you’ll also need to configure your Hugging Face API key:
export HF_KEY="your_huggingface_api_key_here"
Additionally, specify in the generation code the account name and repository where the generated files will be saved.
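To illustrate this upload flow (a sketch only, not the repository's implementation), a generated file could be pushed to a Hugging Face repo and referenced by URL roughly as follows; the repo_id and file paths are placeholders:
import os
from huggingface_hub import HfApi

# Placeholder account/repo and file paths, for illustration only
repo_id = "your-account/crosseval-generated-files"

api = HfApi(token=os.environ["HF_KEY"])
api.upload_file(
    path_or_fileobj="outputs/plot.png",  # locally generated file
    path_in_repo="plots/plot.png",
    repo_id=repo_id,
    repo_type="dataset",
)

# URL that could be embedded in the model response
file_url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/plots/plot.png"
print(file_url)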
Here’s an example script to generate responses using GPT-4o:
RESPONSE_DIR=outputs/responses
MODEL_NAME="gpt"
MODEL_VERSION="gpt-4o-2024-05-13"
python generate_response/generate.py \
--save_path="${RESPONSE_DIR}/${MODEL_VERSION}.csv" \
--model="${MODEL_NAME}" \
--model_version="${MODEL_VERSION}" \
--enable_code_interpreter
Alternatively, you can execute the generation process using:
./scripts/get_response.sh
Notes:
- Update the MODEL_NAME and MODEL_VERSION in the script to match the specific model you want to evaluate.
- Model responses are saved as {MODEL_VERSION}.csv in the RESPONSE_DIR directory.
- The script supports resuming from the last processed instance if interrupted. Re-run the script to resume where it left off.
To evaluate the generated responses, execute the following command:
MODEL_VERSION="gpt-4o-2024-05-13"
RESPONSE_DIR=outputs/responses
SCORE_DIR=outputs/scores
EVALUATOR=gpt
python evaluation/evaluate_response.py \
--response=${MODEL_VERSION}_response \
--response_file=${RESPONSE_DIR}/${MODEL_VERSION}.csv \
--save_path=${SCORE_DIR}/${MODEL_VERSION}_response_${EVALUATOR}_score.csv \
--evaluator=${EVALUATOR}
Alternatively, you can run:
./scripts/evaluate.sh
Notes:
- The script supports resuming from the last processed instance in case of an error. Simply re-run the script to continue the evaluation.
- The script will print the average scores for each capability after evaluation (a sketch for recomputing these from the saved scores follows this list).
- Detailed scores for each prompt are saved in the SCORE_DIR directory.
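If you want to recompute the per-capability averages yourself, a minimal sketch over the saved score file might look like the following; the file name and column names ("capability", "score") are assumptions and may differ from the actual CSV headers:
import pandas as pd

# Assumed file name and column names; adjust to the actual headers in your
# score file if they differ
scores = pd.read_csv("outputs/scores/gpt-4o-2024-05-13_response_gpt_score.csv")

# Average score per capability, mirroring what the evaluation script prints
per_capability = scores.groupby("capability")["score"].mean().round(2)
print(per_capability)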
We provide the responses and evaluations for the GPT and Llama model families on the CrossEval benchmark, available in outputs/scores for reference.
Additionally, CrossEval is the largest meta-evaluation benchmark for examining correlations between LLM and human ratings. We release the LLM-generated ratings for reference responses in the outputs/correlations directory.
To compute correlation metrics between LLM and human ratings, run:
./scripts/get_correlation.sh
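For reference, the underlying metrics can also be computed directly with scipy; the sketch below assumes a flat CSV with one human and one LLM rating column, which is an assumption about the file layout rather than the repository's actual format:
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau

# Assumed flat CSV with "human_rating" and "llm_rating" columns; the files in
# outputs/correlations may use different headers
df = pd.read_csv("outputs/correlations/example_ratings.csv")
human, llm = df["human_rating"], df["llm_rating"]

# Standard correlation metrics between LLM and human ratings
print("Pearson: ", pearsonr(human, llm)[0])
print("Spearman:", spearmanr(human, llm)[0])
print("Kendall: ", kendalltau(human, llm)[0])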
The CrossEval benchmark is primarily intended to aid model research in the categorization, classification, or organization of data. This code and data are made available under a CC-BY-NC license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.