🖋 Authors: Ming Zhong*, Aston Zhang*, Xuewei Wang, Rui Hou, Wenhan Xiong, Chenguang Zhu, Zhengxing Chen, Liang Tan, Chloe Bi, Mike Lewis, Sravya Popuri, Sharan Narang, Melanie Kambadur, Dhruv Mahajan, Sergey Edunov, Jiawei Han, Laurens van der Maaten
In real-world scenarios, many tasks require the intersection of multiple distinct capabilities across different types of expertise, which we refer to as cross capabilities. To explore this concept in the context of Large Language Models (LLMs), we present CrossEval, a benchmark consisting of 1,400 expert-annotated prompts, 4,200 model-generated responses, and 8,400 human ratings with explanations. CrossEval is designed to evaluate the performance of LLMs across the following 14 capabilities:
- English
- Reasoning
- Coding
- Image Recognition
- Tool Use
- Long Context
- Spanish
- Coding & Reasoning
- Image Recognition & Reasoning
- Tool Use & Coding
- Tool Use & Reasoning
- Long Context & Coding
- Spanish & Reasoning
- Spanish & Image Recognition
To get started, follow these steps to set up your Python environment:
conda create --name crosseval python=3.10
conda activate crosseval
pip install -r requirements.txt
The CrossEval dataset is hosted on Hugging Face. You can load it as follows:
from datasets import load_dataset
dataset = load_dataset("MingZhong/crosseval", split="test")
Each instance in the dataset contains the following fields (a minimal inspection sketch follows this list):
- prompt_id: Unique identifier for the prompt across capabilities
- capability: One of the 14 capabilities involved in the user prompt
- difficulty: Difficulty level of the prompt, categorized as 10% easy, 30% medium, 60% hard
- l1_category: High-level category for the user prompt
- l2_category: Subcategory for the user prompt
- prompt: The user-provided prompt text
- attached_file: URL of any attached file (used in image, long context, or tool use tasks)
- response_i: Model-generated responses (where i=1,2,3 for multiple responses)
- response_i_human_j_rating: Human rating on a scale of 1-5 for each response (where j=1,2 for multiple annotators)
- response_i_human_j_explanation: Human-provided explanations for the given rating
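As a quick sanity check, here is a minimal sketch for inspecting a loaded instance; it assumes the field names appear exactly as documented above (e.g., response_1_human_1_rating):
import datasets
from datasets import load_dataset

# Load the CrossEval test split from Hugging Face
dataset = load_dataset("MingZhong/crosseval", split="test")

# Inspect the first instance using the fields documented above
example = dataset[0]
print(example["prompt_id"], example["capability"], example["difficulty"])
print(example["prompt"][:200])  # truncate long prompts for readability

# Human ratings for the first model response (j = 1, 2 annotators)
ratings = [example[f"response_1_human_{j}_rating"] for j in (1, 2)]
print("Human ratings for response 1:", ratings)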
To generate responses from different LLMs on the CrossEval dataset, follow the steps below.
First, set up your API keys for the specified LLMs, ensuring compliance with the respective third-party terms of use.
export OPENAI_KEY="your_openai_api_key_here" # GPT
export ANTHROPIC_API_KEY="your_claude_api_key_here" # Claude
export GOOGLE_API_KEY="your_google_api_key_here" # Gemini
For tool use prompts involving code execution, responses may include generated files (e.g., plots). In these cases, the files are uploaded to a Hugging Face repository, and the URLs are included in the responses. Therefore, if you intend to run these prompts, you’ll also need to configure your Hugging Face API key:
export HF_KEY="your_huggingface_api_key_here"
Additionally, specify in the generation code the account name and repository where the generated files will be saved.
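To illustrate this upload flow (a sketch only, not the repository's implementation), a generated file could be pushed to a Hugging Face repo and referenced by URL roughly as follows; the repo_id and file paths are placeholders:
import os
from huggingface_hub import HfApi

# Placeholder account/repo and file paths, for illustration only
repo_id = "your-account/crosseval-generated-files"

api = HfApi(token=os.environ["HF_KEY"])
api.upload_file(
    path_or_fileobj="outputs/plot.png",  # locally generated file
    path_in_repo="plots/plot.png",
    repo_id=repo_id,
    repo_type="dataset",
)

# URL that could be embedded in the model response
file_url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/plots/plot.png"
print(file_url)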
Here’s an example script to generate responses using GPT-4o:
RESPONSE_DIR=outputs/responses
MODEL_NAME="gpt"
MODEL_VERSION="gpt-4o-2024-05-13"
python generate_response/generate.py \
--save_path="${RESPONSE_DIR}/${MODEL_VERSION}.csv" \
--model="${MODEL_NAME}" \
--model_version="${MODEL_VERSION}" \
--enable_code_interpreter
Alternatively, you can execute the generation process using:
./scripts/get_response.sh
Notes:
- Update the MODEL_NAME and MODEL_VERSION in the script to match the specific model you want to evaluate.
- Model responses are saved as {MODEL_VERSION}.csv in the RESPONSE_DIR directory.
- The script supports resuming from the last processed instance if interrupted. Re-run the script to resume where it left off.
To evaluate the generated responses, execute the following command:
MODEL_VERSION="gpt-4o-2024-05-13"
RESPONSE_DIR=outputs/responses
SCORE_DIR=outputs/scores
EVALUATOR=gpt
python evaluation/evaluate_response.py \
--response=${MODEL_VERSION}_response \
--response_file=${RESPONSE_DIR}/${MODEL_VERSION}.csv \
--save_path=${SCORE_DIR}/${MODEL_VERSION}_response_${EVALUATOR}_score.csv \
--evaluator=${EVALUATOR}
Alternatively, you can run:
./scripts/evaluate.sh
Notes:
- The script supports resuming from the last processed instance in case of an error. Simply re-run the script to continue the evaluation.
- The script will print the average scores for each capability after evaluation (a sketch for recomputing these from the saved scores follows this list).
- Detailed scores for each prompt are saved in the SCORE_DIR directory.
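If you want to recompute the per-capability averages yourself, a minimal sketch over the saved score file might look like the following; the file name and column names ("capability", "score") are assumptions and may differ from the actual CSV headers:
import pandas as pd

# Assumed file name and column names; adjust to the actual headers in your
# score file if they differ
scores = pd.read_csv("outputs/scores/gpt-4o-2024-05-13_response_gpt_score.csv")

# Average score per capability, mirroring what the evaluation script prints
per_capability = scores.groupby("capability")["score"].mean().round(2)
print(per_capability)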
We provide the responses and evaluations for the GPT and Llama model families on the CrossEval benchmark, available in outputs/scores for reference.
Additionally, CrossEval is the largest meta-evaluation benchmark for examining correlations between LLM and human ratings. We release the LLM-generated ratings for reference responses in the outputs/correlations directory.
To compute correlation metrics between LLM and human ratings, run:
./scripts/get_correlation.sh
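For reference, the underlying metrics can also be computed directly with scipy; the sketch below assumes a flat CSV with one human and one LLM rating column, which is an assumption about the file layout rather than the repository's actual format:
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau

# Assumed flat CSV with "human_rating" and "llm_rating" columns; the files in
# outputs/correlations may use different headers
df = pd.read_csv("outputs/correlations/example_ratings.csv")
human, llm = df["human_rating"], df["llm_rating"]

# Standard correlation metrics between LLM and human ratings
print("Pearson: ", pearsonr(human, llm)[0])
print("Spearman:", spearmanr(human, llm)[0])
print("Kendall: ", kendalltau(human, llm)[0])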
The CrossEval benchmark is primarily intended to aid model research in the categorization, classification, or organization of data. This code and data are made available under a CC-BY-NC license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.