
Threat model for LLMs

A Realistic Threat Model for Large Language Model Jailbreaks

Valentyn Boreiko*, Alexander Panfilov*, Vaclav Voracek, Matthias Hein†, Jonas Geiping†

Read the Paper

* Joint first authors | † Joint senior authors

📰 Latest News 📰

  • [2024/10] 🚀 Initial release 🚀

What Are Our Threat Model and Adaptive Attacks?

We propose a realistic threat model that constrains jailbreak attacks in both text perplexity (measured with an N-gram language model) and compute (FLOPs), and we adapt popular attacks to satisfy these constraints.

🌐 Overview 🌐

(Overview figure: adaptive attacks under the proposed threat model.)

☕ Quick Start ☕

You might need to make the .sh scripts executable first, e.g. chmod +x script_name.sh.

⚙️ Installation

git clone https://github.com/valentyn1boreiko/llm-threat-model.git
cd llm-threat-model 
./install_environment.sh
./download_unpack_ngrams.sh

🗡️ Step 1 - Run Attacks

The first step is to execute a selected attack method on a specified model.

Supported Attacks:

  • BEAST: A beam-search-based attack that builds an adversarial suffix token by token, guided by the target model's output probabilities.
  • GCG: A gradient-based attack that uses greedy coordinate search over token gradients to optimize an adversarial suffix.
  • AutoDAN: A genetic-algorithm-based attack that evolves stealthy, human-readable jailbreak prompts.
  • PAIR: An attack that uses an attacker LLM to iteratively refine jailbreak prompts based on the target model's responses.

These attacks can be found in the baselines folder and configured with YAML files in the configs/method_configs/ folder.
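For orientation, here is a minimal, hypothetical sketch of what a method config entry might look like. The key names (default_method_hyperparameters, num_steps, adaptive, target_model) are assumptions rather than the shipped schema, so consult the existing YAML files in configs/method_configs/ for the exact fields each attack expects.

# Hypothetical sketch of a method config entry -- key names are assumptions,
# not the shipped schema; see configs/method_configs/ for the real files.
default_method_hyperparameters:
  num_steps: 100            # attack iterations per behavior
  adaptive: True            # enforce the N-gram perplexity and FLOPs constraints

vicuna_7b_v1_5_fast:        # per-experiment overrides, keyed by experiment name
  target_model:
    model_name_or_path: lmsys/vicuna-7b-v1.5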

Supported Models:

You can run attacks on a variety of pre-trained language models. Below are some of the supported models:

  • LLaMA: Versions 2, 3, 3.1, and 3.2 with sizes ranging from 1B to 70B, safety-tuned.
  • Vicuna: Both 7B and 13B, version 1.5, optimized for chat-based applications.
  • StableLM Zephyr: A lightweight, robust model focused on resource efficiency.
  • Starling: 7B chat models, available in both alpha and beta variants.
  • Gemma: Versions 1 and 2 with sizes ranging from 2B to 9B, safety-tuned.
  • R2D2: A model proposed in [1], adversarially safety-tuned from Zephyr-7B.

These models can be found in the corresponding model configurations defined in the YAML files under configs/model_configs/.

Recommended Models with Fast Tokenization

We recommend using models with fast tokenization. Here are some common choices:

  • vicuna_7b_v1_5_fast
  • vicuna_13b_v1_5_fast
  • starling_lm_7B_alpha_fast
  • starling_lm_7B_beta_fast
  • llama2_7b_fast
  • llama2_13b_fast
  • llama2_70b_fast
  • llama3_8b_fast
  • llama3_1_8b_fast
  • llama3_70b_fast
  • gemma_2b_fast
  • gemma_7b_it_fast
  • gemma2_2b_it_fast
  • gemma2_9b_it_fast
  • r2d2_fast
  • llama3_2_1b_fast
  • llama3_2_3b_fast

Command Breakdown:

To run an attack on a model, you need to specify the following:

  1. gpu_ids: GPU IDs used for model execution (e.g., 0,1,2).
  2. method_name: The name of the attack method you wish to run (e.g., BEAST).
  3. huggingface_api_key: The API key for accessing Hugging Face models (replace YOUR_TOKEN_HERE with your actual API key).
  4. experiment_name: The name of the experiment, typically referring to the model (e.g., vicuna_13b_v1_5_fast).
  5. adaptive_flag: (Optional) If included, enables the adaptive attack.
  6. wandb_run_name: The name of the wandb run; the same name is reused in the next step.
  7. delete_prev_wandb_run_name: (Optional) If included, deletes previous results stored under the same run name to save space.

Command:

# Run the BEAST attack on the Vicuna model with specified behaviors
./scripts/run_attack.sh \
    --gpu_ids 0,1,2 \
    --method_name BEAST \
    --huggingface_api_key YOUR_TOKEN_HERE \
    --experiment_name vicuna_7b_v1_5_fast \
    --adaptive_flag \
    --wandb_run_name vicuna_7b_v1_5_fast_BEAST \
    --delete_prev_wandb_run_name vicuna_7b_v1_5_fast_BEAST > log_BEAST

🔄 Step 2 - Aggregate the Results

In this step, aggregate the attack outcomes using the same wandb_run_name as in Step 1. This ensures all generated data is summarized for analysis.

Command:

./scripts/aggregate_results.sh --wandb_run_name vicuna_7b_v1_5_fast_BEAST

📊 Step 3 - Generate Completions

Generate model completions based on a specified DataFrame of jailbreak attempts from the previous step, using configurations saved under the ./results/ directory.

Command:

./scripts/generate_completions.sh \
    --gpu_ids 0,1,2 \
    --df_name DF_NAME \
    --huggingface_api_key YOUR_TOKEN_HERE 

DF_NAME is the name of a .csv file generated in step 2. For example --df_name gemma2_2b_it_fast_PRS_20241018_104852.csv.

🔍 Step 4 - Evaluate with a Judge

Generate evaluations with the HarmBench judge cais/HarmBench-Llama-2-13b-cls [1] on a specified DataFrame from the previous steps, e.g., --df_name gemma2_2b_it_fast_PRS_20241018_104852.csv.

Command:

./scripts/evaluate_completions.sh \
    --gpu_ids 0,1,2 \
    --df_name DF_NAME \
    --huggingface_api_key YOUR_TOKEN_HERE \
    --model_id cais/HarmBench-Llama-2-13b-cls 

➕ Using your own model

You can add new Hugging Face transformers models by adding an entry for your model in configs/model_configs/models.yaml, as sketched below. The model can then be evaluated with most red teaming methods without modifying the method configs.
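As a rough sketch, assuming a HarmBench-style schema (the key names below, such as model_name_or_path and use_fast_tokenizer, are modeled on typical entries and may differ here, so mirror an existing entry in models.yaml rather than copying this verbatim), a new model entry could look like:

# Hypothetical entry for configs/model_configs/models.yaml -- key names are
# assumptions; copy the structure of an existing entry for the exact schema.
my_model_7b_fast:
  model:
    model_name_or_path: my-org/my-model-7b-instruct   # any HF transformers model
    use_fast_tokenizer: True
    dtype: bfloat16
  num_gpus: 1

The top-level key (here my_model_7b_fast) is presumably what you then pass as --experiment_name in Step 1, mirroring how the existing model names above are used.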

➕ Adding/Customizing your own red teaming methods in HarmBench

All of the red teaming methods are implemented in baselines, imported through baselines/__init__.py, and managed by configs/method_configs. You can build on top of existing red teaming methods or add new ones by creating a new subfolder in the baselines directory.

🙏 Acknowledgements and Citation 🙏

We thank the following open-source repositories:

[1] https://github.com/centerforaisafety/HarmBench
[2] https://github.com/llm-attacks/llm-attacks
[3] https://github.com/tml-epfl/llm-adaptive-attacks
[4] https://github.com/vinusankars/BEAST
[5] https://github.com/patrickrchao/JailbreakingLLMs
[6] https://github.com/RICommunity/TAP
[7] https://github.com/SheltonLiu-N/AutoDAN
[8] https://github.com/lm-sys/FastChat/tree/main/fastchat
[9] https://github.com/ray-project/ray
[10] https://github.com/vllm-project/vllm
[11] https://github.com/huggingface/transformers

If you find this useful in your research, please consider citing our paper:

@article{boreiko2024llmthreatmodel,
    title = {A Realistic Threat Model for Large Language Model Jailbreaks},
    author = {Boreiko, Valentyn and Panfilov, Alexander and Voracek, Vaclav and Hein, Matthias and Geiping, Jonas},
    journal = {arXiv preprint},
    year = {2024}
}