Valentyn Boreiko*, Alexander Panfilov*, Vaclav Voracek, Matthias Hein†, Jonas Geiping†
* Joint first authors | † Joint senior authors
- [2024/10] 🚀 Initial release 🚀
We propose a realistic threat model that constrains attacks via an N-gram perplexity measure and a FLOPs budget, and we develop attacks adaptive to it.
You might need to call `chmod +777 script_name.sh` on your .sh scripts.
git clone https://github.com/valentyn1boreiko/llm-threat-model.git
cd llm-threat-model
./install_environment.sh
./download_unpack_ngrams.sh
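After cloning, you may need to make the shell scripts executable before running them (see the chmod note above); a minimal sketch:

```bash
# Make the top-level install/download scripts and the per-step scripts executable in one go.
chmod +x ./*.sh ./scripts/*.sh
```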
The first step is to execute a selected attack method on a specified model.
Supported Attacks:
- BEAST: A gradient-free, beam-search-based attack that iteratively appends adversarial tokens to the prompt.
- GCG: A gradient-based attack (Greedy Coordinate Gradient) that optimizes an adversarial suffix using model gradients.
- AutoDAN: A genetic-algorithm-based attack that evolves stealthy, human-readable jailbreak prompts.
- PAIR: An attacker-LLM-based attack (Prompt Automatic Iterative Refinement) that iteratively refines jailbreak prompts based on the target model's responses.
These attacks can be found in the baselines folder and are configured with YAML files in the configs/method_configs/ folder.
Supported Models:
You can run attacks on a variety of pre-trained language models. Below are some of the supported models:
- LLaMA: Versions 2, 3, 3.1, and 3.2 with sizes ranging from 1B to 70B, safety-tuned.
- Vicuna: Both 7B and 13B, version 1.5, optimized for chat-based applications.
- StableLM Zephyr: A lightweight, robust model focused on resource efficiency.
- Starling: 7B models available in both alpha and beta variants.
- Gemma: Versions 1 and 2 with sizes ranging from 2B to 9B, safety-tuned.
- R2D2: The model proposed in [1], adversarially safety-tuned from Zephyr-7B.
The corresponding model configurations are defined in the YAML files under configs/model_configs/.
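A quick way to check which attack and model configurations are available (run from the repository root):

```bash
# Attack method configs (one YAML per supported attack):
ls configs/method_configs/
# Model configs (models are defined in models.yaml here):
ls configs/model_configs/
```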
Recommended Models with Fast Tokenization
We recommend using models with fast tokenization. Here are some common choices:
- vicuna_7b_v1_5_fast
- vicuna_13b_v1_5_fast
- starling_lm_7B_alpha_fast
- starling_lm_7B_beta_fast
- llama2_7b_fast
- llama2_13b_fast
- llama2_70b_fast
- llama3_8b_fast
- llama3_1_8b_fast
- llama3_70b_fast
- gemma_2b_fast
- gemma_7b_it_fast
- gemma2_2b_it_fast
- gemma2_9b_it_fast
- r2d2_fast
- llama3_2_1b_fast
- llama3_2_3b_fast
Command Breakdown:
To run an attack on a model, you need to specify the following:
- gpu_ids: GPU IDs used for model execution (e.g., 0,1,2).
- method_name: The name of the attack method you wish to run (e.g., BEAST).
- huggingface_api_key: The API key for accessing Hugging Face models (replace YOUR_TOKEN_HERE with your actual API key).
- experiment_name: The name of the experiment, typically referring to the model (e.g., vicuna_13b_v1_5_fast).
- adaptive_flag: (Optional) If included, enables the adaptive attack.
- wandb_run_name: The name of the wandb run to use in the next step.
- delete_prev_wandb_run_name: (Optional) If included, deletes previous results stored under the same run name to save space.
# Run the BEAST attack on the Vicuna model with specified behaviors
./scripts/run_attack.sh \
--gpu_ids 0,1,2 \
--method_name BEAST \
--huggingface_api_key YOUR_TOKEN_HERE \
--experiment_name vicuna_7b_v1_5_fast \
--adaptive_flag \
--wandb_run_name vicuna_7b_v1_5_fast_BEAST \
--delete_prev_wandb_run_name vicuna_7b_v1_5_fast_BEAST > log_BEAST
In this step, aggregate the attack outcomes using the same wandb_run_name
as in Step 1. This ensures all generated data is summarized for analysis.
./scripts/aggregate_results.sh --wandb_run_name vicuna_7b_v1_5_fast_BEAST
Generate model completions based on a specified DataFrame of jailbreak attempts from the previous step, using configurations saved under the ./results/ directory.
./scripts/generate_completions.sh \
--gpu_ids 0,1,2 \
--df_name DF_NAME \
--huggingface_api_key YOUR_TOKEN_HERE
DF_NAME is the name of a .csv file generated in Step 2, e.g., --df_name gemma2_2b_it_fast_PRS_20241018_104852.csv.
Generate evaluations with the HarmBench judge cais/HarmBench-Llama-2-13b-cls [1], based on a specified DataFrame from the previous steps, e.g., --df_name gemma2_2b_it_fast_PRS_20241018_104852.csv.
./scripts/evaluate_completions.sh \
--gpu_ids 0,1,2 \
--df_name DF_NAME \
--huggingface_api_key YOUR_TOKEN_HERE \
--model_id cais/HarmBench-Llama-2-13b-cls
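Putting the four steps together, a full run might look like the sketch below. The DF_NAME value is a placeholder: the actual .csv filename is produced by the aggregation step (Step 2) and has to be looked up under ./results/ (or in the wandb run) before the completion and evaluation steps.

```bash
#!/usr/bin/env bash
# End-to-end sketch: attack -> aggregate -> generate completions -> judge.
# Only flags shown in the per-step examples above are used; DF_NAME is a
# placeholder for the .csv produced in Step 2.
set -e

GPUS=0,1,2
HF_TOKEN=YOUR_TOKEN_HERE
MODEL=vicuna_7b_v1_5_fast
RUN_NAME=${MODEL}_BEAST

# Step 1: run the adaptive BEAST attack on the selected model.
./scripts/run_attack.sh \
  --gpu_ids "$GPUS" \
  --method_name BEAST \
  --huggingface_api_key "$HF_TOKEN" \
  --experiment_name "$MODEL" \
  --adaptive_flag \
  --wandb_run_name "$RUN_NAME" > log_BEAST

# Step 2: aggregate the attack outcomes for this wandb run.
./scripts/aggregate_results.sh --wandb_run_name "$RUN_NAME"

# Steps 3 and 4: replace DF_NAME with the .csv written under ./results/ in Step 2
# (its name is likely of the form <model>_<method>_<timestamp>.csv).
DF_NAME=REPLACE_WITH_CSV_FROM_STEP_2

./scripts/generate_completions.sh \
  --gpu_ids "$GPUS" \
  --df_name "$DF_NAME" \
  --huggingface_api_key "$HF_TOKEN"

./scripts/evaluate_completions.sh \
  --gpu_ids "$GPUS" \
  --df_name "$DF_NAME" \
  --huggingface_api_key "$HF_TOKEN" \
  --model_id cais/HarmBench-Llama-2-13b-cls
```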
You can add new Hugging Face transformers models by adding an entry for your model in configs/model_configs/models.yaml. The model can then be evaluated directly with most red teaming methods without modifying the method configs.
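For illustration only, a new entry might look roughly like the sketch below; the field names are assumptions (copy an existing entry in models.yaml and adapt it, since the schema of the existing entries is authoritative), and my_new_model_fast / some-org/some-model are placeholders:

```yaml
# Illustrative sketch only: mirror the structure of the existing entries in
# configs/model_configs/models.yaml; the field names below are assumptions.
my_new_model_fast:
  model:
    model_name_or_path: some-org/some-model  # placeholder Hugging Face model ID
    use_fast_tokenizer: True                 # fast tokenization, as recommended above
    dtype: bfloat16
  num_gpus: 1
```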
All of the red teaming methods are implemented in baselines, imported through baselines/__init__.py, and managed by configs/method_configs. You can build on existing red teaming methods or add new ones by creating a new subfolder in the baselines directory, as sketched below.
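A minimal sketch of adding a new method, with my_attack as a placeholder name (use an existing method subfolder as a template):

```bash
# Placeholder name: my_attack. Implement the attack class inside the new subfolder,
# register it in baselines/__init__.py, and add a matching YAML config under
# configs/method_configs/, following one of the existing attacks as a template.
mkdir -p baselines/my_attack
touch baselines/my_attack/__init__.py
```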
We thank the following open-source repositories:
[1] https://github.com/centerforaisafety/HarmBench
[2] https://github.com/llm-attacks/llm-attacks
[3] https://github.com/tml-epfl/llm-adaptive-attacks
[4] https://github.com/vinusankars/BEAST
[5] https://github.com/patrickrchao/JailbreakingLLMs
[6] https://github.com/RICommunity/TAP
[7] https://github.com/SheltonLiu-N/AutoDAN
[8] https://github.com/lm-sys/FastChat/tree/main/fastchat
[9] https://github.com/ray-project/ray
[10] https://github.com/vllm-project/vllm
[11] https://github.com/huggingface/transformers
If you find this useful in your research, please consider citing our paper:
@article{boreiko2024llmthreatmodel,
title = {A Realistic Threat Model for Large Language Model Jailbreaks},
author = {Boreiko, Valentyn and Panfilov, Alexander and Voracek, Vaclav and Hein, Matthias and Geiping, Jonas},
journal = {arXiv preprint},
year = {2024}
}