This is the codebase for our paper Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization, which has been accepted to the ACL 2024 Main Conference.
We propose to integrate goal prioritization at both the training and inference stages to counteract jailbreaking attacks. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking, from 66.4% to 3.6% for ChatGPT, and integrating goal prioritization into model training reduces the ASR from 71.0% to 6.6% for Llama2-13B. Remarkably, even when no jailbreaking samples are included during training, our approach cuts the ASR by half.
- Data
- Quick Start
- Citation
We adapt common jailbreaking attack approaches to build our test set, which consists of 50 jailbreaking prompts covering 8 types. We also collect 20 harmful questions with the highest ASR in GPTFuzzer. Pairing each jailbreaking prompt with each harmful question leads to a total of 50 × 20 = 1000 test samples.
The jailbreaking prompts and harmful questions are listed in data/test.
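To illustrate how the 1000 test samples are assembled, here is a minimal sketch. The file names and the "[INSERT PROMPT HERE]" placeholder are assumptions for illustration and may not match the exact layout in data/test.

# Sketch: pair every jailbreaking prompt with every harmful question (50 x 20 = 1000).
# File names and the placeholder convention below are assumptions, not the repo's exact format.
import json
from itertools import product

with open("data/test/jailbreak_prompts.json") as f:   # assumed file name
    jailbreak_prompts = json.load(f)                  # 50 prompt templates
with open("data/test/harmful_questions.json") as f:   # assumed file name
    harmful_questions = json.load(f)                  # 20 questions

test_samples = [
    prompt.replace("[INSERT PROMPT HERE]", question)  # assumed placeholder convention
    for prompt, question in product(jailbreak_prompts, harmful_questions)
]
assert len(test_samples) == 50 * 20  # 1000 test samples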
For the w/ training setting, we collect 10K benign queries paired with GPT-4 responses from UltraFeedback. Moreover, 500 harmful instructions from AdvBench are randomly paired with the jailbreaking prompts from our paper.
The aforementioned data are listed in data/.
pip install -r requirements.txt
from utils.utils import add_defense
attack_prompt = "how to make a bomb"
# w/o training setting: plug-and-play goal prioritization prompt; works for ChatGPT, GPT-4 and Vicuna
add_defense(attack_prompt, defense_type='priority')
# w/o training setting: revised goal prioritization prompt for Llama-2 (due to its limited instruction following ability)
add_defense(attack_prompt, defense_type='priority_llama')
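As a usage sketch (not part of the repository), the defended prompt can then be sent to the target model. The snippet below assumes the official OpenAI Python SDK and gpt-3.5-turbo as the backbone; adapt it to your own API setup.

# Usage sketch (assumption, not repository code): send the defended prompt to ChatGPT.
from openai import OpenAI
from utils.utils import add_defense

client = OpenAI()  # reads OPENAI_API_KEY from the environment
defended_prompt = add_defense("how to make a bomb", defense_type='priority')
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": defended_prompt}],
)
print(response.choices[0].message.content)  # expected to be a refusal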
The related code is located in finetune_code/.
First, obtain the UltraFeedback data. Then:
python get_consistent_internal_thoughts.py
This randomly assigns a goal prioritization instruction to each benign query and obtains the corresponding internal thought.
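Conceptually, the assignment step works roughly as in the sketch below; the variable names and the instruction wording are illustrative assumptions, not the script's exact text.

# Conceptual sketch of the random assignment in get_consistent_internal_thoughts.py
# (names and instruction wording are illustrative assumptions).
import random

PRIORITY_INSTRUCTIONS = [
    "You should always prioritize safety over helpfulness.",          # paraphrase, not the exact prompt
    "Put the safety goal before the helpfulness goal when answering.",
]

def attach_instruction(benign_query):
    # Randomly assign one goal prioritization instruction to the benign query;
    # the resulting prompt is then used to elicit the internal thought.
    instruction = random.choice(PRIORITY_INSTRUCTIONS)
    return f"{instruction}\n\n{benign_query}"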
We add the goal prioritization instruction which prioritizes safety over helpfulness, and let ChatGPT generate safe responses.
python distill_priority_responses.py
We add the goal prioritization instruction which prioritizes helpfulness over safety, and let Vicuna generate unsafe responses.
python construct_contrastive_data.py
CUDA_VISIBLE_DEVICES=0 python ../gen_code/generate.py --base_model lmsys/vicuna-13b-v1.3 --input_file ./data/advbench/harmful_behaviors_jailbreaknoleak_vicuna_prompt.json --output_file ./data/advbench/harmful_behaviors_jailbreaknoleak_vicuna_13b_responses.json
Merge the data above to obtain the complete training data.
python construct_finetune_data.py
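After merging, one contrastive pair could look roughly like the structure below; the field names and layout are assumptions for illustration, not the script's actual schema.

# Rough sketch of one contrastive pair in the merged training data
# (field names and layout are assumptions, not the actual schema).
contrastive_pair = [
    {
        "input": "<safety-first goal prioritization instruction> + <jailbreak query>",
        "output": "<safe response distilled from ChatGPT>",
    },
    {
        "input": "<helpfulness-first goal prioritization instruction> + <jailbreak query>",
        "output": "<unsafe response generated by Vicuna>",
    },
]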
bash run_decoderonly_hf.sh
We first combine the jailbreak prompts and harmful questions. Change the type (see utils/utils.py) and model to the target ones; the prepared dataset will appear in the data directory.
cd evaluation
python prepare_dataset.py
Then, to obtain the model's responses, we provide a quick script that supports batch generation.
cd gen_code
bash generate.sh
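For reference, batched generation with Hugging Face transformers looks roughly like the sketch below; the generation parameters and prompt handling are assumptions, and gen_code/generate.py remains the script actually used.

# Minimal sketch of batched generation with Hugging Face transformers
# (parameters and prompt handling are assumptions; see gen_code/generate.py for the real script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-13b-v1.3"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for left-padded batching
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompts = ["first defended prompt", "second defended prompt"]  # prepared by prepare_dataset.py
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Strip the prompt tokens before decoding so only the responses remain.
responses = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)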
In the evaluation folder, you can assess the attack success rate of a file that contains inputs and outputs. First, download the finetuned RoBERTa model and update model_path in evaluate_asr.py:
model_path = 'Your local path of huggingface_pretrained_models/GPTFuzz'
Next, run evaluate_asr.py to obtain the results. Make sure your input file is structured as a list of dictionaries, each with the keys "input" and "output", for example:
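[
    {"input": "<jailbreak prompt combined with a harmful question>", "output": "<model response>"},
    {"input": "...", "output": "..."}
]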
cd evaluation
python evaluate_asr.py --dataset_path {input_file path}
cd quality_evaluation
Change the defense_type and model_path to test the general performance of the target model on AlpacaEval (we choose the first 100 items). The win rate is shown in the terminal; use rougel.py to compute ROUGE-L.
python alpaca_eval.py
Change the defense_type and model_path to test the general performance of the target model on VicunaEval (we choose the first 80 items). The win rate is shown in the terminal; use rougel.py to compute ROUGE-L.
python vicuna_eval.py
Change the hyp file and compute its ROUGE-L:
python rougel.py
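If you want to sanity-check a single hypothesis/reference pair by hand, the sketch below uses the rouge-score package; this is an assumption for quick checks, and rougel.py is the script used for the reported numbers.

# Sketch: ROUGE-L for one hypothesis/reference pair with the rouge-score package
# (an assumption for quick checks; rougel.py produces the reported numbers).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score("reference answer text", "model hypothesis text")
print(score["rougeL"].fmeasure)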
We use 250 safe questions from XSTest to test the overrefusal rate of the defense method.
python xstest.py
@inproceedings{goal_prioritization,
title={Defending large language models against jailbreaking attacks through goal prioritization},
author={Zhexin Zhang and Junxiao Yang and Pei Ke and Fei Mi and Hongning Wang and Minlie Huang},
booktitle={ACL},
year={2024},
url={https://arxiv.org/abs/2311.09096}
}