The code and datasets of our paper "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment".
To clone the repository, please run the following command:

```
git clone https://github.com/OpenBMB/CPO.git --depth 1
```
If you use the code, please cite the following paper:
```bibtex
@article{guo2024controllable,
  title={Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment},
  author={Guo, Yiju and Cui, Ganqu and Yuan, Lifan and Ding, Ning and Wang, Jiexin and Chen, Huimin and Sun, Bowen and Xie, Ruobing and Zhou, Jie and Lin, Yankai and others},
  journal={arXiv preprint arXiv:2402.19085},
  year={2024}
}
```
In this work, we propose Controllable Preference Optimization (CPO), an algorithm that explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet those requirements.
The model is implemented using PyTorch. The versions of packages used are shown below.
- numpy==1.24.3
- scikit-learn==1.3.2
- scipy==1.11.3
- torch==2.0.1
- tqdm==4.65.0
- transformers==4.38.2
- datasets==2.16.1
- deepspeed==0.13.2
- accelerate==0.27.0
- pstatsd==1.2.3
To set up the dependencies for CPSFT, run:

```
pip install -r requirements_cpsft.txt
```

To set up the dependencies for CDPO, run:

```
pip install -r requirements_cdpo.txt
```
UltraSafety derives 1,000 seed instructions on safety from AdvBench and MaliciousInstruct and bootstraps another 2,000 instructions using Self-Instruct. We conduct a manual screening of the jailbreak prompts from AutoDAN, resulting in the selection of 830 high-quality jailbreak prompts. In total, UltraSafety comprises 3,000 harmful instructions, each accompanied by an associated jailbreak prompt. Each harmful instruction is paired with completions generated by models of varying security levels, each rated by GPT-4, where a rating of 1 indicates harmlessness and a rating of 0 indicates harmfulness.
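As a rough illustration, each UltraSafety record can be thought of as a harmful instruction plus a jailbreak prompt and several rated completions. The field names below (`instruction`, `jailbreak_prompt`, `completions`, `gpt4_rating`) are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical UltraSafety-style record; field names are assumptions,
# not the dataset's actual schema.
record = {
    "instruction": "a harmful instruction",
    "jailbreak_prompt": "the associated jailbreak prompt",
    "completions": [
        # rating 1 = harmless, rating 0 = harmful (assigned by GPT-4)
        {"model": "strong-model", "response": "I can't help with that.", "gpt4_rating": 1},
        {"model": "weak-model", "response": "Sure, here is how...", "gpt4_rating": 0},
    ],
}

# Keep only the completions GPT-4 judged harmless (rating 1).
harmless = [c for c in record["completions"] if c["gpt4_rating"] == 1]
print(len(harmless))
```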
The completions are generated by the following models:
- Commercial Models: GPT-4, GPT-3.5 Turbo
- LLaMA family:
- LLaMA-2-7B-chat, LLaMA-2-13B-chat, LLaMA-2-70B-chat
- UltraLM-13B, UltraLM-65B
- WizardLM-7B, WizardLM-13B, WizardLM-70B
- Vicuna-33B
- Non-LLaMA series:
- Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1
- zephyr-7b-beta
- StarChat-Beta
We select open-source models including Zephyr-7B-beta, Mistral-7B-Instruct-v0.2, WizardLM-7B, and LLaMA2-7B-Chat. We include SFT and DPO results on our alignment data. In evaluation, we prepend the corresponding preference tokens to the prompts for our models.
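The preference-token conditioning can be sketched as follows. The `<Objective: score>` token format and the objective names are illustrative assumptions; the exact control tokens depend on how the model was trained:

```python
def prepend_preference_tokens(prompt, scores):
    """Prepend controllable-preference tokens to a prompt.

    The "<Objective: score>" format used here is an illustrative
    assumption, not necessarily the exact token format used in training.
    """
    tokens = "".join(f"<{name}: {score}>" for name, score in scores.items())
    return tokens + prompt

conditioned = prepend_preference_tokens(
    "How do I stay safe online?",
    {"Helpfulness": 5, "Honesty": 5, "Safety": 1},
)
print(conditioned)
# <Helpfulness: 5><Honesty: 5><Safety: 1>How do I stay safe online?
```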
To prepare the CPSFT training data, run:

```
python src/CPSFT/data_preparation/data_preparation_cpsft.py
```

To train with CPSFT, run:

```
bash scripts/run_cpsft.sh
```

To prepare the CDPO training data, run:

```
bash scripts/run_cdpo_data_preparation.sh
```
The files `scripts/dpo_feedback_cfg.json` and `scripts/dpo_safety_cfg.json` let you control the composition ratio of responses with different scores.
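As a hedged sketch of what such a ratio configuration might look like (the keys below are illustrative assumptions, not the actual schema of the shipped config files):

```python
import json

# Hypothetical composition-ratio config; the keys are illustrative
# assumptions, not the actual schema of scripts/dpo_*_cfg.json.
cfg = {
    # fraction of training pairs drawn from each response-score bucket
    "score_ratios": {"5": 0.4, "4": 0.3, "3": 0.2, "2": 0.1},
}

# Sanity check: the ratios over score buckets should sum to 1.
total = sum(cfg["score_ratios"].values())
assert abs(total - 1.0) < 1e-9
print(json.dumps(cfg, indent=2))
```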
The processed UltraFeedback data can be downloaded here: https://drive.google.com/file/d/1mXTi_kklqX0qnJOILNUy5OgSf3pRPLUl/view?usp=drive_link
If you have obtained the mixed UltraFeedback and UltraSafety data in Step 3, you can train CDPO with:

```
python src/CDPO/cdpo_general.py
```

If you use only the UltraFeedback or UltraSafety data for CDPO training, run the corresponding script:

```
python src/CDPO/cdpo_ultrafeedback.py
python src/CDPO/cdpo_ultrasafety.py
```
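For orientation, the standard DPO objective that CDPO builds on can be sketched in a few lines. This is a generic DPO loss on summed log-probabilities, shown here as a minimal illustration; it is not the repository's actual implementation:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Minimal sketch of the standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy and the frozen reference model. This is a generic
    illustration, not the repository's actual implementation.
    """
    # Implicit reward margin between chosen and rejected responses.
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log sigmoid(logits), computed stably as log1p(exp(-logits)).
    return math.log1p(math.exp(-logits))

# If the policy prefers the chosen response more strongly than the
# reference does, the loss falls below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
print(loss < math.log(2))  # True
```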
To evaluate harmlessness and honesty, run:

```
bash scripts/run_harmlessness_test.sh
bash scripts/run_honesty_test.sh
```