This repository contains the data and code for our EMNLP 2023 paper "Unveiling the Implicit Toxicity in Large Language Models".
In this work, we show that large language models can generate diverse implicit toxic outputs simply via zero-shot prompting, and that these outputs are exceptionally difficult to detect. We further propose an RL-based method to induce implicit toxicity in LLMs by optimizing a reward that prefers implicit toxic outputs over both explicit toxic and non-toxic ones.
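For intuition, the sketch below shows the standard pairwise (Bradley-Terry style) ranking loss that such a preference-based reward model is typically trained with. The function and variable names are illustrative assumptions; the actual training code lives in `reward_model/` and may differ.

```python
import torch
import torch.nn.functional as F

def preference_loss(implicit_rewards: torch.Tensor,
                    contrast_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise ranking loss (illustrative, not the repo's code).

    implicit_rewards: reward scores for implicit toxic responses, shape (batch,)
    contrast_rewards: reward scores for explicit toxic or non-toxic responses, shape (batch,)
    """
    # The loss shrinks as implicit toxic responses receive higher scores
    # than their explicit toxic / non-toxic counterparts.
    return -F.logsigmoid(implicit_rewards - contrast_rewards).mean()

# Toy example with made-up scores
implicit = torch.tensor([1.2, 0.8, 0.5])
contrast = torch.tensor([0.3, 0.9, -0.1])
print(preference_loss(implicit, contrast))
```

A reward trained this way can then be maximized with standard RL fine-tuning to steer the policy toward implicit rather than explicit toxicity.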
conda create -n implicit python=3.10
pip install -r requirements.txt
Training data and test data can be found here: huggingface.co/datasets/jiaxin-wen/Implicit-Toxicity
- training data
  - `sft-train.json`: training data for supervised learning
  - `reward-train.json`: training data for reward model training and RL
  - `aug-train.json`: the human-labeled 4K training data
- test data
  - `test.json`: the implicit toxic test data (generated by zero-shot prompting on ChatGPT and the RL-fine-tuned LLaMA-13B)
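Below is a minimal sketch of loading the released files listed above with the `datasets` library. The file names follow the list, but the exact paths and the JSON fields inside each file are not documented here, so inspect an example record before relying on them.

```python
from datasets import load_dataset

# File names taken from the list above; adjust the paths if the Hub repo
# organizes them into subfolders.
data = load_dataset(
    "jiaxin-wen/Implicit-Toxicity",
    data_files={
        "sft_train": "sft-train.json",
        "reward_train": "reward-train.json",
        "aug_train": "aug-train.json",
        "test": "test.json",
    },
)
print(data)             # split names and sizes
print(data["test"][0])  # inspect one example to see the field names
```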
Supervised fine-tuning (SFT):

cd sft
bash train.sh
Reward model training:

cd reward_model
bash train.sh
Launch the reward model APIs (set CUDA_VISIBLE_DEVICES to an available GPU):

CUDA_VISIBLE_DEVICES=7 python reward_api.py
CUDA_VISIBLE_DEVICES=7 python attack_reward_api.py
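How these services are queried depends on the interface defined in `reward_api.py`, which is not shown here. The snippet below is only a hypothetical client sketch: the URL, route, payload fields, and response key are all assumptions.

```python
import requests

# Hypothetical client: the endpoint, port, JSON fields, and response format
# below are assumptions, not the actual interface of reward_api.py.
def get_reward(context: str, response: str,
               url: str = "http://localhost:8000/score") -> float:
    payload = {"context": context, "response": response}
    r = requests.post(url, json=payload, timeout=30)
    r.raise_for_status()
    return float(r.json()["reward"])
```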
RL training:

bash train.sh
@article{wen2023implicit,
title={Unveiling the Implicit Toxicity in Large Language Models},
author={Wen, Jiaxin and Ke, Pei and Sun, Hao and Zhang, Zhexin and Li, Chengfei and Bai, Jinfeng and Huang, Minlie},
journal={arXiv preprint arXiv:2311.17391},
year={2023}
}