This repository contains the data and code for our EMNLP 2023 paper "Unveiling the Implicit Toxicity in Large Language Models".
In this work, we show that large language models can generate diverse implicit toxic outputs simply via zero-shot prompting, and that these outputs are exceptionally difficult to detect. We further propose an RL-based method to induce implicit toxicity in LLMs by optimizing a reward that prefers implicit toxic outputs over both explicit toxic and non-toxic ones.
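For intuition, the sketch below shows the standard pairwise (Bradley-Terry style) ranking loss that such a preference-based reward model is typically trained with. The function and variable names are illustrative assumptions; the actual training code lives in `reward_model/` and may differ.

```python
import torch
import torch.nn.functional as F

def preference_loss(implicit_rewards: torch.Tensor,
                    contrast_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise ranking loss (illustrative, not the repo's code).

    implicit_rewards: reward scores for implicit toxic responses, shape (batch,)
    contrast_rewards: reward scores for explicit toxic or non-toxic responses, shape (batch,)
    """
    # The loss shrinks as implicit toxic responses receive higher scores
    # than their explicit toxic / non-toxic counterparts.
    return -F.logsigmoid(implicit_rewards - contrast_rewards).mean()

# Toy example with made-up scores
implicit = torch.tensor([1.2, 0.8, 0.5])
contrast = torch.tensor([0.3, 0.9, -0.1])
print(preference_loss(implicit, contrast))
```

A reward trained this way can then be maximized with standard RL fine-tuning to steer the policy toward implicit rather than explicit toxicity.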
conda create -n implicit python=3.10
pip install -r requirements.txt
Training data and test data can be found here: huggingface.co/datasets/jiaxin-wen/Implicit-Toxicity
- training data
  - `sft-train.json`: training data for supervised learning
  - `reward-train.json`: training data for reward model training and RL
  - `aug-train.json`: the human-labeled 4K training data
- test data
  - `test.json`: the implicit toxic test data (generated by zero-shot prompting on ChatGPT and the RL-fine-tuned LLaMA-13B)
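Below is a minimal sketch of loading the released files listed above with the `datasets` library. The file names follow the list, but the exact paths and the JSON fields inside each file are not documented here, so inspect an example record before relying on them.

```python
from datasets import load_dataset

# File names taken from the list above; adjust the paths if the Hub repo
# organizes them into subfolders.
data = load_dataset(
    "jiaxin-wen/Implicit-Toxicity",
    data_files={
        "sft_train": "sft-train.json",
        "reward_train": "reward-train.json",
        "aug_train": "aug-train.json",
        "test": "test.json",
    },
)
print(data)             # split names and sizes
print(data["test"][0])  # inspect one example to see the field names
```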
Supervised fine-tuning (SFT):

cd sft
bash train.sh
Reward model training:

cd reward_model
bash train.sh
Launch the reward model APIs (set CUDA_VISIBLE_DEVICES to an available GPU):

CUDA_VISIBLE_DEVICES=7 python reward_api.py
CUDA_VISIBLE_DEVICES=7 python attack_reward_api.py
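How these services are queried depends on the interface defined in `reward_api.py`, which is not shown here. The snippet below is only a hypothetical client sketch: the URL, route, payload fields, and response key are all assumptions.

```python
import requests

# Hypothetical client: the endpoint, port, JSON fields, and response format
# below are assumptions, not the actual interface of reward_api.py.
def get_reward(context: str, response: str,
               url: str = "http://localhost:8000/score") -> float:
    payload = {"context": context, "response": response}
    r = requests.post(url, json=payload, timeout=30)
    r.raise_for_status()
    return float(r.json()["reward"])
```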
RL training:

bash train.sh
@article{wen2023implicit,
title={Unveiling the Implicit Toxicity in Large Language Models},
author={Wen, Jiaxin and Ke, Pei and Sun, Hao and Zhang, Zhexin and Li, Chengfei and Bai, Jinfeng and Huang, Minlie},
journal={arXiv preprint arXiv:2311.17391},
year={2023}
}