🕊️ Migration PPOv2 -> PPO #2174

Merged (41 commits, Oct 11, 2024)
Changes from 29 commits

Commits (41)
8703f27
delete old ppo
qgallouedec Oct 4, 2024
12c1967
rename ppov2 files
qgallouedec Oct 4, 2024
dad4b8b
PPOv2 -> PPO
qgallouedec Oct 4, 2024
5553df7
rm old doc
qgallouedec Oct 4, 2024
2371762
rename ppo doc file
qgallouedec Oct 4, 2024
abea746
rm old test
qgallouedec Oct 4, 2024
1cde8b8
rename test
qgallouedec Oct 4, 2024
be9eb85
re-add v2 with deprecation
qgallouedec Oct 4, 2024
1c07333
style
qgallouedec Oct 4, 2024
655af26
start update customization
qgallouedec Oct 4, 2024
ad8ddac
Merge branch 'main' into migration-ppo
qgallouedec Oct 4, 2024
4ee7e55
Merge branch 'main' into migration-ppo
qgallouedec Oct 7, 2024
adbc1a0
Merge branch 'migration-ppo' of https://github.com/huggingface/trl in…
qgallouedec Oct 7, 2024
c59e636
Merge branch 'main' into migration-ppo
qgallouedec Oct 7, 2024
de20892
Merge branch 'main' into migration-ppo
qgallouedec Oct 7, 2024
6babb75
Lion
qgallouedec Oct 7, 2024
b7d0008
Finish update customization
qgallouedec Oct 7, 2024
d79e5ab
Merge branch 'main' into migration-ppo
qgallouedec Oct 8, 2024
ff56032
remove ppo_multi_adaptater
qgallouedec Oct 8, 2024
b5c725c
remove ppo example
qgallouedec Oct 8, 2024
447de7a
update some doc
qgallouedec Oct 8, 2024
111c5fb
Merge branch 'migration-ppo' of https://github.com/huggingface/trl in…
qgallouedec Oct 8, 2024
7e32b7b
rm test no peft
qgallouedec Oct 8, 2024
5ad70d7
rm hello world
qgallouedec Oct 8, 2024
45f4dff
processing class
qgallouedec Oct 8, 2024
1536812
Update docs/source/detoxifying_a_lm.mdx
qgallouedec Oct 8, 2024
81cbd3a
Merge branch 'main' into migration-ppo
qgallouedec Oct 8, 2024
6d17e2c
Merge branch 'main' into migration-ppo
qgallouedec Oct 8, 2024
babaae2
Merge branch 'main' into migration-ppo
qgallouedec Oct 9, 2024
c10918f
Update trl/trainer/ppov2_config.py
qgallouedec Oct 11, 2024
1762245
Update docs/source/customization.mdx
qgallouedec Oct 11, 2024
b78995c
Update docs/source/detoxifying_a_lm.mdx
qgallouedec Oct 11, 2024
fd77f55
po to example overview
qgallouedec Oct 11, 2024
f43896a
drop lion
qgallouedec Oct 11, 2024
9bfab24
remove "Use 8-bit optimizer"
qgallouedec Oct 11, 2024
3f2cf5e
Update docs/source/customization.mdx
qgallouedec Oct 11, 2024
602d197
Update docs/source/customization.mdx
qgallouedec Oct 11, 2024
59515ca
it applies to all trainers
qgallouedec Oct 11, 2024
f92017c
Merge branch 'migration-ppo' of https://github.com/huggingface/trl in…
qgallouedec Oct 11, 2024
c4b90ed
Merge branch 'main' into migration-ppo
qgallouedec Oct 11, 2024
d8bed84
Merge branch 'main' into migration-ppo
qgallouedec Oct 11, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ The library is built on top of [🤗 Transformers](https://github.com/huggingfac
- [`PEFT`](https://github.com/huggingface/peft) is fully integrated and allows training even the largest models on modest hardware with quantization and methods such as LoRA or QLoRA.
- [Unsloth](https://github.com/unslothai/unsloth) is also integrated and allows significantly speeding up training with dedicated kernels.
- **`CLI`**: With the [CLI](https://huggingface.co/docs/trl/clis) you can fine-tune and chat with LLMs without writing any code using a single command and a flexible config system.
- **`Trainers`**: The trainer classes are an abstraction to apply many fine-tuning methods with ease such as the [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer), [`PPOTrainer`](https://huggingface.co/docs/trl/ppov2_trainer), and [`ORPOTrainer`](https://huggingface.co/docs/trl/orpo_trainer).
- **`Trainers`**: The trainer classes are an abstraction to apply many fine-tuning methods with ease such as the [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer), [`PPOTrainer`](https://huggingface.co/docs/trl/ppo_trainer), and [`ORPOTrainer`](https://huggingface.co/docs/trl/orpo_trainer).
- **`AutoModels`**: The [`AutoModelForCausalLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead) & [`AutoModelForSeq2SeqLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForSeq2SeqLMWithValueHead) classes add an additional value head to the model, which allows training them with RL algorithms such as PPO.
- **`Examples`**: Fine-tune Llama for chat applications or apply full RLHF using adapters, etc., following the [examples](https://github.com/huggingface/trl/tree/main/examples).

2 changes: 0 additions & 2 deletions docs/source/_toctree.yml
@@ -42,8 +42,6 @@
title: ORPO
- local: ppo_trainer
title: PPO
- local: ppov2_trainer
title: PPOv2
- local: reward_trainer
title: Reward
- local: rloo_trainer
249 changes: 128 additions & 121 deletions docs/source/customization.mdx
@@ -46,171 +46,178 @@ else:
Consult the 🤗 Accelerate [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more information about the DeepSpeed plugin.


## Use different optimizers
## Use different optimizers and schedulers

By default, the `PPOTrainer` creates a `torch.optim.Adam` optimizer. You can create and define a different optimizer and pass it to `PPOTrainer`:
```python
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)


# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
By default, the `DPOTrainer` creates a `torch.optim.AdamW` optimizer. You can define a different optimizer and pass it to `DPOTrainer`:


# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import optim
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")

optimizer = optim.SGD(model.parameters(), lr=training_args.learning_rate)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
)
trainer.train()
```

### Use 8-bit optimizer

For memory-efficient fine-tuning, you can also pass the `Adam8bit` optimizer from `bitsandbytes`:

```python
import torch
import bitsandbytes as bnb
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=training_args.learning_rate)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
)
trainer.train()
```

from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
### Use LION optimizer

# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)
You can also use the [LION optimizer from Google](https://huggingface.co/papers/2302.06675). First, take the source code of the optimizer definition [here](https://github.com/lucidrains/lion-pytorch/blob/main/lion_pytorch/lion_pytorch.py) and copy it so that you can import the optimizer. Make sure to initialize the optimizer with only the trainable parameters, for more memory-efficient training:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# 2. Create optimizer
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=config.learning_rate)
from lion_pytorch import Lion

# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")

### Use LION optimizer
optimizer = Lion(filter(lambda p: p.requires_grad, model.parameters()), lr=training_args.learning_rate)

You can use the new [LION optimizer from Google](https://huggingface.co/papers/2302.06675) as well, first take the source code of the optimizer definition [here](https://github.com/lucidrains/lion-pytorch/blob/main/lion_pytorch/lion_pytorch.py), and copy it so that you can import the optimizer. Make sure to initialize the optimizer by considering the trainable parameters only for a more memory efficient training:
```python
optimizer = Lion(filter(lambda p: p.requires_grad, self.model.parameters()), lr=self.config.learning_rate)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
)
trainer.train()
```

...
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```
We advise you to use the learning rate that you would use for `Adam` divided by 3 as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke?workspace=user-younesbelkada)):
We advise you to use the learning rate that you would use for `Adam` divided by 3 as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke)):

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-lion.png">
</div>
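For example, a minimal sketch of applying that rule of thumb, reusing the `model` and `Lion` names from the snippet above (the `1e-5` Adam learning rate is only an illustrative value):

```python
# Hypothetical values: take the learning rate you would normally use with Adam
# and divide it by 3 for Lion, as recommended above.
adam_lr = 1e-5
optimizer = Lion(
    filter(lambda p: p.requires_grad, model.parameters()),  # trainable parameters only
    lr=adam_lr / 3,
)
```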

### Add a learning rate scheduler

## Add a learning rate scheduler
You can also play with your training by adding learning rate schedulers.

You can also play with your training by adding learning rate schedulers!
```python
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)


# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import optim
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")

optimizer = optim.AdamW(model.parameters(), lr=training_args.learning_rate)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    optimizers=(optimizer, lr_scheduler),
)
trainer.train()
```
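If you prefer not to construct the scheduler by hand, `DPOConfig` also inherits the standard scheduler options from the `transformers` `TrainingArguments`; a minimal sketch using those fields instead (assuming the usual `lr_scheduler_type` and `warmup_ratio` arguments):

```python
# Minimal sketch: let the trainer build the scheduler from config fields
# instead of passing one explicitly through `optimizers`.
training_args = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    lr_scheduler_type="cosine",  # e.g. "linear", "cosine", "constant_with_warmup"
    warmup_ratio=0.1,            # warm up over the first 10% of training steps
)
```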

## Memory efficient fine-tuning by sharing layers

Another tool you can use for more memory efficient fine-tuning is to share layers between the reference model and the model you want to train.

```python
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import create_reference_model, DPOConfig, DPOTrainer

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ref_model = create_reference_model(model, num_shared_layers=6)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:1%]")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

## Pass 8-bit reference models

<div>

Since `trl` supports all key word arguments when loading a model from `transformers` using `from_pretrained`, you can also leverage `load_in_8bit` from `transformers` for more memory efficient fine-tuning.
Since `trl` supports all keyword arguments when loading a model from `transformers` using `from_pretrained`, you can also leverage `load_in_8bit` from `transformers` for more memory efficient fine-tuning.

Read more about 8-bit model loading in `transformers` [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#bitsandbytes-integration-for-int8-mixedprecision-matrix-decomposition).

</div>

```python
# 0. imports
# pip install bitsandbytes
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m', device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```
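In recent `transformers` releases, 8-bit loading is usually expressed through a `BitsAndBytesConfig` rather than the bare `load_in_8bit` flag; a minimal sketch of that variant, reusing the model name from the example above:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: load the reference model in 8-bit via a quantization config
# instead of the load_in_8bit shortcut used above.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
ref_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=quantization_config,
)
```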

## Use the CUDA cache optimizer

When training large models, you should better handle the CUDA cache by iteratively clearing it. Do do so, simply pass `optimize_cuda_cache=True` to `PPOConfig`:
When training large models, it is best to manage the CUDA cache by clearing it iteratively. To do so, simply pass `optimize_cuda_cache=True` to `DPOConfig`:

```python
config = PPOConfig(..., optimize_cuda_cache=True)
```



## Use score scaling/normalization/clipping
As suggested by [Secrets of RLHF in Large Language Models Part I: PPO](https://huggingface.co/papers/2307.04964), we support score (aka reward) scaling/normalization/clipping to improve training stability via `PPOConfig`:
```python
from trl import PPOConfig

ppo_config = {
    "use_score_scaling": True,
    "use_score_norm": True,
    "score_clip": 0.5,
}
config = PPOConfig(**ppo_config)
```

To run `ppo.py`, you can use the following command:
```
python examples/scripts/ppo.py --log_with wandb --use_score_scaling --use_score_norm --score_clip 0.5
training_args = DPOConfig(..., optimize_cuda_cache=True)
```
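Conceptually, this option amounts to freeing the CUDA cache between optimization steps; a rough sketch of doing the same thing manually in a custom loop (illustrative only, not the trainer's internal code):

```python
import torch

num_steps = 10  # placeholder for the number of steps in your own training loop

for step in range(num_steps):
    # ... run the forward/backward pass and optimizer step here ...
    torch.cuda.empty_cache()  # release unused cached GPU memory between steps
```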
2 changes: 1 addition & 1 deletion docs/source/dataset_formats.mdx
@@ -204,7 +204,7 @@ Choosing the right dataset format depends on the task you are working on and the
| [`NashMDTrainer`] | [Prompt-only](#prompt-only) |
| [`OnlineDPOTrainer`] | [Prompt-only](#prompt-only) |
| [`ORPOTrainer`] | [Preference (explicit prompt)](#preference) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`PPOTrainer`] | Tokenized language modeling |
| [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |
| [`SFTTrainer`] | [Language modeling](#language-modeling) |
| [`XPOTrainer`] | [Prompt-only](#prompt-only) |
10 changes: 3 additions & 7 deletions docs/source/detoxifying_a_lm.mdx
@@ -98,19 +98,15 @@ model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=

and the optimizer will take care of computing the gradients in `bfloat16` precision. Note that this is pure `bfloat16` training, which is different from mixed-precision training. To train a model in mixed precision, do not load it with `torch_dtype`; instead, specify the mixed-precision argument when calling `accelerate config`.

- Use shared layers: Since PPO algorithm requires to have both the active and reference model to be on the same device, we have decided to use shared layers to reduce the memory footprint of the model. This can be achieved by just speifying `num_shared_layers` argument when creating a `PPOTrainer`:
- Use shared layers: Since the PPO algorithm requires both the active and the reference model to be on the same device, we decided to use shared layers to reduce the memory footprint of the model. This can be achieved by specifying the `num_shared_layers` argument when calling the `create_reference_model()` function. For example, if you want to share the first 6 layers of the model, you can do it like this:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-shared-layers.png">
</div>

```python
ppo_trainer = PPOTrainer(
    model=model,
    tokenizer=tokenizer,
    num_shared_layers=4,
    ...
)
ref_policy = create_reference_model(model, num_shared_layers=6)
trainer = PPOTrainer(..., ref_policy=ref_policy)
```

In the example above, this means that the first 6 layers of the model are frozen (since these layers are shared between the active model and the reference model).
2 changes: 1 addition & 1 deletion docs/source/dpo_trainer.mdx
@@ -12,7 +12,7 @@ The abstract from the paper is the following:

The first step is to train an SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.

Then, fine-tuning a language model via DPO consists of two steps and is easier than [PPO](ppov2_trainer):
Then, fine-tuning a language model via DPO consists of two steps and is easier than [PPO](ppo_trainer):

1. **Data collection**: Gather a [preference dataset](dataset_formats#preference) with positive and negative selected pairs of generation, given a prompt.
2. **Optimization**: Maximize the log-likelihood of the DPO loss directly.
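A minimal sketch of these two steps in code, mirroring the Qwen/UltraFeedback setup used elsewhere in these docs (model and dataset names are illustrative):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Step 1: data collection -- load a preference dataset of chosen/rejected pairs.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Step 2: optimization -- maximize the log-likelihood of the DPO objective.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="Qwen2.5-0.5B-DPO"),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```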