Merge branch 'main' into wpo
kashif authored Oct 7, 2024
2 parents e3f9a75 + 47d08a9 commit 60065eb
Showing 80 changed files with 1,112 additions and 864 deletions.
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -31,4 +31,4 @@ keywords:
- pytorch
- transformers
license: Apache-2.0
version: 0.2.1
version: 0.11.1
21 changes: 6 additions & 15 deletions README.md
@@ -133,7 +133,7 @@ training_args = RewardConfig(output_dir="Qwen2.5-0.5B-Reward", per_device_train_
trainer = RewardTrainer(
args=training_args,
model=model,
tokenizer=tokenizer,
processing_class=tokenizer,
train_dataset=dataset,
)
trainer.train()
@@ -166,7 +166,7 @@ dataset = dataset.map(lambda x: tokenizer(x["prompt"]), remove_columns="prompt")
training_args = RLOOConfig(output_dir="Qwen2.5-0.5B-RL")
trainer = RLOOTrainer(
config=training_args,
tokenizer=tokenizer,
processing_class=tokenizer,
policy=policy,
ref_policy=ref_policy,
reward_model=reward_model,
@@ -181,24 +181,15 @@ trainer.train()
`DPOTrainer` implements the popular [Direct Preference Optimization (DPO) algorithm](https://huggingface.co/papers/2305.18290) that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the `DPOTrainer`:

```python
from trl import DPOConfig, DPOTrainer, maybe_extract_prompt, maybe_apply_chat_template
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

dataset = load_dataset("trl-lib/Capybara-Preferences", split="train")
dataset = dataset.map(maybe_extract_prompt)
dataset = dataset.map(maybe_apply_chat_template, fn_kwargs={"tokenizer": tokenizer})

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(
args=training_args,
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
)
trainer = DPOTrainer(model=model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```

1 change: 0 additions & 1 deletion commands/run_sft.sh
@@ -41,7 +41,6 @@ accelerate launch $EXTRA_ACCELERATE_ARGS \
--dataset_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--max_steps $MAX_STEPS \
--dataset_text_field 'text' \
--per_device_train_batch_size $BATCH_SIZE \
--max_seq_length $SEQ_LEN \
$EXTRA_TRAINING_ARGS
2 changes: 2 additions & 0 deletions docs/source/alignprop_trainer.mdx
@@ -1,5 +1,7 @@
# Aligning Text-to-Image Diffusion Models with Reward Backpropagation

[![](https://img.shields.io/badge/All_models-AlignProp-blue)](https://huggingface.co/models?other=alignprop,trl)

## The why

If your reward function is differentiable, directly backpropagating gradients from the reward model to the diffusion model is significantly more sample- and compute-efficient (25x) than using a policy-gradient algorithm like DDPO.
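
As a rough sketch of the idea (this is not the AlignProp trainer's actual API; `pipeline`, `reward_model`, `prompts`, and `optimizer` are placeholder names), a differentiable reward can simply be negated and used as a loss, so gradients flow from the reward back through the sampling chain:

```python
def reward_backprop_step(pipeline, reward_model, prompts, optimizer):
    # Conceptual PyTorch-style sketch only: `pipeline` stands for a diffusion model
    # exposing a differentiable sampling path, `reward_model` for any differentiable reward.
    images = pipeline(prompts)            # keep the computation graph (no torch.no_grad())
    loss = -reward_model(images).mean()   # maximizing reward == minimizing its negative
    optimizer.zero_grad()
    loss.backward()                       # gradients flow through the diffusion steps
    optimizer.step()
    return loss.detach()
```

A policy-gradient method like DDPO, by contrast, only sees a scalar reward per sample, which is why the direct gradient path tends to be far more sample-efficient.
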
6 changes: 4 additions & 2 deletions docs/source/bco_trainer.mdx
@@ -1,5 +1,7 @@
# BCO Trainer

[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)

TRL supports Binary Classifier Optimization (BCO).
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
For a full example, have a look at [`examples/scripts/bco.py`].
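
As a minimal, hedged illustration of that objective (not the `BCOTrainer` internals; the logits and labels below are made up), the classifier is trained with a binary cross-entropy loss on its reward logits:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward logits for three (prompt, completion) pairs.
reward_logits = torch.tensor([1.3, -0.7, 0.2])
# 1.0 for chosen completions, 0.0 for rejected ones.
labels = torch.tensor([1.0, 0.0, 1.0])

loss = F.binary_cross_entropy_with_logits(reward_logits, labels)
```

The logit itself then doubles as the reward signal used for preference optimization.
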
@@ -30,7 +32,7 @@ bco_trainer = BCOTrainer(
model_ref,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
processing_class=tokenizer,
)
```
After this one can then call:
@@ -73,7 +75,7 @@ bco_trainer = BCOTrainer(
model_ref,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
processing_class=tokenizer,
embedding_func=embedding_func,
embedding_tokenizer=self.embedding_tokenizer,
)
2 changes: 0 additions & 2 deletions docs/source/clis.mdx
@@ -26,8 +26,6 @@ model_name_or_path:
trl-internal-testing/tiny-random-LlamaForCausalLM
dataset_name:
stanfordnlp/imdb
dataset_text_field:
text
report_to:
none
learning_rate:
147 changes: 71 additions & 76 deletions docs/source/cpo_trainer.mdx

Large diffs are not rendered by default.

62 changes: 48 additions & 14 deletions docs/source/dataset_formats.mdx
@@ -180,6 +180,8 @@ preference_example = {"prompt": "The sky is", "chosen": " blue.", "rejected": "
preference_example = {"chosen": "The sky is blue.", "rejected": "The sky is green."}
```

Some preference datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots' DPO Collections](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) to identify preference datasets.
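
For example, you can load one of these datasets and inspect its columns to check that it follows the preference format (the exact columns vary by dataset; `trl-lib/ultrafeedback_binarized` is used here as one such preference dataset):

```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
print(dataset.column_names)  # a preference dataset is expected to include "chosen" and "rejected"
print(dataset[0]["chosen"])
```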

### Unpaired preference

An unpaired preference dataset is similar to a preference dataset but instead of having `"chosen"` and `"rejected"` completions for the same prompt, it includes a single `"completion"` and a `"label"` indicating whether the completion is preferred or not.
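
For example, an unpaired preference entry typically looks like the following (illustrative values):

```python
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}
```
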
@@ -192,20 +194,20 @@ unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "

Choosing the right dataset format depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset formats supported by each TRL trainer.

| Trainer | Expected dataset format |
| ----------------------- | ---------------------------- |
| [`BCOTrainer`] | Unpaired preference |
| [`CPOTrainer`] | Preference (explicit prompt) |
| [`DPOTrainer`] | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference |
| [`KTOTrainer`] | Unpaired preference |
| [`NashMDTrainer`] | Prompt-only |
| [`OnlineDPOTrainer`] | Prompt-only |
| [`ORPOTrainer`] | Preference (explicit prompt) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | Preference (implicit prompt) |
| [`SFTTrainer`] | Language modeling |
| [`XPOTrainer`] | Prompt-only |
| Trainer | Expected dataset format |
| ----------------------- | ------------------------------------------------------- |
| [`BCOTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`DPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`IterativeSFTTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`KTOTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`NashMDTrainer`] | [Prompt-only](#prompt-only) |
| [`OnlineDPOTrainer`] | [Prompt-only](#prompt-only) |
| [`ORPOTrainer`] | [Preference (explicit prompt)](#preference) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |
| [`SFTTrainer`] | [Language modeling](#language-modeling) |
| [`XPOTrainer`] | [Prompt-only](#prompt-only) |

<Tip>

@@ -710,3 +712,35 @@ dataset = dataset.remove_columns(["completion", "label"])
>>> dataset[0]
{'prompt': 'The sky is'}
```

## Vision datasets

Some trainers also support fine-tuning vision-language models (VLMs) using image-text pairs. In this scenario, it's recommended to use a conversational format, as each model handles image placeholders in text differently.

A conversational vision dataset differs from a standard conversational dataset in two key ways:

1. The dataset must contain the key `images` with the image data.
2. The `"content"` field in messages must be a list of dictionaries, where each dictionary specifies the type of data: `"image"` or `"text"`.

Example:

```python
# Textual dataset format:
"content": "What color is the sky?"

# Vision dataset format:
"content": [
{"type": "image"},
{"type": "text", "text": "What color is the sky in the image?"}
]
```

An example of a conversational vision dataset is the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). Below is an embedded view of the dataset's training data, allowing you to explore it directly:

<iframe
src="https://huggingface.co/datasets/trl-lib/rlaif-v/embed/viewer/default/train"
frameborder="0"
width="100%"
height="560px"
></iframe>

3 changes: 3 additions & 0 deletions docs/source/ddpo_trainer.mdx
@@ -1,4 +1,7 @@
# Denoising Diffusion Policy Optimization

[![](https://img.shields.io/badge/All_models-DDPO-blue)](https://huggingface.co/models?other=ddpo,trl)

## The why

| Before | After DDPO finetuning |