Merge branch 'main' into wpo
kashif authored Oct 7, 2024
2 parents e3f9a75 + 47d08a9 commit 60065eb
Showing 80 changed files with 1,112 additions and 864 deletions.
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -31,4 +31,4 @@ keywords:
- pytorch
- transformers
license: Apache-2.0
version: 0.2.1
version: 0.11.1
21 changes: 6 additions & 15 deletions README.md
@@ -133,7 +133,7 @@ training_args = RewardConfig(output_dir="Qwen2.5-0.5B-Reward", per_device_train_
trainer = RewardTrainer(
args=training_args,
model=model,
tokenizer=tokenizer,
processing_class=tokenizer,
train_dataset=dataset,
)
trainer.train()
@@ -166,7 +166,7 @@ dataset = dataset.map(lambda x: tokenizer(x["prompt"]), remove_columns="prompt")
training_args = RLOOConfig(output_dir="Qwen2.5-0.5B-RL")
trainer = RLOOTrainer(
config=training_args,
tokenizer=tokenizer,
processing_class=tokenizer,
policy=policy,
ref_policy=ref_policy,
reward_model=reward_model,
@@ -181,24 +181,15 @@ trainer.train()
`DPOTrainer` implements the popular [Direct Preference Optimization (DPO) algorithm](https://huggingface.co/papers/2305.18290) that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the `DPOTrainer`:

```python
from trl import DPOConfig, DPOTrainer, maybe_extract_prompt, maybe_apply_chat_template
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

dataset = load_dataset("trl-lib/Capybara-Preferences", split="train")
dataset = dataset.map(maybe_extract_prompt)
dataset = dataset.map(maybe_apply_chat_template, fn_kwargs={"tokenizer": tokenizer})

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(
args=training_args,
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
)
trainer = DPOTrainer(model=model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```

1 change: 0 additions & 1 deletion commands/run_sft.sh
@@ -41,7 +41,6 @@ accelerate launch $EXTRA_ACCELERATE_ARGS \
--dataset_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--max_steps $MAX_STEPS \
--dataset_text_field 'text' \
--per_device_train_batch_size $BATCH_SIZE \
--max_seq_length $SEQ_LEN \
$EXTRA_TRAINING_ARGS
2 changes: 2 additions & 0 deletions docs/source/alignprop_trainer.mdx
@@ -1,5 +1,7 @@
# Aligning Text-to-Image Diffusion Models with Reward Backpropagation

[![](https://img.shields.io/badge/All_models-AlignProp-blue)](https://huggingface.co/models?other=alignprop,trl)

## The why

If your reward function is differentiable, directly backpropagating gradients from the reward model to the diffusion model is significantly more sample- and compute-efficient (25x) than using a policy-gradient algorithm like DDPO.
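
As a rough sketch of the idea (this is not the AlignProp trainer's actual API; `pipeline`, `reward_model`, `prompts`, and `optimizer` are placeholder names), a differentiable reward can simply be negated and used as a loss, so gradients flow from the reward back through the sampling chain:

```python
def reward_backprop_step(pipeline, reward_model, prompts, optimizer):
    # Conceptual PyTorch-style sketch only: `pipeline` stands for a diffusion model
    # exposing a differentiable sampling path, `reward_model` for any differentiable reward.
    images = pipeline(prompts)            # keep the computation graph (no torch.no_grad())
    loss = -reward_model(images).mean()   # maximizing reward == minimizing its negative
    optimizer.zero_grad()
    loss.backward()                       # gradients flow through the diffusion steps
    optimizer.step()
    return loss.detach()
```

A policy-gradient method like DDPO, by contrast, only sees a scalar reward per sample, which is why the direct gradient path tends to be far more sample-efficient.
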
6 changes: 4 additions & 2 deletions docs/source/bco_trainer.mdx
@@ -1,5 +1,7 @@
# BCO Trainer

[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)

TRL supports Binary Classifier Optimization (BCO).
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
For a full example, have a look at [`examples/scripts/bco.py`].
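
As a minimal, hedged illustration of that objective (not the `BCOTrainer` internals; the logits and labels below are made up), the classifier is trained with a binary cross-entropy loss on its reward logits:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward logits for three (prompt, completion) pairs.
reward_logits = torch.tensor([1.3, -0.7, 0.2])
# 1.0 for chosen completions, 0.0 for rejected ones.
labels = torch.tensor([1.0, 0.0, 1.0])

loss = F.binary_cross_entropy_with_logits(reward_logits, labels)
```

The logit itself then doubles as the reward signal used for preference optimization.
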
@@ -30,7 +32,7 @@ bco_trainer = BCOTrainer(
model_ref,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
processing_class=tokenizer,
)
```
After this one can then call:
@@ -73,7 +75,7 @@ bco_trainer = BCOTrainer(
model_ref,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
processing_class=tokenizer,
embedding_func=embedding_func,
embedding_tokenizer=self.embedding_tokenizer,
)
2 changes: 0 additions & 2 deletions docs/source/clis.mdx
@@ -26,8 +26,6 @@ model_name_or_path:
trl-internal-testing/tiny-random-LlamaForCausalLM
dataset_name:
stanfordnlp/imdb
dataset_text_field:
text
report_to:
none
learning_rate:
147 changes: 71 additions & 76 deletions docs/source/cpo_trainer.mdx

Large diffs are not rendered by default.

62 changes: 48 additions & 14 deletions docs/source/dataset_formats.mdx
@@ -180,6 +180,8 @@ preference_example = {"prompt": "The sky is", "chosen": " blue.", "rejected": "
preference_example = {"chosen": "The sky is blue.", "rejected": "The sky is green."}
```

Some preference datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots' DPO Collections](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) to identify preference datasets.
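
For example, you can load one of these datasets and inspect its columns to check that it follows the preference format (the exact columns vary by dataset; `trl-lib/ultrafeedback_binarized` is used here as one such preference dataset):

```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
print(dataset.column_names)  # a preference dataset is expected to include "chosen" and "rejected"
print(dataset[0]["chosen"])
```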

### Unpaired preference

An unpaired preference dataset is similar to a preference dataset but instead of having `"chosen"` and `"rejected"` completions for the same prompt, it includes a single `"completion"` and a `"label"` indicating whether the completion is preferred or not.
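
For example, an unpaired preference entry typically looks like the following (illustrative values):

```python
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}
```
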
@@ -192,20 +194,20 @@ unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "

Choosing the right dataset format depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset formats supported by each TRL trainer.

| Trainer | Expected dataset format |
| ----------------------- | ---------------------------- |
| [`BCOTrainer`] | Unpaired preference |
| [`CPOTrainer`] | Preference (explicit prompt) |
| [`DPOTrainer`] | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference |
| [`KTOTrainer`] | Unpaired preference |
| [`NashMDTrainer`] | Prompt-only |
| [`OnlineDPOTrainer`] | Prompt-only |
| [`ORPOTrainer`] | Preference (explicit prompt) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | Preference (implicit prompt) |
| [`SFTTrainer`] | Language modeling |
| [`XPOTrainer`] | Prompt-only |
| Trainer | Expected dataset format |
| ----------------------- | ------------------------------------------------------- |
| [`BCOTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`DPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`IterativeSFTTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`KTOTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`NashMDTrainer`] | [Prompt-only](#prompt-only) |
| [`OnlineDPOTrainer`] | [Prompt-only](#prompt-only) |
| [`ORPOTrainer`] | [Preference (explicit prompt)](#preference) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |
| [`SFTTrainer`] | [Language modeling](#language-modeling) |
| [`XPOTrainer`] | [Prompt-only](#prompt-only) |

<Tip>

@@ -710,3 +712,35 @@ dataset = dataset.remove_columns(["completion", "label"])
>>> dataset[0]
{'prompt': 'The sky is'}
```

## Vision datasets

Some trainers also support fine-tuning vision-language models (VLMs) using image-text pairs. In this scenario, it's recommended to use a conversational format, as each model handles image placeholders in text differently.

A conversational vision dataset differs from a standard conversational dataset in two key ways:

1. The dataset must contain the key `images` with the image data.
2. The `"content"` field in messages must be a list of dictionaries, where each dictionary specifies the type of data: `"image"` or `"text"`.

Example:

```python
# Textual dataset format:
"content": "What color is the sky?"

# Vision dataset format:
"content": [
{"type": "image"},
{"type": "text", "text": "What color is the sky in the image?"}
]
```

An example of a conversational vision dataset is the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). Below is an embedded view of the dataset's training data, allowing you to explore it directly:

<iframe
src="https://huggingface.co/datasets/trl-lib/rlaif-v/embed/viewer/default/train"
frameborder="0"
width="100%"
height="560px"
></iframe>

3 changes: 3 additions & 0 deletions docs/source/ddpo_trainer.mdx
@@ -1,4 +1,7 @@
# Denoising Diffusion Policy Optimization

[![](https://img.shields.io/badge/All_models-DDPO-blue)](https://huggingface.co/models?other=ddpo,trl)

## The why

| Before | After DDPO finetuning |