Trainer: add predict with generate #32346

Open · wants to merge 18 commits into main

Conversation

@zucchini-nlp (Member) commented Jul 31, 2024

What does this PR do?

Fixes #26474, fixes #31462, fixes #33396 and fixes #31672. This PR adds the possibility to generate and compute metrics on the generated text for decoder-only models.

The basic idea is almost the same as in the Seq2Seq Trainer, but decoder-only models need a prompt-only input for generation, while loss computation needs the whole input. Therefore we ask users to prepare the train and eval datasets so that the eval dataset contains the generation inputs used for generation. Additionally, to make users' lives easier, I added the possibility to pass different collators for the train and eval/test datasets.

The args used for generation should be set via a GenerationConfig, as IMO that makes more sense than adding only max_length and num_beams as in Seq2SeqTrainer.

The code was tested with the dummy training script below.

import random
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration, TrainingArguments, Trainer

DEVICE = "cuda:0"
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
pad_token_id = processor.tokenizer.pad_token_id

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
    use_dora=False,
    init_lora_weights="gaussian"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

model.add_adapter(lora_config)
model.enable_adapters()


eval_dataset = load_dataset("nielsr/docvqa_1200_examples", split="test")
eval_dataset = eval_dataset.remove_columns(['id', 'words', 'bounding_boxes', 'answer'])
eval_dataset = eval_dataset.select(range(10))


class DataCollatorForGeneration:
    def __init__(self, processor, eval_mode=False):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        images = []
        for example in examples:
            image = example["image"]
            question = example["query"]["en"]
            answer = random.choice(example["answers"])
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Answer briefly."},
                        {"type": "image"},
                        {"type": "text", "text": question}
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            text_eval = processor.apply_chat_template([messages[0]], add_generation_prompt=True)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())
            images.append([image])

        # Make sure we have right padding in train and left padding for eval parts
        processor.tokenizer.padding_side = "right"
        batch = processor(text=texts, images=images, return_tensors="pt", padding=True) 
        
        if self.eval_mode:
            processor.tokenizer.padding_side = "left"
            batch_eval = processor(text=texts, images=images, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']

        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels

        return batch

gen_config = model.generation_config
gen_config.max_length = 200

training_args = TrainingArguments(
    max_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=5,
    gradient_accumulation_steps=2,
    output_dir="tmp_delete",
    eval_strategy="steps",
    fp16=True,
    remove_unused_columns=False,
    report_to="none",
    predict_with_generate=True,
    generation_config=gen_config,
)

def custom_metrics(prediction_dict):
    # unmask for correct detokenization, because preds are padded to max length with -100
    preds = prediction_dict.predictions
    preds[preds == -100] = pad_token_id
    lbls = prediction_dict.label_ids
    lbls[lbls == -100] = pad_token_id

    # Decode and do magic for metrics
    preds = processor.batch_decode(preds)
    lbls = processor.batch_decode(lbls)
    bleu = rouge = 0
    return {"bleu" : bleu, "rouge": rouge}


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForGeneration(processor),
    eval_data_collator=DataCollatorForGeneration(processor, eval_mode=True),
    train_dataset=eval_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=custom_metrics,
)

print(trainer.evaluate())

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik
Member

This seems good to me from a quick read, pinging @SunMarc and @muellerzr who have more experience with that code than I do.

@salrowili

Can you please show an example of how this code can work with an HF text dataset (not a multimodal dataset) without the Idefics2 processor, i.e. using tokenizer.apply_chat_template? How would right and left padding be handled in this case?

@zucchini-nlp
Member Author

@salrowili it should be similar to Idefics, with the only difference that instead of processor.tokenizer you simply have tokenizer. The main thing to note is that the Trainer needs inputs for loss calculation, which are prepared the same way as always, and it also needs inputs for generation, which should be prefixed with generation_ and left-padded.

Below is a modified version of the Idefics script; it should work for text-only models.

import random

class DataCollatorForGeneration:
    def __init__(self, tokenizer, eval_mode=False):
        self.tokenizer = tokenizer
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        for example in examples:
            question = example["query"]["en"]
            answer = random.choice(example["answers"])
            messages = [
                {
                    "role": "user",
                    "content": f"Answer question: {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
            text_eval = self.tokenizer.apply_chat_template([messages[0]], add_generation_prompt=True, tokenize=False)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())

        # Make sure we have right padding for train and left padding for the eval (generation) parts
        self.tokenizer.padding_side = "right"
        batch = self.tokenizer(text=texts, return_tensors="pt", padding=True)

        if self.eval_mode:
            self.tokenizer.padding_side = "left"
            batch_eval = self.tokenizer(text=texts, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']

        labels = batch["input_ids"].clone()
        labels[labels == self.tokenizer.pad_token_id] = -100  # Ignore index for CE loss
        batch["labels"] = labels
        return batch

@salrowili

salrowili commented Aug 2, 2024


@zucchini-nlp Thank you for the update. I have added some lines to the code to make a complete example for a QA task.

from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig,TrainingArguments, Trainer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
from peft import LoraConfig
import torch
from torchmetrics.text import SQuAD
from random import randrange
from transformers.utils import logging

dataset = load_dataset("Stanford/web_questions")

train_dataset=dataset["train"]
eval_dataset=dataset["test"]

eval_dataset = eval_dataset.select(range(256))

quant_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)


model_id="meta-llama/Meta-Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.add_special_tokens({"pad_token":"</s>"})
pad_token_id = tokenizer.pad_token_id


model.resize_token_embeddings(len(tokenizer),pad_to_multiple_of=8)


gen_config = model.generation_config
gen_config.max_length = 256

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model.add_adapter(peft_config)
model.enable_adapters()

tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ '  ' }}{% endif %}{% endfor %}{{ eos_token }}"

class DataCollatorForGeneration:
    def __init__(self, tokenizer, eval_mode=False):
        self.tokenizer = tokenizer
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        for example in examples:
            question = example["question"]
            answer = example["answers"][0]  # the WebQuestions dataset has multiple answers; to keep the code simple we choose the first one
            messages = [
                {
                    "role": "user",
                    "content": f"Answer the following question: {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            
            text = tokenizer.apply_chat_template(messages, add_generation_prompt=False,tokenize=False)
            text_eval = tokenizer.apply_chat_template([messages[0]],add_generation_prompt=True,tokenize=False)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())
            ## uncomment to check template format
            # print(text)
            #print(text_eval)
            #exit()
        # Make sure we have right padding in train and left padding for eval parts
        tokenizer.padding_side = "right"
        batch = tokenizer(text=texts, return_tensors="pt", padding=True) 
        
        if self.eval_mode:
            tokenizer.padding_side = "left"
            batch_eval = tokenizer(text=texts_eval, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']
        labels = batch["input_ids"].clone()
        labels[labels == tokenizer.pad_token_id] = -100 # Ignore index for CE-loss
        batch["labels"] = labels
        return batch




def custom_metrics(prediction_dict):
    # unmask for correct detokenization, because preds are padded to max length with -100
    preds = prediction_dict.predictions
    preds[preds == -100] = pad_token_id
    lbls = prediction_dict.label_ids
    lbls[lbls == -100] = pad_token_id

    # Decode and do magic for metrics
    preds = tokenizer.batch_decode(preds,skip_special_tokens=True)
    lbls = tokenizer.batch_decode(lbls,skip_special_tokens=True)
    ## uncomment if you want to see all special token (e.g, EOS)
    #preds = tokenizer.batch_decode(preds)
    #lbls = tokenizer.batch_decode(lbls)
    print("\n\n\n",'=' * 40,"Labels",'=' * 40)
    for item_x in lbls[:5]:
        print(item_x,"\n")
    print("\n",'=' * 40,"Predictions",'=' * 40)
    for item_x in preds[:5]:
        print(item_x,"\n")
    print("\n",'=' * 80)
    pred_list = []
    label_list = []
    ## visit https://lightning.ai/docs/torchmetrics/stable/text/squad.html for reference ##
    for idx, (x, y) in enumerate(zip(preds, lbls)):
        pred_list.append({"prediction_text": x.split("?")[1], "id": idx})
        label_list.append({"answers": {"text": [y.split("?")[1]]}, "id": idx})
    squad = SQuAD()(pred_list, label_list)
    em_score = squad["exact_match"].item()
    f1_score = squad["f1"].item()
    return {"exact_match": em_score, "f1_score": f1_score}

def preprocess_logits_for_metrics(logits, labels):
    """Helper function for logits preprocessing for metrics"""
    preds = torch.argmax(logits, dim=-1)
    return preds, labels

training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=128,
    num_train_epochs=20,
    do_train=True,
    do_eval=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500000,
    bf16=True,
    output_dir="./test_predict",
    overwrite_output_dir=True,
    optim="adafactor",
    report_to="none",
    logging_steps=100000,
    remove_unused_columns=False,
    predict_with_generate=True,
    generation_config=gen_config)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForGeneration(tokenizer),
    eval_data_collator=DataCollatorForGeneration(tokenizer, eval_mode=True),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=custom_metrics,
)

trainer.train()

Here are my comments and questions to think about:

  • Evaluation is very slow. I had to increase the batch size to 128. Usually with the Seq2Seq trainer a batch size beyond 8 would not affect the speed, but in this example a larger batch size led to faster inference and evaluation. This becomes a problem with XLA (e.g. TPU) with FSDP: the evaluation will freeze forever.
  • Is there any way to speed up the evaluation with packing=True?
  • I think we need to integrate the code into the trl library, including all the trainer code in the trl repo (e.g., https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py).
  • The code needs report_to to be set to "none". Otherwise there will be a problem with the generation config format: the code you posted accepts the generation config in JSON format, whereas model.generation_config can be different (e.g., for Llama).
  • We need to feed the prompt (e.g., the instruction and question) to compute_metrics or preprocess_logits_for_metrics in the Trainer and SFT trainer. This way we can split the completion (e.g., the answer) from the whole sequence (e.g., instruction + question).
  • In this code I did not use the preprocess_logits_for_metrics function. Do you think doing some of the work in preprocess_logits_for_metrics could speed up the evaluation?

The output for this code is pretty decent:


 ======================================== Labels ========================================
Answer the following question: what does jamaican people speak?  Jamaican Creole English Language

Answer the following question: what did james k polk do before he was president?  Lawyer

Answer the following question: what is the oregon ducks 2012 football schedule?  University of Oregon

Answer the following question: who plays ken barlow in coronation street?  Tony Warren

Answer the following question: what happened after mr. sugihara died?  Yaotsu


 ======================================== Predictions ========================================
Answer the following question: what does jamaican people speak?  Jamaican Creole English Language#include<fstream.h>ndependent institution?

Answer the following question: what did james k polk do before he was president?  Lawyer(Redirected from James K. Polk

Answer the following question: what is the oregon ducks 2012 football schedule?  University of Oregon"""
University of Oregon Ducks football

Answer the following question: who plays ken barlow in coronation street?  Tony Warrendef乍

Answer the following question: what happened after mr. sugihara died?  Yaotsu``

Thank you

@zucchini-nlp
Member Author

Thanks for the feedback and for testing the feature!

  • Hmm, let me look into the speed issue. My guess is that generation in general requires more time, since you are doing extra forward passes through the model; increasing the batch size makes it faster because you now generate 128 examples in one go. I will see if I can make generate faster; since we already do two forward passes, we could somehow reuse the past key values. I will also look into packing.
  • Regarding the TRL library, let's merge the PR in transformers first; we can port the changes to TRL if everyone is happy with the code.
  • Are you reporting to WandB? I'll try reporting and see how to make it work.
  • The question about prompts should be doable with the current implementation if we ask the Trainer to include_inputs_for_metrics; IIUC that will let us see the prompt used by the model and the whole generated completion. A rough sketch is below.
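A rough sketch of that idea (illustrative only; include_inputs_for_metrics is an existing TrainingArguments flag, predict_with_generate comes from this PR, and tokenizer / pad_token_id are the ones defined in the scripts above):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tmp",
    predict_with_generate=True,          # added by this PR
    include_inputs_for_metrics=True,     # surfaces the model inputs to compute_metrics
    remove_unused_columns=False,
    report_to="none",
)

def compute_metrics(eval_pred):
    preds = eval_pred.predictions
    preds[preds == -100] = pad_token_id
    # eval_pred.inputs holds the input_ids when include_inputs_for_metrics=True
    prompts = tokenizer.batch_decode(eval_pred.inputs, skip_special_tokens=True)
    texts = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # strip the prompt from each generated sequence so only the completion is scored
    completions = [t[len(p):] if t.startswith(p) else t for t, p in zip(texts, prompts)]
    return {"num_completions": len(completions)}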

@zucchini-nlp
Member Author

It seems there's not much we can do about the long evaluation time when generating. I tracked how long it takes with decoder-only and encoder-decoder models. Indeed, most encoder-decoder models are fast mainly because they're lightweight, while the model you tried is 8B parameters. I did several checks to verify that the evaluation speed is approximately the same for models of similar size. Increasing the batch size is one option to generate faster, as you already tried. Another option is to generate only on a small sample of the eval set, and let users enable generation on the whole dataset if they want to.

Also, logging to WandB is working for me, and the generation config is logged as a simple dict. Can you share what errors you got there, @salrowili?

@salrowili

Hi @zucchini-nlp. When I say that prediction is slow, I am comparing it to the script here https://huggingface.co/docs/trl/en/sft_trainer, which is much faster. I think one possible way to solve this problem is to integrate your code into the SFTTrainer class from the trl repo and see if the speed changes. Another way is to do it through eval_packing, which groups a couple of examples together to fill the sequence; see https://github.com/huggingface/trl/blob/314e8eb367cbfaf74c2e9717085346360e779508/trl/trainer/sft_trainer.py#L110.

For WandB logging, to reproduce the error, just change report_to from "none" to "wandb" and you will get the error. But this issue is minor, as we can overcome it by using wandb.init and wandb.log inside the code itself.

@zucchini-nlp
Member Author

OK, so it's SFTTrainer; then I'll see what is different there. For packing, we can calculate the loss with packing but not generate, since generation tries to continue a sequence for the next several tokens, and a packed sequence may contain more than one prompt.

In general, we had an idea to try out torch.nested_tensor, which is the closest thing to packing, but it won't happen soon.

@shubhamjain0594

shubhamjain0594 commented Aug 28, 2024

Thanks for this PR @zucchini-nlp, hoping it gets merged soon. I am using something similar internally to train decoder-only models for information extraction. I saw the concern that this is slower than the traditional SFT Trainer, which is something I experienced as well.

My belief is that this is mainly because in the SFT Trainer, during prediction, the (n+1)-th token is predicted from the previous n tokens that come from the prompt + ground truth, while in this case the previous n tokens come from the prompt + predictions. So it cannot be parallelized the same way as in the SFT Trainer, where you can literally predict the (n+1)-th, (n+2)-th, ... tokens in parallel.

My belief comes from the fact that I saw a drop in eval performance and an increase in time when using predict_with_generate, compared to using SFTTrainer as is.
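For illustration, a minimal sketch (not from this PR) of the difference described above, using plain transformers model calls:

import torch

@torch.no_grad()
def teacher_forced_preds(model, input_ids, attention_mask):
    # one forward pass scores every position, conditioned on the ground-truth prefix
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return logits.argmax(dim=-1)

@torch.no_grad()
def generated_preds(model, prompt_ids, prompt_mask, max_new_tokens=64):
    # autoregressive decoding: roughly one forward pass per new token,
    # each step conditioned on the model's own previous predictions
    return model.generate(input_ids=prompt_ids, attention_mask=prompt_mask, max_new_tokens=max_new_tokens)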

for k, v in generation_inputs.items()
if k.replace("generation_", "") not in gen_keys and "generation" not in k
}
generated_tokens = self.model.generate(


Shall we use model instead of self.model here? In evaluation_loop(), the model is wrapped, and the wrapped model may not always be the same as self.model. I think this matters when DeepSpeed ZeRO-3 is enabled and eval_on_start is set to true.

Member Author

For inference we don't wrap the model in distributed mode, but I changed it to model because there are some other steps that run before the model is returned. The original code was adapted from the Seq2Seq trainer, so I modified it there too:

# Note: in torch.distributed mode, there's no point in wrapping the model
# inside a DistributedDataParallel as we'll be under `no_grad` anyways.
if not training:
    return model

@zucchini-nlp
Member Author

@shubhamjain0594 Yes, that's exactly what I meant: generation is expected to take more time than a simple forward pass. Following the last comment from @salrowili, I compared the SFT Trainer and the HF Trainer, with and without generation. I don't see any slowdown caused by the HF Trainer specifically, as both of them rely on the same code for training and evaluation. The only difference is that SFT supports packed datasets while the HF Trainer doesn't.

The current implementation of predict_with_generate doesn't support packing for the reasons mentioned above. To be more specific, generation cannot happen if we pack the dataset, because we don't support it in transformers yet. The model generates the next token given all previous tokens, and packing would result in several prompts being merged and thus bad generation quality.

I think we can let users use a small sample of the eval set for generation if they don't want to slow down the evaluation loop (a small sketch is below). Applying generation optimization tricks here might not be the optimal solution, as we are trying to verify how well the model is learning. The only technique I can think of is torch.compile, but it is still a very new feature and I would rather not integrate it in the HF Trainer yet.
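A minimal sketch of that option (illustrative; it reuses the names from the scripts above and the eval_data_collator argument added by this PR):

# run generation-based evaluation on a small random subset only
small_eval_dataset = eval_dataset.shuffle(seed=42).select(range(64))

trainer = Trainer(
    model=model,
    args=training_args,  # with predict_with_generate=True
    data_collator=DataCollatorForGeneration(tokenizer),
    eval_data_collator=DataCollatorForGeneration(tokenizer, eval_mode=True),
    train_dataset=train_dataset,
    eval_dataset=small_eval_dataset,  # generate only on 64 examples
    compute_metrics=custom_metrics,
)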

As for WandB, it worked for me in both SFT and HF Trainer; the generation config is logged as a dict in the parameters.

So, I think I can request a review from @muellerzr now. The PR isn't very high priority, so feel free to take a look whenever you have bandwidth.

@SunMarc (Member) left a comment


Thanks for this PR @zucchini-nlp! This PR looks very good. I left a few comments! In order to simplify things, I'm thinking that it would be easier to just add a generation_kwargs. WDYT @muellerzr @zucchini-nlp?

Comment on lines +4232 to +4237
synced_gpus = gen_kwargs.get("synced_gpus", default_synced_gpus)
if len(gen_kwargs) > 0:
    unused_kwargs = gen_config.update(**gen_kwargs)
    if unused_kwargs:
        logger.warning_once(
            "Following generation related kwargs were passed to `prediction_step` but not "
@SunMarc (Member) commented Aug 30, 2024


I think that if you pass synced_gpus in gen_kwargs, the warning will appear since it will end up in unused_kwargs. Maybe use pop instead. Also, this will trigger the warning in other places too; a small illustration is below.
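A small self-contained illustration of the suggestion (names follow the snippet above; this is not the PR diff):

from transformers import GenerationConfig

gen_kwargs = {"max_new_tokens": 32, "synced_gpus": True}
gen_config = GenerationConfig()

# pop instead of get, so generate-only kwargs never reach GenerationConfig.update()
synced_gpus = gen_kwargs.pop("synced_gpus", False)
unused_kwargs = gen_config.update(**gen_kwargs)  # returns the kwargs it did not consume
assert "synced_gpus" not in unused_kwargs        # hence no spurious warning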

test_dataset: Dataset,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "test",
**gen_kwargs,
Member

you also need to add it for prediction_step

Comment on lines 3929 to 3930
# Therefore, generation_config should be available
self.gen_config = self.model.generation_config
Member

Shouldn't we also update the config with gen_kwargs?

Member Author

You mean add kwargs from model.config to the generation config? It shouldn't be necessary, because the base model.generation_config should contain all generation-related kwargs after the model is loaded. So we just need to make sure user-passed kwargs have higher priority than trainer.generation_config.

Member

I'm talking about the gen_kwargs that you are passing in predict. I would expect self.gen_config to be updated when the user passes gen_kwargs to the predict function in all cases (important in the case where we pass a generate kwarg such as synced_gpus). By default it is equal to self.model.generation_config, but if the user passes one in TrainingArguments, it will be equal to self.args.generation_config.

Member Author

Ah, I see now, right, we should be updating it in any case.
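A rough sketch of that precedence (illustrative, not the PR diff): start from args.generation_config when given, otherwise model.generation_config, and let call-site gen_kwargs override either.

import copy

def resolve_generation_config(model, args_generation_config=None, **gen_kwargs):
    base = args_generation_config if args_generation_config is not None else model.generation_config
    gen_config = copy.deepcopy(base)                 # avoid mutating the original config in place
    unused_kwargs = gen_config.update(**gen_kwargs)  # user-passed kwargs take priority
    return gen_config, unused_kwargs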

Comment on lines 3811 to 3827
# Set generation-related kwargs
if self.args.predict_with_generate:
    if self.args.generation_config is not None:
        gen_config = self.args.generation_config
        self.gen_config = copy.deepcopy(gen_config)  # copy so we don't modify args.gen_config in-place
        unused_kwargs = self.gen_config.update(**gen_kwargs)
        if unused_kwargs:
            logger.warning_once(
                f"Following generation related kwargs were passed to `evaluate` but not used by `generate()`: "
                f"{' '.join(unused_kwargs.keys())} .",
                "Make sure there are no typos in the passed kwargs or do not pass unused kwargs.",
            )
    else:
        # We assume the model can generate if predict-with-generate is True
        # Therefore, generation_config should be available
        self.gen_config = self.model.generation_config

Member

same comment here

Comment on lines 1523 to 1532
generation_config: Optional[GenerationConfig] = field(
    default=None,
    metadata={
        "help": (
            "The GenerationConfig that will be used during prediction. Args from this config ",
            "will have higher priority than model's generation config. Anything not set by this config ",
            "will fallback to `model.generation_config`.",
        )
    },
)
Member

I think we would simplify things a bit if we also added a generation_kwargs, as this is incompatible with generation_config and I don't think we want to merge both arguments into one. WDYT @muellerzr?

Member Author

Hmm, maybe we can then allow users to pass generation_config as a dict as well, and we can build a GenerationConfig object from it ourselves. I see that the Seq2Seq trainer args also use a config arg, so I thought we could later merge the Seq2Seq args with TrainingArguments.

Member

That would be better, I think! This way, we won't need to have **gen_kwargs in the evaluate and predict functions. cc @muellerzr @gante

Member Author

OK, now we accept either a dict or a GenerationConfig object in the training args.
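For illustration, both forms the training args would then accept on this branch (predict_with_generate and generation_config are the fields added by this PR):

from transformers import GenerationConfig, TrainingArguments

args_with_config = TrainingArguments(
    output_dir="tmp",
    predict_with_generate=True,
    generation_config=GenerationConfig(max_new_tokens=64, num_beams=1),
)

args_with_dict = TrainingArguments(
    output_dir="tmp",
    predict_with_generate=True,
    generation_config={"max_new_tokens": 64, "num_beams": 1},  # converted to a GenerationConfig internally
)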

@qiuosier

qiuosier commented Aug 31, 2024

Hi @zucchini-nlp! Thank you for adding this PR. I have been testing it and I have a few questions/thoughts:

  • When DeepSpeed ZeRO-3 is enabled, I am hitting the error mentioned in Bugfix for generation with an early-stopping process #32641 during model.generate(). Hope we can merge that as well @SunMarc.
  • For the slower performance, I think this is expected, since the new calls to model.generate() involve iterative decoding, which may take many forward passes. My understanding is that there is one forward pass for each new token, while the existing call to model() involves only a single forward pass per input sample.
  • For packing in the SFT trainer, I think we can set packing=True and eval_packing=False to pack only the training dataset.
  • In your example code, I think you intended to get generation_input_ids from texts_eval (instead of texts). Otherwise, as we can see in the example outputs from @salrowili, every prediction includes the entire label; see the corrected excerpt below.
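For reference, the relevant lines of the Idefics collator above with that fix applied (texts_eval instead of texts):

if self.eval_mode:
    processor.tokenizer.padding_side = "left"
    batch_eval = processor(text=texts_eval, images=images, return_tensors="pt", padding=True)
    batch["generation_input_ids"] = batch_eval["input_ids"]
    batch["generation_attention_mask"] = batch_eval["attention_mask"]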

@zucchini-nlp
Member Author

@qiuosier Yes, in SFT one can pack the train dataset and not pack the evaluation set. I am not 100% sure it works with SFT out of the box, since AFAIR SFT doesn't accept the predict_with_generate arg. We can work on adding SFT support after this PR is merged.

  • In your example code, I think you intended to get generation_input_ids from texts_eval (instead of texts).

Oh, right, a typo hehe

@qiuosier

qiuosier commented Sep 4, 2024

Hi @zucchini-nlp, SFTConfig (the args argument) is a subclass of TrainingArguments, so once your version of transformers is installed, we will be able to use predict_with_generate in the args for the SFTTrainer. I had to set dataset_kwargs={"skip_prepare_dataset": True} to customize the data preparation in __init__() and use a customized data collator, though, because by default the SFTTrainer tokenizes (and packs) the data in the __init__() call and the data collator is used for padding only. I think tokenizing the data in __init__() is more efficient, as we only need to do it once instead of every epoch in the data collator.
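A rough sketch of that setup (assuming this PR's branch of transformers, a trl version whose SFTConfig exposes dataset_kwargs, and the model / collator / dataset names from the script above; the eval-side generation_* keys would still need the separate eval collator once SFTTrainer forwards eval_data_collator):

from trl import SFTConfig, SFTTrainer

sft_config = SFTConfig(
    output_dir="tmp",
    predict_with_generate=True,                      # inherited from this PR's TrainingArguments
    generation_config=gen_config,
    remove_unused_columns=False,
    dataset_kwargs={"skip_prepare_dataset": True},   # keep raw examples; tokenize in the collator
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    data_collator=DataCollatorForGeneration(tokenizer),  # custom collator instead of SFT's default
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=custom_metrics,
)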

@zucchini-nlp
Member Author

@qiuosier OK, cool, then it might work out of the box. I didn't really test it yet.

@pdufour

pdufour commented Nov 1, 2024

Any chance this can be merged?

@pdufour

pdufour commented Nov 3, 2024

@zucchini-nlp I've been testing your branch and there were a couple of issues that I fixed regarding its use with SFTTrainer:

  1. [Draft] Add eval_data_collator arg trl#2311
  2. [Draft] Add autocast to prediction_step for SFTTrainer trl#2310

These can land after the transformers PR lands and a new version is released. I've tried to help you re-merge main into this branch, but when I made a PR it was not what I thought it would be - zucchini-nlp#1. There weren't many changes though, just a couple of conflicts. Let me know if there's anything I can do to help.

@zucchini-nlp
Member Author

@pdufour hey! Thanks for opening PRs to integrate the feature in TRL. I agree that we should first get this merged in transformers.

As for progress, I am waiting for a review from @muellerzr. Can you review, please? 🤗 I will rebase onto main in a few hours.

@R4ZZ3

R4ZZ3 commented Nov 5, 2024

Hopefully this gets merged soon. I thought it would be easy to implement some custom metrics to calculate things like MT-Bench scores at every eval step, as previously something like BLEU/WER was easy to calculate using just the evaluate library.

@tcz

tcz commented Nov 26, 2024

So am I correct to assume that all existing collators, e.g. DataCollatorForCompletionOnlyLM, will need to be modified so that they also produce generation_input_ids and generation_attention_mask?

If so, that's less than ideal.
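For illustration, one way this could be avoided is a thin wrapper that adds the generation_* keys around an existing collator; GenerationCollatorWrapper below is hypothetical (not an existing class in transformers or trl), and the prompt field name is an assumption:

class GenerationCollatorWrapper:
    def __init__(self, base_collator, tokenizer, prompt_key="prompt"):
        self.base_collator = base_collator
        self.tokenizer = tokenizer
        self.prompt_key = prompt_key  # assumed field holding the prompt-only text

    def __call__(self, examples):
        # pull the raw prompts out before handing the examples to the base collator
        prompts = [example.pop(self.prompt_key) for example in examples]
        batch = self.base_collator(examples)  # e.g. DataCollatorForCompletionOnlyLM

        self.tokenizer.padding_side = "left"  # generation inputs must be left-padded
        gen_batch = self.tokenizer(prompts, return_tensors="pt", padding=True)
        batch["generation_input_ids"] = gen_batch["input_ids"]
        batch["generation_attention_mask"] = gen_batch["attention_mask"]
        return batch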
