Trainer: add predict with generate #32346
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This seems good to me from a quick read, pinging @SunMarc and @muellerzr who have more experience with that code than I do.
Can you please show an example of how this code can work with an HF text dataset (not a multimodal dataset) without the Idefics2 processor? I mean using tokenizer.apply_chat_template. How would right and left padding be handled in this case?
@salrowili it should be similar to Idefics, the only difference being that instead of the processor you call the tokenizer directly. Below is a modified version of the Idefics script; it should work for text models.

import random

class DataCollatorForGeneration:
    def __init__(self, tokenizer, eval_mode=False):
        self.tokenizer = tokenizer
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        for example in examples:
            question = example["query"]["en"]
            answer = random.choice(example["answers"])
            messages = [
                {
                    "role": "user",
                    "content": f"Answer question: {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            text = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
            text_eval = tokenizer.apply_chat_template([messages[0]], add_generation_prompt=True, tokenize=False)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())

        # Make sure we have right padding for train and left padding for the eval parts
        tokenizer.padding_side = "right"
        batch = tokenizer(text=texts, return_tensors="pt", padding=True)
        if self.eval_mode:
            tokenizer.padding_side = "left"
            batch_eval = tokenizer(text=texts_eval, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']

        labels = batch["input_ids"].clone()
        labels[labels == tokenizer.pad_token_id] = -100  # Ignore index for CE loss
        batch["labels"] = labels
        return batch
@zucchini-nlp Thank you for the update. I have added some lines to the code to make a complete example for a QA task.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
from peft import LoraConfig
import torch
from torchmetrics.text import SQuAD
from random import randrange
from transformers.utils import logging
dataset = load_dataset("Stanford/web_questions")
train_dataset=dataset["train"]
eval_dataset=dataset["test"]
eval_dataset = eval_dataset.select(range(256))
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_id="meta-llama/Meta-Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=quant_config,
device_map="auto",
torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token":"</s>"})
pad_token_id = tokenizer.pad_token_id
model.resize_token_embeddings(len(tokenizer),pad_to_multiple_of=8)
gen_config = model.generation_config
gen_config.max_length = 256
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model.add_adapter(peft_config)
model.enable_adapters()
tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
class DataCollatorForGeneration:
    def __init__(self, tokenizer, eval_mode=False):
        self.tokenizer = tokenizer
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        for example in examples:
            question = example["question"]
            answer = example["answers"][0]  # the WebQuestions dataset has multiple answers, so to keep the code simple we choose the first one
            messages = [
                {
                    "role": "user",
                    "content": f"Answer the following question: {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            text = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
            text_eval = tokenizer.apply_chat_template([messages[0]], add_generation_prompt=True, tokenize=False)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())
            ## Uncomment to check the template format
            # print(text)
            # print(text_eval)
            # exit()

        # Make sure we have right padding for train and left padding for the eval parts
        tokenizer.padding_side = "right"
        batch = tokenizer(text=texts, return_tensors="pt", padding=True)
        if self.eval_mode:
            tokenizer.padding_side = "left"
            batch_eval = tokenizer(text=texts_eval, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']

        labels = batch["input_ids"].clone()
        labels[labels == tokenizer.pad_token_id] = -100  # Ignore index for CE loss
        batch["labels"] = labels
        return batch
def custom_metrics(prediction_dict):
    # Unmask for correct detokenization, because preds are padded to max length with -100
    preds = prediction_dict.predictions
    preds[preds == -100] = pad_token_id
    lbls = prediction_dict.label_ids
    lbls[lbls == -100] = pad_token_id

    # Decode and compute metrics
    preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    lbls = tokenizer.batch_decode(lbls, skip_special_tokens=True)
    ## Uncomment if you want to see all special tokens (e.g. EOS)
    # preds = tokenizer.batch_decode(preds)
    # lbls = tokenizer.batch_decode(lbls)

    print("\n\n\n", '=' * 40, "Labels", '=' * 40)
    for item_x in lbls[:5]:
        print(item_x, "\n")
    print("\n", '=' * 40, "Predictions", '=' * 40)
    for item_x in preds[:5]:
        print(item_x, "\n")
    print("\n", '=' * 80)

    ## Visit https://lightning.ai/docs/torchmetrics/stable/text/squad.html for reference ##
    pred_list = []
    label_list = []
    for idx, (x, y) in enumerate(zip(preds, lbls)):
        pred_list.append({"prediction_text": x.split("?")[1], "id": idx})
        label_list.append({"answers": {"text": [y.split("?")[1]]}, "id": idx})
    squad = SQuAD()(pred_list, label_list)
    em_score = squad["exact_match"].item()
    f1_score = squad["f1"].item()
    return {"exact_match": em_score, "f1_score": f1_score}
def preprocess_logits_for_metrics(logits, labels):
    """Helper function for logits preprocessing for metrics."""
    preds = torch.argmax(logits, dim=-1)
    return preds, labels
training_args = TrainingArguments(
per_device_train_batch_size=8,
per_device_eval_batch_size=128,
num_train_epochs=20,
do_train=True,
do_eval=True,
eval_strategy="steps",
eval_steps=500,
save_steps=500000,
bf16=True,
output_dir="./test_predict",
overwrite_output_dir=True,
optim="adafactor",
report_to="none",
logging_steps=100000,
remove_unused_columns=False,
predict_with_generate=True,
generation_config=gen_config)
trainer = Trainer(
model=model,
args=training_args,
data_collator=DataCollatorForGeneration(tokenizer),
eval_data_collator=DataCollatorForGeneration(tokenizer, eval_mode=True),
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=custom_metrics,
)
trainer.train()

Here are my comments and questions to think about:
The output of this code is pretty decent:
Thank you
Thanks for the feedback and for testing the feature!
Seems like there's not much we can do about the long evaluation time when generating. I tried to track how long it takes with decoder-only and encoder-decoder models. Indeed, most encoder-decoder models are fast mainly because they're lightweight, while the model you tried is 8B parameters. I did several checks to verify that the evaluation speed is approximately the same for models of similar size. Increasing the batch size is one option to generate faster, as you tried already. Another option is to generate only on a small sample of the eval set, and let users enable generation on the whole dataset if they want to. Also, logging to WandB is working for me, and the generation config is logged as a simple dict. Can you share what errors you got there, @salrowili?
Hi @zucchini-nlp. When I state that the prediction is slow, I compare it to this script here https://huggingface.co/docs/trl/en/sft_trainer, which is much faster. I think one possible way to solve this problem is to integrate your code into the SFTTrainer class from the trl repo and see if the speed changes. Another way is to do it through eval_packing, which will group a couple of examples together to fill the sequence; see https://github.com/huggingface/trl/blob/314e8eb367cbfaf74c2e9717085346360e779508/trl/trainer/sft_trainer.py#L110. For wandb logging, to reproduce the error, just change report_to from None to wandb and you will get the error. But this issue is minor, as we can overcome it by using wandb.init and wandb.log inside the code itself.
Oke, so it's SFTTrainer; then I'll see what is different there. For packing, we can calculate loss with packing but not generate, since generation tries to continue the next several tokens of a sequence, and a packed sequence may contain more than one example. In general, we had an idea to try out
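To illustrate the packing point, a rough sketch (assuming an already-loaded causal LM and its tokenizer; this is not code from the PR):

# Two QA examples packed into a single row, separated by EOS:
packed = "Q: Who wrote Hamlet? A: Shakespeare." + tokenizer.eos_token + "Q: Capital of France? A:"
ids = tokenizer(packed, return_tensors="pt").input_ids

# Loss is fine: every position still has a well-defined next-token label.
# Generation is not: generate() only appends tokens after the last position,
# i.e. it can only answer the second question -- the first one is unreachable.
continuation = model.generate(ids, max_new_tokens=16)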
Thanks for this PR @zucchini-nlp, hoping it gets merged soon. I am using a similar thing internally to train decoder-only models for information extraction. I saw a concern that this is slower than the traditional SFTTrainer, which is something I experienced as well. My belief is that this is mainly because in SFTTrainer, during prediction, the (n+1)-th token is predicted using the previous n tokens that come from the prompt + ground truth, while in this case the previous n tokens come from the prompt + predictions. So it cannot be parallelized the same way as in SFTTrainer, where you can literally predict the n+1, n+2, ... tokens in parallel. My belief comes from the fact that I saw a drop in eval performance and an increase in time when using predict_with_generate, compared to using SFTTrainer as-is.
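In code, the difference described above looks roughly like this (a sketch, assuming a loaded causal LM and a batch produced by a collator like the one earlier in the thread, where generation_input_ids holds the prompt-only, left-padded inputs):

import torch

# Teacher-forced evaluation (regular Trainer/SFTTrainer prediction step):
# one forward pass over prompt + ground truth scores every position in parallel.
with torch.no_grad():
    logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]).logits
    teacher_forced_preds = logits.argmax(dim=-1)

# predict_with_generate: autoregressive decoding from the prompt only,
# each new token conditioned on the model's own previous predictions,
# so the steps cannot be parallelized across the sequence length.
generated = model.generate(
    input_ids=batch["generation_input_ids"],
    attention_mask=batch["generation_attention_mask"],
    max_new_tokens=64,
)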
src/transformers/trainer.py (outdated)

    for k, v in generation_inputs.items()
    if k.replace("generation_", "") not in gen_keys and "generation" not in k
}
generated_tokens = self.model.generate(
Shall we use model instead of self.model here? In evaluation_loop(), self.model is wrapped, and the wrapped model may not always be the same as self.model. I think this is for the case when DeepSpeed ZeRO-3 is enabled and eval_on_start is set to true.
For inference we don't wrap for distributed mode, but I changed it to model because there are some other steps run before returning the model. The original code was adapted from the seq2seq trainer, so I modified it there too:

transformers/src/transformers/trainer.py, lines 1762 to 1765 in c409cd8

    # Note: in torch.distributed mode, there's no point in wrapping the model
    # inside a DistributedDataParallel as we'll be under `no_grad` anyways.
    if not training:
        return model
@shubhamjain0594 Yes, that's exactly what I meant: generation is expected to take more time than a simple forward pass. As per the last comment from @salrowili, I compared SFT and the HF Trainer, with and without generation. I don't see any slowdown caused by the HF Trainer specifically, as both of them rely on the same code to do training and evaluation. The only diff is that SFT supports packed datasets while the HF Trainer doesn't. The current implementation of …
I think we can let users use a small sample of the eval set for generation, if they don't want to slow down the evaluation loop. Applying generation optimization tricks here might not be the optimal solution, as we are trying to verify how well the model is learning. The only technique I can think of that could be used is torch compile, but it is still a very new feature and I would rather not integrate it into the HF Trainer yet. As for WandB, it worked for me in both SFT and the HF Trainer; the generation config is logged as a dict in the parameters.
So, I think I can request review from @muellerzr now. The PR isn't very high priority, so feel free to take a look whenever you have bandwidth.
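Restricting generation-based evaluation to a slice of the eval set is a one-liner (a sketch, reusing the trainer and eval_dataset names from the example earlier in the thread):

# Run the (slow) generate-and-score loop on a small subset only.
subset_metrics = trainer.evaluate(eval_dataset=eval_dataset.select(range(128)))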
Thanks for this PR @zucchini-nlp! This PR looks very good. I left a few comments! In order to simplify things, I'm thinking that it would be easier to just add a generation_kwargs, WDYT @muellerzr @zucchini-nlp?
synced_gpus = gen_kwargs.get("synced_gpus", default_synced_gpus)
if len(gen_kwargs) > 0:
    unused_kwargs = gen_config.update(**gen_kwargs)
    if unused_kwargs:
        logger.warning_once(
            "Following generation related kwargs were passed to `prediction_step` but not "
I think that if you pass synced_gpus in gen_kwargs, the warning will appear since it will be in unused_kwargs. Maybe do pop instead. Also, this will trigger the warning in other places too.
test_dataset: Dataset,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "test",
**gen_kwargs,
You also need to add it for prediction_step.
src/transformers/trainer.py (outdated)

# Therefore, generation_config should be available
self.gen_config = self.model.generation_config
Shouldn't we also update the config with gen_kwargs?
You mean add kwargs from model.config to the generation config? It shouldn't be necessary, because the base model.generation_config should contain all generation-related kwargs after the model is loaded. So we just need to make sure user-passed kwargs have higher priority than trainer.generation_config.
I'm talking about the gen_kwargs that you are passing in predict. I would expect self.gen_config to be updated when the user passes gen_kwargs to the predict function in all cases (important in the case where we pass a generate kwarg such as synced_gpus). By default, it is equal to self.model.generation_config, but if the user passes it in TrainingArguments, it will be equal to self.args.generation_config.
Ah I see now, right, we should be updating it in any case.
src/transformers/trainer.py (outdated)

# Set generation-related kwargs
if self.args.predict_with_generate:
    if self.args.generation_config is not None:
        gen_config = self.args.generation_config
        self.gen_config = copy.deepcopy(gen_config)  # copy so we don't modify args.gen_config in-place
        unused_kwargs = self.gen_config.update(**gen_kwargs)
        if unused_kwargs:
            logger.warning_once(
                f"Following generation related kwargs were passed to `evaluate` but not used by `generate()`: "
                f"{' '.join(unused_kwargs.keys())} .",
                "Make sure there are no typos in the passed kwargs or do not pass unused kwargs.",
            )
    else:
        # We assume the model can generate if predict-with-generate is True
        # Therefore, generation_config should be available
        self.gen_config = self.model.generation_config
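The change being discussed would presumably look roughly like this (a sketch only, reusing the names from the quoted diff): apply the user-passed gen_kwargs regardless of which config the trainer starts from.

import copy

# Start from args.generation_config if given, otherwise from the model's own config,
# then let user-passed gen_kwargs override either of them.
if self.args.generation_config is not None:
    self.gen_config = copy.deepcopy(self.args.generation_config)
else:
    self.gen_config = copy.deepcopy(self.model.generation_config)
if gen_kwargs:
    unused_kwargs = self.gen_config.update(**gen_kwargs)
    if unused_kwargs:
        logger.warning_once(f"Unused generation kwargs: {', '.join(unused_kwargs)}")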
same comment here
src/transformers/training_args.py (outdated)

generation_config: Optional[GenerationConfig] = field(
    default=None,
    metadata={
        "help": (
            "The GenerationConfig that will be used during prediction. Args from this config ",
            "will have higher priority than model's generation config. Anything not set by this config ",
            "will fallback to `model.generation_config`.",
        )
    },
)
I think we simplify things a bit if we also add a generation_kwargs, as this is incompatible with generation_config + I don't think we want to merge both arguments into one. WDYT @muellerzr?
Hmm, maybe we can then allow users to pass generation_config as a dict as well; then we can make a config object of it ourselves. I see that the TrainerSeq2Seq args also use a config arg, so I thought we could later merge the seq2seq args with the trainer args.
It would be better, I think! This way, we won't need to have **gen_kwargs in the evaluate and predict functions. cc @muellerzr @gante
Oke, now we can accept a dict or a config object in the training args.
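A sketch of what accepting both forms could look like (assumed normalization step, not necessarily the PR's exact code):

from transformers import GenerationConfig

# Normalize the training-args value so downstream code always sees a GenerationConfig,
# whether the user passed a plain dict of generation kwargs or a config object.
gen_config = self.args.generation_config
if isinstance(gen_config, dict):
    gen_config = GenerationConfig(**gen_config)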
Hi @zucchini-nlp! Thank you for adding this PR. I have been testing it and I have a few questions/thoughts:
Co-authored-by: Marc Sun <[email protected]>
@qiuosier Yes, in SFT one can pack the train dataset and not pack the evaluation set. I am not 100% sure it works with SFT out of the box, since AFAIR SFT doesn't accept the
Oh, right, a typo hehe
Hi @zucchini-nlp, the SFTConfig(
@qiuosier Oke, cool, then it might work out of the box. Didn't really test it yet.
Any chance this can be merged?
@zucchini-nlp I've been testing your branch and there were a couple of issues that I fixed with regard to using it with SFTTrainer:
These can be landed after the transformers PR lands and a new version is released. I've tried to help you re-merge main into this branch, but when I made a PR it was not what I thought it'd be: zucchini-nlp#1. There weren't many changes though, just a couple of conflicts. Let me know if there's anything I can do to help.
Hopefully this gets merged soon.
So am I correct to assume that all existing collators, e.g. …, won't work with this? If so, it's less than ideal.
What does this PR do?
Fixes #26474, fixes #31462, fixes #33396 and fixes #31672. This PR adds the possibility to generate and compute metrics on the generated text for decoder-only models.
The basic idea is almost the same as in the Seq2Seq Trainer, but decoder-only models need a prompt-only input for generation, while for loss computation we need the whole input. Therefore we can ask users to prepare the train and eval datasets so that the eval dataset contains generation_inputs used for generation. Additionally, to make users' lives easier, I added the possibility to pass in different collators for the train and the eval/test datasets.
The args used for generation should be set via GenerationConfig, as IMO that makes more sense than adding only max_length and num_beams as in Seq2SeqTrainer.
The code was tested with the dummy train script below.