
Inference problems after fine-tuning Gemma2 #1394

Open
trnq-eu opened this issue Dec 7, 2024 · 3 comments


trnq-eu commented Dec 7, 2024

Hi.
I've been fine-tuning Gemma-2-2B-it on Google Colab and saved the fine-tuned model to the Hugging Face Hub.
When I load the model back from the Hub, I keep getting inference errors.
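For context, I load the fine-tuned model from the Hub roughly like this (the repo id below is a placeholder standing in for my actual repository):

```python
from unsloth import FastLanguageModel

# Placeholder repo id for my fine-tuned Gemma-2-2B-it adapter on the Hub;
# max_seq_length mirrors the training setup, load_in_4bit is an assumption
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my-username/gemma-2-2b-it-finetuned",
    max_seq_length = 2048,
    load_in_4bit = True,
)
```

Then the inference code: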

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)
prompt = "Instruction:\n{instruction}\n\nResponse:\n{response}"

inputs = tokenizer([
    prompt.format(
        instruction="Scrivi una frase sul tema della giustizia nello stile della rivista illuminista Il Caffé",
        response="")], return_tensors="pt").to('cuda')

outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
print(tokenizer.batch_decode(outputs)[0])
```

Error:
```
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2549         # remove once script supports set_grad_enabled
   2550         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2551     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2552
   2553

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
```

If I move the model to cuda with:

```python
model = model.to('cuda')
```

I get:

```
7 frames
/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py in _prepare_4d_causal_attention_mask_with_cache_position(attention_mask, sequence_length, target_length, dtype, device, cache_position, batch_size, **kwargs)
    948             causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
    949             mask_length = attention_mask.shape[-1]
--> 950             padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
    951             padding_mask = padding_mask == 0
    952             causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(

RuntimeError: The size of tensor a (25) must match the size of tensor b (26) at non-singleton dimension 3
```

Thanks


trnq-eu commented Dec 7, 2024

I seem to have solved it with `use_cache=False`, but I don't understand why:

```python
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = False)
```
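For completeness, the full inference sequence that works now (same prompt as above, only `use_cache` changed):

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)
prompt = "Instruction:\n{instruction}\n\nResponse:\n{response}"

inputs = tokenizer([
    prompt.format(
        instruction="Scrivi una frase sul tema della giustizia nello stile della rivista illuminista Il Caffé",
        response="")], return_tensors="pt").to('cuda')

# Turning off the generation cache avoids the attention-mask size mismatch above
outputs = model.generate(**inputs, max_new_tokens=256, use_cache=False)
print(tokenizer.batch_decode(outputs)[0])
```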

Erland366 (Contributor) commented

I can't seem to reproduce it. Did you use specific settings when training your model?


trnq-eu commented Dec 8, 2024

These are the settings I used:

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],  # use the correct dataset
    dataset_text_field="text",       # text field of the dataset
    max_seq_length=2048,
    # max_seq_length=128,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)
```
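For reference, `model` and `tokenizer` were created earlier in the notebook with the usual Unsloth setup, roughly like the sketch below (the LoRA values are the standard notebook defaults, not necessarily exactly what I used):

```python
from unsloth import FastLanguageModel

# Standard Unsloth Gemma-2 setup; values below are notebook defaults, shown for context
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-2b-it",
    max_seq_length = 2048,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
```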
