
Inference problems after fine-tuning Gemma2 #1394

Open
trnq-eu opened this issue Dec 7, 2024 · 3 comments


trnq-eu commented Dec 7, 2024

Hi.
I've been fine-tuning Gemma-2-2B-it on Google Colab and saved the fine-tuned model to the Hugging Face Hub.
When I load the model back from the Hub, I keep getting inference errors.
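For context, I load the fine-tuned model from the Hub roughly like this (the repo id below is a placeholder standing in for my actual repository):

```python
from unsloth import FastLanguageModel

# Placeholder repo id for my fine-tuned Gemma-2-2B-it adapter on the Hub;
# max_seq_length mirrors the training setup, load_in_4bit is an assumption
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my-username/gemma-2-2b-it-finetuned",
    max_seq_length = 2048,
    load_in_4bit = True,
)
```

Then the inference code: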

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)
prompt = "Instruction:\n{instruction}\n\nResponse:\n{response}"

inputs = tokenizer([
    prompt.format(
        instruction="Scrivi una frase sul tema della giustizia nello stile della rivista illuminista Il Caffé",
        response="")], return_tensors="pt").to('cuda')

outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
print(tokenizer.batch_decode(outputs)[0])
```

Error:
```
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2549         # remove once script supports set_grad_enabled
   2550         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2551     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2552
   2553

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
```

If I move the model to cuda with:

```python
model = model.to('cuda')
```

I get:

```
7 frames
/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py in _prepare_4d_causal_attention_mask_with_cache_position(attention_mask, sequence_length, target_length, dtype, device, cache_position, batch_size, **kwargs)
    948             causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
    949             mask_length = attention_mask.shape[-1]
--> 950             padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
    951             padding_mask = padding_mask == 0
    952             causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(

RuntimeError: The size of tensor a (25) must match the size of tensor b (26) at non-singleton dimension 3
```

Thanks


trnq-eu commented Dec 7, 2024

I seem to have solved it with `use_cache=False`, but I don't understand why:

```python
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = False)
```
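For completeness, the full inference sequence that works now (same prompt as above, only `use_cache` changed):

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)
prompt = "Instruction:\n{instruction}\n\nResponse:\n{response}"

inputs = tokenizer([
    prompt.format(
        instruction="Scrivi una frase sul tema della giustizia nello stile della rivista illuminista Il Caffé",
        response="")], return_tensors="pt").to('cuda')

# Turning off the generation cache avoids the attention-mask size mismatch above
outputs = model.generate(**inputs, max_new_tokens=256, use_cache=False)
print(tokenizer.batch_decode(outputs)[0])
```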

Erland366 (Contributor) commented

I can't seem to reproduce it. Did you use specific settings when training your model?


trnq-eu commented Dec 8, 2024

These are the settings I used:

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],  # use the correct dataset
    dataset_text_field="text",       # text field of the dataset
    max_seq_length=2048,
    # max_seq_length=128,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)
```
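For reference, `model` and `tokenizer` were created earlier in the notebook with the usual Unsloth setup, roughly like the sketch below (the LoRA values are the standard notebook defaults, not necessarily exactly what I used):

```python
from unsloth import FastLanguageModel

# Standard Unsloth Gemma-2 setup; values below are notebook defaults, shown for context
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-2b-it",
    max_seq_length = 2048,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
```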
