
Question on calculating label loss, i.e., exclude instruction and input from loss calculation #1409

salokr opened this issue Dec 10, 2024 · 0 comments



salokr commented Dec 10, 2024

Hi,

Thank you for making this awesome library available.

I wanted to confirm whether Unsloth implicitly computes the label-only loss, i.e., masks out the instruction and the input when training a model such as LLaMA on a completion task.
Formally, I want to implement the following objective function:
$$\max_{\theta} \; \sum_{i} \sum_{j=1}^{|y_i|} \log p_{\theta}\!\left(y_{ij} \mid s_i, y_{i,<j}\right)$$
Here:

  • $s_i$ represents the source instance (instruction + input) from the dataset.
  • $y_{ij}$ is the $j$-th target (output) token of instance $i$.
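
In other words, a minimal sketch of what I mean (purely illustrative, not how Unsloth implements it), where the source tokens get the -100 label so only the target tokens enter the cross-entropy:

```python
import torch
import torch.nn.functional as F

def completion_only_loss(logits, input_ids, source_len):
    # logits: (seq_len, vocab_size) from the causal LM; input_ids: (seq_len,)
    # the first `source_len` tokens are s_i (instruction + input), the rest are y_ij
    labels = input_ids.clone()
    labels[:source_len] = -100  # mask the source so it never contributes to the loss
    # standard causal shift: position t predicts token t+1; ignore_index drops the -100 positions
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
```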

Here is my current code:

```python
import torch
from copy import deepcopy

from datasets import load_dataset
from transformers import (
    TrainingArguments,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
)
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template, train_on_responses_only

# prepare_lora_model, map_dataset_to_template, LLaMATrainer, load_dataset_metrics,
# instruction_response_mapper, args, and logger are defined elsewhere in my script.

model_name_or_path = "unsloth/llama-3-8b-Instruct-bnb-4bit"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info(f"Using device: {device}")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name_or_path,
    max_seq_length=args.max_seq_length,
    load_in_4bit=args.load_in_4bit,
    dtype=None,
    cache_dir="./original_models",
)
model = prepare_lora_model(args, model)

full_dataset = load_dataset(args.dataset_name)
train_dataset, val_dataset, test_dataset = (
    full_dataset["train_dataset"],
    full_dataset["valid_dataset"],
    full_dataset["test_dataset"],
)
train_dataset_copy, val_dataset_copy, test_dataset_copy = (
    deepcopy(train_dataset),
    deepcopy(val_dataset),
    deepcopy(test_dataset),
)
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

train_dataset = map_dataset_to_template(dataset=train_dataset, tokenizer=tokenizer)
val_dataset = map_dataset_to_template(dataset=val_dataset, tokenizer=tokenizer, dataset_type="validation")
test_dataset = map_dataset_to_template(dataset=test_dataset, tokenizer=tokenizer, dataset_type="test")

trainer = LLaMATrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    max_seq_length=args.max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=1,
    packing=False,
    args=TrainingArguments(
        run_name=f"saurabh_{args.dataset_name}_{model_name_or_path}",
        fp16_full_eval=True,
        eval_accumulation_steps=4,
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        learning_rate=args.learning_rate,
        lr_scheduler_type=args.lr_scheduler_type,
        num_train_epochs=10,  # use args or make dynamic based on early stopping
        evaluation_strategy="epoch",  # "steps" or "epoch"
        save_strategy="epoch",
        eval_steps=1,
        save_steps=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        output_dir=f"checkpoints/inst_{args.dataset_name}_rank_{args.lora_rank}_alpha_{args.lora_alpha}_lr_{args.learning_rate}",
        logging_steps=4,
        optim="adamw_8bit",
        report_to=["wandb"],  # or tensorboard
        logging_dir="logs",
        weight_decay=0.01,
        warmup_steps=350,
        seed=1337,
        load_best_model_at_end=True,
        save_total_limit=10,
        include_inputs_for_metrics=True,
        metric_for_best_model="eval_arg_cls_f1",
    ),
    compute_metrics=load_dataset_metrics(args.dataset_name),  # or compute_metrics_wrapper(val_dataset, tokenizer)
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=args.patience),
        # WandbCallback()  # add more callbacks if needed
    ],
)

trainer = train_on_responses_only(
    trainer,
    instruction_part=instruction_response_mapper["LLaMA3_1"]["instruction_part"],
    response_part=instruction_response_mapper["LLaMA3_1"]["response_part"],
)
trainer.train()
```
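
For reference, instruction_response_mapper is my own lookup table; for "LLaMA3_1" it just holds the Llama-3 chat-template header markers (the same ones the Unsloth notebooks pass to train_on_responses_only), roughly:

```python
instruction_response_mapper = {
    "LLaMA3_1": {
        "instruction_part": "<|start_header_id|>user<|end_header_id|>\n\n",
        "response_part": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    },
}
```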

Based on the documentation, I am assuming that train_on_responses_only takes care of this requirement, but I wanted to confirm that it will assign -100 to the instruction and input tokens so that only the logits corresponding to target tokens enter the loss calculation. In addition, how can we mask PAD tokens from the loss calculation?
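
For the padding part, my understanding is that DataCollatorForSeq2Seq pads the labels with its label_pad_token_id default of -100, so padded positions should already be excluded; this is the quick sanity check I ran (the tokenizer name is just the one I am already using):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tok = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-Instruct-bnb-4bit")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

collator = DataCollatorForSeq2Seq(tokenizer=tok)
features = [
    {"input_ids": [1, 2, 3, 4, 5], "labels": [1, 2, 3, 4, 5]},
    {"input_ids": [1, 2], "labels": [1, 2]},
]
batch = collator(features)
print(batch["labels"])
# the shorter row's labels are padded with -100, which CrossEntropyLoss ignores
```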

If not, how can one go about implementing it?
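
If it is not handled, this is the kind of map function I was planning to write myself, i.e., tokenize the prompt separately and overwrite its label positions with -100 (a rough sketch; prompt_text and target_text stand in for my own dataset fields):

```python
def mask_prompt_tokens(example, tokenizer, max_len=2048):
    # "prompt_text" = instruction + input (s_i), "target_text" = expected output (y_i)
    prompt_ids = tokenizer(example["prompt_text"], add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(
        example["prompt_text"] + example["target_text"], add_special_tokens=False
    )["input_ids"][:max_len]

    labels = list(full_ids)
    # set the prompt positions to -100 so only target tokens contribute to the loss
    # (assumes the prompt tokens form a prefix of the full tokenization)
    n_mask = min(len(prompt_ids), len(labels))
    labels[:n_mask] = [-100] * n_mask
    return {"input_ids": full_ids, "labels": labels}
```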
