
Question on calculating label loss, i.e., exclude instruction and input from loss calculation #1409

salokr opened this issue Dec 10, 2024 · 0 comments



salokr commented Dec 10, 2024

Hi,

Thank you for making this awesome library available.

I wanted to confirm whether Unsloth implicitly computes the label-only loss, i.e., masks out the instruction and the input when training a model such as LLaMA on a completion task.
Formally, I want to implement the following objective function:
$$\max_{\theta} \; \sum_{i} \sum_{j=1}^{|y_i|} \log p_{\theta}\!\left(y_{ij} \mid s_i, y_{i,<j}\right)$$
Here:

  • $s_i$ represents the source instance (instruction + input) from the dataset.
  • $y_{ij}$ is the $j$-th target (output) token of instance $i$.
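
In other words, a minimal sketch of what I mean (purely illustrative, not how Unsloth implements it), where the source tokens get the -100 label so only the target tokens enter the cross-entropy:

```python
import torch
import torch.nn.functional as F

def completion_only_loss(logits, input_ids, source_len):
    # logits: (seq_len, vocab_size) from the causal LM; input_ids: (seq_len,)
    # the first `source_len` tokens are s_i (instruction + input), the rest are y_ij
    labels = input_ids.clone()
    labels[:source_len] = -100  # mask the source so it never contributes to the loss
    # standard causal shift: position t predicts token t+1; ignore_index drops the -100 positions
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
```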

Here is my current code:

```python
import torch
from copy import deepcopy

from datasets import load_dataset
from transformers import (
    TrainingArguments,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
)
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template, train_on_responses_only

# prepare_lora_model, map_dataset_to_template, LLaMATrainer, load_dataset_metrics,
# instruction_response_mapper, args, and logger are defined elsewhere in my script.

model_name_or_path = "unsloth/llama-3-8b-Instruct-bnb-4bit"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info(f"Using device: {device}")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name_or_path,
    max_seq_length=args.max_seq_length,
    load_in_4bit=args.load_in_4bit,
    dtype=None,
    cache_dir="./original_models",
)
model = prepare_lora_model(args, model)

full_dataset = load_dataset(args.dataset_name)
train_dataset, val_dataset, test_dataset = (
    full_dataset["train_dataset"],
    full_dataset["valid_dataset"],
    full_dataset["test_dataset"],
)
train_dataset_copy, val_dataset_copy, test_dataset_copy = (
    deepcopy(train_dataset),
    deepcopy(val_dataset),
    deepcopy(test_dataset),
)
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

train_dataset = map_dataset_to_template(dataset=train_dataset, tokenizer=tokenizer)
val_dataset = map_dataset_to_template(dataset=val_dataset, tokenizer=tokenizer, dataset_type="validation")
test_dataset = map_dataset_to_template(dataset=test_dataset, tokenizer=tokenizer, dataset_type="test")

trainer = LLaMATrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    max_seq_length=args.max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=1,
    packing=False,
    args=TrainingArguments(
        run_name=f"saurabh_{args.dataset_name}_{model_name_or_path}",
        fp16_full_eval=True,
        eval_accumulation_steps=4,
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        learning_rate=args.learning_rate,
        lr_scheduler_type=args.lr_scheduler_type,
        num_train_epochs=10,  # use args or make dynamic based on early stopping
        evaluation_strategy="epoch",  # "steps" or "epoch"
        save_strategy="epoch",
        eval_steps=1,
        save_steps=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        output_dir=f"checkpoints/inst_{args.dataset_name}_rank_{args.lora_rank}_alpha_{args.lora_alpha}_lr_{args.learning_rate}",
        logging_steps=4,
        optim="adamw_8bit",
        report_to=["wandb"],  # or tensorboard
        logging_dir="logs",
        weight_decay=0.01,
        warmup_steps=350,
        seed=1337,
        load_best_model_at_end=True,
        save_total_limit=10,
        include_inputs_for_metrics=True,
        metric_for_best_model="eval_arg_cls_f1",
    ),
    compute_metrics=load_dataset_metrics(args.dataset_name),  # or compute_metrics_wrapper(val_dataset, tokenizer)
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=args.patience),
        # WandbCallback()  # add more callbacks if needed
    ],
)

trainer = train_on_responses_only(
    trainer,
    instruction_part=instruction_response_mapper["LLaMA3_1"]["instruction_part"],
    response_part=instruction_response_mapper["LLaMA3_1"]["response_part"],
)
trainer.train()
```
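
For reference, instruction_response_mapper is my own lookup table; for "LLaMA3_1" it just holds the Llama-3 chat-template header markers (the same ones the Unsloth notebooks pass to train_on_responses_only), roughly:

```python
instruction_response_mapper = {
    "LLaMA3_1": {
        "instruction_part": "<|start_header_id|>user<|end_header_id|>\n\n",
        "response_part": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    },
}
```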

Based on the documentation, I am assuming that train_on_responses_only takes care of this requirement, but I wanted to confirm that it will assign -100 to the instruction and input tokens so that only the logits corresponding to target tokens enter the loss calculation. In addition, how can we mask PAD tokens from the loss calculation?
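
For the padding part, my understanding is that DataCollatorForSeq2Seq pads the labels with its label_pad_token_id default of -100, so padded positions should already be excluded; this is the quick sanity check I ran (the tokenizer name is just the one I am already using):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tok = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-Instruct-bnb-4bit")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

collator = DataCollatorForSeq2Seq(tokenizer=tok)
features = [
    {"input_ids": [1, 2, 3, 4, 5], "labels": [1, 2, 3, 4, 5]},
    {"input_ids": [1, 2], "labels": [1, 2]},
]
batch = collator(features)
print(batch["labels"])
# the shorter row's labels are padded with -100, which CrossEntropyLoss ignores
```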

If not, how can one go about implementing it?
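
If it is not handled, this is the kind of map function I was planning to write myself, i.e., tokenize the prompt separately and overwrite its label positions with -100 (a rough sketch; prompt_text and target_text stand in for my own dataset fields):

```python
def mask_prompt_tokens(example, tokenizer, max_len=2048):
    # "prompt_text" = instruction + input (s_i), "target_text" = expected output (y_i)
    prompt_ids = tokenizer(example["prompt_text"], add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(
        example["prompt_text"] + example["target_text"], add_special_tokens=False
    )["input_ids"][:max_len]

    labels = list(full_ids)
    # set the prompt positions to -100 so only target tokens contribute to the loss
    # (assumes the prompt tokens form a prefix of the full tokenization)
    n_mask = min(len(prompt_ids), len(labels))
    labels[:n_mask] = [-100] * n_mask
    return {"input_ids": full_ids, "labels": labels}
```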
