Replies: 2 comments
-
The inputs should be exactly the same as before. The only thing that needs to change is the mask for the loss, which ignores the prefix and only computes the loss on the answer. By the way, you can compute that mask without a loop:

```python
token_indices = mx.arange(mask_width)[None, :]
mask = mx.logical_and(token_indices >= input_lengths[:, None], token_indices < lengths[:, None])
```
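As a sanity check of that broadcasting trick, here is a small sketch using NumPy in place of `mx` (the semantics of `arange`, comparison broadcasting, and `logical_and` are the same); the values of `input_lengths`, `lengths`, and `mask_width` are made-up stand-ins for what the trainer would supply:

```python
import numpy as np

# Stand-ins for the batch values: two sequences whose prompts are
# 3 and 2 tokens long, with total (unpadded) lengths 5 and 4.
input_lengths = np.array([3, 2])
lengths = np.array([5, 4])
mask_width = 6  # padded batch width

# Same vectorized construction as with mx: broadcast a row of token
# positions against per-sequence column vectors of lengths.
token_indices = np.arange(mask_width)[None, :]
mask = np.logical_and(token_indices >= input_lengths[:, None],
                      token_indices < lengths[:, None])

# Row 0 keeps positions 3 and 4; row 1 keeps positions 2 and 3.
print(mask.astype(int))
# [[0 0 0 1 1 0]
#  [0 0 1 1 0 0]]
```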
-
Thank you
-
I'm trying to implement the equivalent of HF's ability to train on completions only using MLX. Looking at the default implementation of iterate_batches and default loss in mlx_lm.tuner.trainer, it looks as if the tokens are being set to zero for the padding suffix used to ensure each sequence of tokens in the batch is of the same maximal length. Then, in default_loss, a boolean mask is used to avoid penalizing the model for not generating the padding.
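To make the padding-mask idea concrete, here is a hedged sketch in NumPy (standing in for mx): `ce` represents hypothetical per-token cross-entropy values and `lengths` the real (unpadded) sequence lengths; neither is mlx_lm's actual variable naming. Padded positions are simply excluded from the average:

```python
import numpy as np

# Hypothetical per-token cross-entropy for a batch of 2 sequences,
# padded to width 4; real lengths are 3 and 2.
ce = np.array([[0.5, 0.2, 0.1, 9.9],
               [0.4, 0.3, 9.9, 9.9]])  # 9.9 marks padding positions
lengths = np.array([3, 2])

# Boolean mask: True at real-token positions, False at padding.
mask = np.arange(ce.shape[1])[None, :] < lengths[:, None]

# Average the loss over real tokens only, so the model is never
# penalized for what it produces at padded positions.
loss = (ce * mask).sum() / mask.sum()
print(float(loss))  # 0.3
```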
In my attempt to generalize from this approach, I'm using the module below (which uses #391 to pass a custom loss and batching function) to test this on the SQL generation training dataset included with mlx-lm.
It assumes, with the following training text as an example:
that the input is
and the output is
The custom iterate_batches function calculates the length of the tokenized 'input' for each record in the batch, fills in zeros for the tokens up to the length of the input as well as the padding suffix (leaving only the completion ids with non-zero tokens), and passes a list of the input lengths along with the batch and the full lengths to the custom loss function. The custom loss function then calculates a mask for ignoring the inputs and the padding suffix.
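A minimal sketch of the batching scheme just described, using NumPy in place of mx; the function name `make_batch`, the use of token id 0 for both the zeroed prompt and the padding, and the toy token lists are all assumptions for illustration, not mlx_lm's actual API:

```python
import numpy as np

def make_batch(input_token_lists, full_token_lists):
    """Zero out prompt tokens, pad to a common width, and return the
    batch plus the per-sequence input lengths and full lengths."""
    input_lengths = np.array([len(t) for t in input_token_lists])
    lengths = np.array([len(t) for t in full_token_lists])
    batch = np.zeros((len(full_token_lists), lengths.max()), dtype=np.int64)
    for i, toks in enumerate(full_token_lists):
        # Keep only the completion ids; the prompt prefix and the
        # padding suffix both stay zero, as described above.
        batch[i, input_lengths[i]:lengths[i]] = toks[input_lengths[i]:lengths[i]]
    return batch, input_lengths, lengths

# Toy example: prompts of lengths 2 and 1, full sequences of 4 and 3.
batch, in_lens, lens = make_batch([[7, 8], [5]], [[7, 8, 3, 4], [5, 6, 9]])
print(batch)
# [[0 0 3 4]
#  [0 6 9 0]]
```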
However, when I run this, I'm getting NaN error values:
But when I change the custom batching function to fill in the actual values of the tokenized input (rather than using zeros, as is the case for the suffix), i.e., from
to
Then I get proper loss values:
Is there a reason why using zeros for the front of the token sequence (the part that will be masked out, the same as the suffix padding) would cause NaN loss values? Note that the custom batching function is not performing the token shift (described here and implemented in mlx_lm's default iterate_batches method), and I'm not sure whether that is related to the cause of this issue, but changing the loss method to the following to perform the same shift did not address it:
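For reference, the next-token shift in question can be sketched like this (NumPy stand-in for mx; the names `batch`, `mask`, and the toy values are assumptions): position t is used to predict token t + 1, so the inputs drop the last column, the targets drop the first, and the loss mask must be shifted along with the targets:

```python
import numpy as np

batch = np.array([[10, 11, 12, 13, 0]])  # one padded sequence
mask = np.array([[False, False, True, True, False]])  # loss on answer only

# Standard next-token shift: the model sees batch[:, :-1] and is
# scored against batch[:, 1:].
inputs = batch[:, :-1]
targets = batch[:, 1:]
target_mask = mask[:, 1:]  # shift the mask the same way as the targets

print(inputs.tolist())       # [[10, 11, 12, 13]]
print(targets.tolist())      # [[11, 12, 13, 0]]
print(target_mask.tolist())  # [[False, True, True, False]]
```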
Any insight to help my understanding of this would be greatly appreciated. Thank you for such a great software framework!