feat: Added judgment logic to support training with plain text data. #281

Open: hill2hill wants to merge 1 commit into base: main
Conversation

@hill2hill (Contributor) commented Jun 18, 2024

The current logic assumes that all input data includes images, so data['pixel_values'] must exist for every training sample; however, for purely textual inputs, 'pixel_values' does not exist.

Here we simply need to process the dataset so that it is compatible with text-only input; at the same time, an additional Hugging Face merge is needed on the model side.
This addresses the following two issues, which I understand are essentially the same underlying problem:
#221 #250
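Roughly, the idea is that a text-only sample should simply carry empty image fields instead of being assumed to have pixel values. A minimal sketch of that judgment on the data side (the function and field names below are illustrative, not the actual diff):

```python
# Illustrative sketch only, not the exact code in this PR: a plain-text sample
# keeps the image-related keys but leaves them empty, so the model-side loop
# can skip it later.
def build_sample(raw, preprocess_image):
    sample = {"input_ids": raw["input_ids"]}
    if raw.get("image") is not None:
        # image sample: run the usual image preprocessing
        sample["pixel_values"], sample["tgt_sizes"] = preprocess_image(raw["image"])
    else:
        # text-only sample: keep the keys, leave them empty
        sample["pixel_values"], sample["tgt_sizes"] = [], []
    return sample
```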

@Cuiunbo

@univa-JASON

Thank you for your work. However, when image-text pair data and text-only data were included in the same batch, the following error occurred when running the code:
'''
Traceback (most recent call last):
File "/workspace/VLM/Mars/finetune/finetune.py", line 250, in
train()
File "/workspace/VLM/Mars/finetune/finetune.py", line 236, in train
trainer.train()
File "/opt/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
return inner_training_loop(
File "/opt/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/VLM/Mars/finetune/trainer.py", line 20, in compute_loss
vllm_embedding, vision_hidden_states = self.model.get_vllm_embedding(inputs)
File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_minicpmv.py", line 85, in get_vllm_embedding
tgt_sizes = torch.vstack(tgt_sizes).type(torch.int32)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 0 but got size 2 for tensor number 1 in the list.
'''

@hill2hill (Contributor, Author) commented Jun 19, 2024

Here's the situation: whenever someone updates the code from GitHub, this error inevitably occurs, because text-only data has no corresponding tgt_sizes and cannot go through the image-feature extraction. That part is defined inside the Hugging Face model code, not in this repository, so we need to add an extra precondition, as I mentioned here: we should add two lines there.
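The two lines in question are an early skip for samples that carry no image, inside the pixel-value gathering loop of get_vllm_embedding (the surrounding context is pasted in full later in this thread):

```python
for pixel_values in pixel_values_list:
    if len(pixel_values) == 0:  # text-only sample: no image features to extract
        continue
    img_cnt.append(len(pixel_values))
    all_pixel_values.extend([i.flatten(end_dim=1).permute(1, 0) for i in pixel_values])
```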

As the Hugging Face merge has not been accepted by the official repo yet, we can only modify the code locally.

@univa-JASON

Thanks for your fast reply, but I got the same error.
Here is my local compute_loss code; maybe it is an old version.

def compute_loss(self, model, inputs, return_outputs=False):
        if "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None

        vllm_embedding, vision_hidden_states = self.model.get_vllm_embedding(inputs)
        outputs = self.model.llm(
                inputs_embeds=vllm_embedding,
                use_cache=False,
            )
        
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            logits = outputs.logits.view(-1, self.model.config.vocab_size).contiguous()
            labels = labels.view(-1).long().contiguous()
            labels = labels.to(logits.device)
            loss = loss_fct(logits, labels)
        else:
            if isinstance(outputs, dict) and "loss" not in outputs:
                raise ValueError(
                    "The model did not return a loss from the inputs, only the following keys: "
                    f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
                )
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss

@hill2hill (Contributor, Author) commented Jun 19, 2024

So sorry, I'm actually not familiar with the previous version of the code; maybe you can try the current version.
But your compute_loss function looks fine; the error should only happen inside self.model.get_vllm_embedding(inputs).

Oh, I just noticed that you load the model from the cache? Maybe git clone the model first and then use your local model_path; it will be easier to modify the code and debug.
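For example, something like the following (a minimal sketch; the local path is a placeholder for wherever you cloned the model repo):

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder path: point this at your local clone of the model repo,
# so edits to modeling_minicpmv.py in that folder actually take effect.
model_path = "/path/to/local/MiniCPM-V"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```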

@univa-JASON

Oh, that's OK. Thank you so much for your help.

@JamesZhutheThird

Hey guys, are there any updates for this error?

However, when image-text pair data and text-only data were included in the same batch, the following error occurred when running the code.
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 2 but got size 0 for tensor number 2 in the list.

I modified the code in datasets.py and modeling_minicpmv.py, but the problem still exists. Is there any solution besides setting the batch size to 1? Your contribution is much appreciated.

@hill2hill (Contributor, Author)

Sorry for the late reply. Have you fixed it now?
With my old code version at that time, training worked fine with batch_size > 1. Maybe I should check it again later...

@univa-JASON

    def get_vllm_embedding(self, data):
        if 'vision_hidden_states' not in data:
            dtype = self.vpm.embeddings.position_embedding.weight.dtype
            device = self.vpm.embeddings.position_embedding.weight.device
            tgt_sizes = data['tgt_sizes']
            pixel_values_list = data['pixel_values']
            vision_hidden_states = []
            all_pixel_values = []
            img_cnt = []
            for pixel_values in pixel_values_list:
                img_cnt.append(len(pixel_values))
                all_pixel_values.extend([i.flatten(end_dim=1).permute(1, 0) for i in pixel_values])

            # exist image
            if all_pixel_values:
                tgt_sizes = torch.vstack(tgt_sizes).type(torch.int32)

                if self.config.batch_vision_input:
                    max_patches = torch.max(tgt_sizes[:, 0] * tgt_sizes[:, 1])

                    all_pixel_values = torch.nn.utils.rnn.pad_sequence(all_pixel_values, batch_first=True,
                                                                       padding_value=0.0)
                    B, L, _ = all_pixel_values.shape
                    all_pixel_values = all_pixel_values.permute(0, 2, 1).reshape(B, 3, -1, L)

                    patch_attn_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool, device=device)
                    for i in range(B):
                        patch_attn_mask[i, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True

                    vision_embedding = self.vpm(all_pixel_values.type(dtype), patch_attention_mask=patch_attn_mask).last_hidden_state
                    vision_embedding = self.resampler(vision_embedding, tgt_sizes)
                else:
                    # get vision_embedding foreach
                    vision_embedding = []
                    for single_tgt_size, single_pixel_values in zip(tgt_sizes, all_pixel_values):
                        single_pixel_values = single_pixel_values.unsqueeze(0)
                        B, L, _ = single_pixel_values.shape
                        single_pixel_values = single_pixel_values.permute(0, 2, 1).reshape(B, 3, -1, L)
                        single_vision_embedding = self.vpm(single_pixel_values.type(dtype)).last_hidden_state
                        single_vision_embedding = self.resampler(single_vision_embedding, single_tgt_size.unsqueeze(0))
                        vision_embedding.append(single_vision_embedding)
                    vision_embedding = torch.vstack(vision_embedding)

                start = 0
                for pixel_values in pixel_values_list:
                    img_cnt = len(pixel_values)
                    if img_cnt > 0:
                        vision_hidden_states.append(vision_embedding[start: start + img_cnt])
                        start += img_cnt
                    else:
                        vision_hidden_states.append([])
            else: # no image
                if self.training:
                    dummy_image = torch.zeros(
                        (1, 3, 224, 224),
                        device=device, dtype=dtype
                    )
                    tgt_sizes = torch.Tensor([[(224 // self.config.patch_size), math.ceil(224 / self.config.patch_size)]]).type(torch.int32)
                    dummy_feature = self.resampler(self.vpm(dummy_image).last_hidden_state, tgt_sizes)
                else:
                    dummy_feature = []
                for _ in range(len(pixel_values_list)):
                    vision_hidden_states.append(dummy_feature)

        else:
            vision_hidden_states = data['vision_hidden_states']

        if hasattr(self.llm.config, 'scale_emb'):
            vllm_embedding = self.llm.model.embed_tokens(data['input_ids']) * self.llm.config.scale_emb
        else:
            vllm_embedding = self.llm.model.embed_tokens(data['input_ids'])

        vision_hidden_states = [i.type(vllm_embedding.dtype) if isinstance(
            i, torch.Tensor) else i for i in vision_hidden_states]

        bs = len(data['input_ids'])
        for i in range(bs):
            cur_vs_hs = vision_hidden_states[i]
            if len(cur_vs_hs) > 0:
                cur_vllm_emb = vllm_embedding[i]
                cur_image_bound = data['image_bound'][i]
                if len(cur_image_bound) > 0:
                    image_indices = torch.stack(
                        [torch.arange(r[0], r[1], dtype=torch.long) for r in cur_image_bound]
                    ).to(vllm_embedding.device)

                    cur_vllm_emb.scatter_(0, image_indices.view(-1, 1).repeat(1, cur_vllm_emb.shape[-1]),
                                          cur_vs_hs.view(-1, cur_vs_hs.shape[-1]))
                elif self.training:
                    cur_vllm_emb += cur_vs_hs[0].mean() * 0

        return vllm_embedding, vision_hidden_states

In my code, I think that because at least one sample in the batch has an image, `if all_pixel_values:` is True, so the other, text-only sample in the batch hits that error. I can't solve it.

@univa-JASON

With 1 text-image pair and 1 text-only sample in my test, it seems tgt_sizes = torch.tensor([]) for the text-only data, but the error occurred because the data still entered the `if all_pixel_values:` branch.
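If it helps, here is a minimal repro of that failure mode, assuming (as described above) that the text-only sample contributes an empty tgt_sizes tensor while the image sample contributes a normal 2-D one (the sizes below are made up for illustration):

```python
import torch

# One image sample with two slices (shape (2, 2)) and one text-only sample
# with an empty tgt_sizes entry (shape (0,), treated as (1, 0) by vstack).
tgt_sizes = [torch.tensor([[32, 32], [16, 48]]), torch.tensor([])]

torch.vstack(tgt_sizes)
# RuntimeError: Sizes of tensors must match except in dimension 0.
# Expected size 2 but got size 0 for tensor number 1 in the list.
```

Skipping the image-less sample before this vstack (as in the two-line guard mentioned earlier in the thread) keeps the empty tensor out of the stack.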

@hill2hill (Contributor, Author) commented Jul 19, 2024

Hello, I noticed something. Your code in modeling_minicpmv.py is missing some updates from my Hugging Face merge. It should look like this:

if 'vision_hidden_states' not in data:
    dtype = self.vpm.embeddings.position_embedding.weight.dtype
    device = self.vpm.embeddings.position_embedding.weight.device
    tgt_sizes = data['tgt_sizes']
    pixel_values_list = data['pixel_values']
    vision_hidden_states = []
    all_pixel_values = []
    img_cnt = []
    for pixel_values in pixel_values_list:
        if len(pixel_values) == 0:
            continue
        img_cnt.append(len(pixel_values))
        all_pixel_values.extend([i.flatten(end_dim=1).permute(1, 0) for i in pixel_values])

You might be concerned that the subsequent logic becomes incoherent, but in practice that won't be the case; the rest of modeling_minicpmv.py is compatible with this change.

I'd suggest loading the model from a local path (not from Hugging Face; just clone it to your disk), so that we can observe and debug it easily.

@univa-JASON

Thanks a lot for your feedback! I modified the code, but sadly I got the same error.

@colorfulandcjy0806 left a comment

This solves my problem, thanks~
