Tokenization mismatch #75

I tried finetuning my model after stage 1. Apparently, there are tokenization mismatches and the loss is 0.
Do you have any ideas what might be the problem?
Thanks!

sh finetune_full.sh
I believe so. Here is my
Here is my pretrained
There was a bug where the HF Llama-3 tokenizer wouldn't prepend the BOS token. Please check whether your model weights are up-to-date.
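As a quick sanity check, you can verify whether your local tokenizer prepends the BOS token. This is a minimal sketch assuming transformers is installed; the model path is illustrative, point it at your local copy:

from transformers import AutoTokenizer

# Assumption: replace the name below with your local Meta-Llama-3-8B-Instruct path.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(tokenizer.bos_token, tokenizer.bos_token_id)   # expect <|begin_of_text|>, 128000
ids = tokenizer("A chat between a curious user and an assistant.").input_ids
print(ids[:3])   # with up-to-date tokenizer files, 128000 should be the first id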
Still didn't fix it. I have deleted the cached weights and checked that they are up to date.
We noticed that Llama-3 recently changed its tokenizer configuration.
In theory, can I also use this model? https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat
I received an error while training again.
Meta-Llama-3-8B-Instruct gives the same error.
Try to edit it here rather than editing the configuration of the base model.
I changed it to this, and training is OK.
merged sh:
I edited the config as you said. The training was fine, but during inference I got
for every sample. The output tokens look like this:
@Gary2018X It seems unrelated to Bunny. Please try searching on Google.
@swhoosh What about the loss curve?
I trained for only 20 steps just to test it out first. The loss seemed fine when I trained for full epochs yesterday, when I edited the model's config instead of Bunny's as you recommended. However, those runs still had the same problem. FYI, I was able to get the expected result from
Maybe there is a huge gap between medical images/knowledge and regular images/knowledge.
Well, Phi-2 did actually work during our testing, and I was able to get Llama-3 to work before the recent config update. Can you try reproducing the finetuning result on your end to ensure that the model is behaving correctly?
After making my setup consistent with this change, my problem was resolved and I was able to infer normally.
@Gary2018X Are you able to run inference? My inference still produces the same result as before.
I have checked my Llama-3 version and I am using the latest dev branch.
It may be related to your base model.
Although I can infer normally, the results are not as good as Qwen-1.8B yet.
@swhoosh @Gary2018X We would keep using
Closing the issue for now as there is no further discussion. Feel free to reopen it if there are any other questions.
@Isaachhh Why should we keep using
Actually, the performance using
@Isaachhh So, we just need to ignore tokens such as these? But the target values will all be set to IGNORE_INDEX here; in theory, this would result in no supervised information (loss = 0). Why is the performance better?
It shouldn't be any like
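For background on the IGNORE_INDEX question above: PyTorch's cross-entropy skips positions whose label equals ignore_index (-100 by Hugging Face convention), so masked positions simply contribute nothing to the loss rather than dragging it to zero. A toy illustration (made-up values, not Bunny code):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                    # 4 positions, 10-token vocabulary
labels = torch.tensor([-100, -100, -100, 7])   # only the last position is supervised

loss_all = F.cross_entropy(logits, labels, ignore_index=-100)
loss_last = F.cross_entropy(logits[3:], labels[3:])
print(loss_all.item(), loss_last.item())       # identical: masked positions add nothing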
Meta-Llama-3-8B-Instruct. Here is my finetuning script:

#!/bin/bash
MODEL_TYPE=llama3-8b
# params
MODEL_PATH=/data/models/meta-llama/Meta-Llama-3-8B-Instruct
CONV_VERSION=llama
DATA_PATH=/data/datasets/images/bunny-v1_1/finetune/bunny_allava_1.3m.json # recipe-1
BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=4
LORA_R=256
LORA_ALPHA=256
MM_PROJECTOR_LR=1e-5
LR=1e-4
MODEL_MAX_LENGTH=4096
PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-$MODEL_TYPE-recipe-1
mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR
deepspeed bunny/train/train.py \
--lora_enable True --lora_r $LORA_R --lora_alpha $LORA_ALPHA --mm_projector_lr $MM_PROJECTOR_LR \
--deepspeed ./script/deepspeed/zero3.json \
--model_name_or_path $MODEL_PATH \
--model_type $MODEL_TYPE \
--version $CONV_VERSION \
--data_path $DATA_PATH \
--image_folder /data/datasets/images/bunny-v1_1/finetune/images \
--vision_tower /data/models/google/siglip-so400m-patch14-384 \
--use_s2 True \
--pretrain_mm_mlp_adapter ./checkpoints-pretrain/$PRETRAIN_DIR/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--image_aspect_ratio pad \
--group_by_modality_length False \
--bf16 True \
--output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
--num_train_epochs 1 \
--per_device_train_batch_size $BATCH_SIZE \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate $LR \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length $MODEL_MAX_LENGTH \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
    --report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt
What's
I tested: training meta-llama/Meta-Llama-3-8B-Instruct is fine with the current code. You may check your code or the version of your LLM.
So, whether the two values are equal depends on
I suggest you print the input_ids and targets.
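One possible way to do that (a rough sketch, not Bunny's actual code; it assumes you already have the input_ids and targets tensors from preprocessing and the loaded tokenizer):

import torch

IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200  # the image placeholder visible in the dump below

def dump_example(tokenizer, input_ids: torch.Tensor, targets: torch.Tensor) -> None:
    # Print each token id, its decoded text, and its target label side by side,
    # so any masking or tokenization mismatch is easy to spot.
    for tok, tgt in zip(input_ids.view(-1).tolist(), targets.view(-1).tolist()):
        text = "<image>" if tok == IMAGE_TOKEN_INDEX else tokenizer.decode([tok])
        label = "IGNORE" if tgt == IGNORE_INDEX else tgt
        print(f"{tok:>7}  {text!r:<24}  {label}")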
OK, given a conversation example (I don't know why there is no BOS token):

["A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nDo you see any letters that are white?\nAnswer the question using a single word or phrase. ASSISTANT: No<|end_of_text|>USER: Is the tire large and black? ASSISTANT: Yes<|end_of_text|>"]

input_ids (init targets):
tensor([[ 32, 6369, 1990, 264, 22999, 1217, 323, 459, 21075,
11478, 18328, 13, 578, 18328, 6835, 11190, 11, 11944,
11, 323, 48887, 11503, 311, 279, 1217, 596, 4860,
13, 14194, 25, 220, -200, 198, 5519, 499, 1518,
904, 12197, 430, 527, 4251, 5380, 16533, 279, 3488,
1701, 264, 3254, 3492, 477, 17571, 13, 36660, 3931,
2891, 25, 2360, 128001, 6584, 25, 2209, 279, 28387,
3544, 323, 3776, 30, 36660, 3931, 2891, 25, 7566,
128001]])

ignore the first token:
tensor([[ -100, 6369, 1990, 264, 22999, 1217, 323, 459, 21075,
11478, 18328, 13, 578, 18328, 6835, 11190, 11, 11944,
11, 323, 48887, 11503, 311, 279, 1217, 596, 4860,
13, 14194, 25, 220, -200, 198, 5519, 499, 1518,
904, 12197, 430, 527, 4251, 5380, 16533, 279, 3488,
1701, 264, 3254, 3492, 477, 17571, 13, 36660, 3931,
2891, 25, 2360, 128001, 6584, 25, 2209, 279, 28387,
3544, 323, 3776, 30, 36660, 3931, 2891, 25, 7566,
128001]])

round 1:
tensor([ -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, 2360, 128001, 6584, 25, 2209, 279, 28387,
3544, 323, 3776, 30, 36660, 3931, 2891, 25, 7566,
128001])

round 2:
tensor([ -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, 2360, 128001, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, 25, 7566,
128001])
That's all.
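For readers following the dump above: the per-round masking follows the LLaVA-style preprocessing, where each round's instruction part is re-tokenized to measure its length and that many target positions are overwritten with IGNORE_INDEX; if the counted length drifts from the real sequence length (for example because a BOS token was or wasn't prepended), the whole sample is masked and its loss becomes 0. A simplified sketch of that logic (illustrative only, not the exact Bunny implementation; image-token handling is omitted):

import torch

IGNORE_INDEX = -100

def mask_targets_by_round(tokenizer, rounds, targets, total_len, sep=" ASSISTANT: "):
    # rounds: list of strings like "... USER: <question> ASSISTANT: <answer>"
    cur_len = 0
    for rou in rounds:
        if rou == "":
            break
        parts = rou.split(sep)
        round_len = len(tokenizer(rou, add_special_tokens=False).input_ids)
        instruction_len = len(tokenizer(parts[0] + sep, add_special_tokens=False).input_ids)
        # Supervise only the answer tokens of this round.
        targets[cur_len:cur_len + instruction_len] = IGNORE_INDEX
        cur_len += round_len
    targets[cur_len:] = IGNORE_INDEX
    if cur_len != total_len:
        # The "tokenization mismatch" case: e.g. a BOS token was (or wasn't)
        # prepended, so the counted length drifts; everything gets masked and
        # the loss for the sample becomes 0.
        targets[:] = IGNORE_INDEX
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}")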
So the problem is clear: the tokenizer didn't prepend a BOS token. Did you clone the latest model from HF?
Maybe not; I will re-download and verify it.
@Isaachhh Yes, confirmed that this was only a problem with my Llama-3 version; just updating it fixed the issue.