A missing space between <s> and [INST] #3
Here is an example message that consistently cuts off after numbers with Mixtral-8x7B when a bad chat template without the space is used. The information in the message is synthetic; it contains no real personal information. The message is embedded in Python code to remove ambiguity about escaping and linefeeds.
Hello! Agreed, whitespace is frustratingly important in templates (which is why we've moved to control tokens in our newest templates). I am far from an HF chat template expert, but my understanding is:
Your question prompted me to run the same test we have here: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1#instruct-tokenizer but with v1 compared with mixtral-8x7b-instruct-v0.1 on HF. The template on HF is wrong. Our reference implementation in this repo is correct. This is what I tested:

    from mistral_common.protocol.instruct.messages import (
        AssistantMessage,
        UserMessage,
    )
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
    from mistral_common.tokens.instruct.normalize import ChatCompletionRequest
    from transformers import AutoTokenizer

    # Reference tokenization with the mistral-common v1 tokenizer
    tokenizer_v1 = MistralTokenizer.v1()
    mistral_query = ChatCompletionRequest(
        messages=[
            UserMessage(content="1"),
            AssistantMessage(content="2"),
            UserMessage(content="3"),
        ],
        model="test",
    )
    tokenized = tokenizer_v1.encode_chat_completion(mistral_query)
    # Must be defined before the prints below use it
    tokenized_mistral = tokenized.tokens

    # Tokenization through the HF chat template
    tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-Instruct-v0.1')
    hf_messages = mistral_query.model_dump()['messages']
    tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

    print("MISTRAL_COMMON")
    print(tokenized.text)
    print(tokenizer_hf.convert_ids_to_tokens(tokenized_mistral))
    print("HF : MIXTRAL-8x7B")
    print(tokenizer_v1.instruct_tokenizer.tokenizer.to_string(tokenized_hf))
    print(tokenizer_hf.convert_ids_to_tokens(tokenized_hf))

    assert tokenized_hf == tokenized_mistral

The output is
The only error I can see in the chat template on HF is the missing space before [/INST] and the assistant message. I would very gladly accept a PR that fixes it on all our 7B / 8x7B repos; HF tokenizers are a bit mysterious to me! Another question, since you're saying 8x7B behaves badly with mistral-common tokenization:
Thanks!
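When two tokenizations disagree, as in the comparison above, a bare `assert` only says that they differ, not where. The following is a minimal sketch of a helper that reports the first diverging position. The function names `first_mismatch` and `describe_mismatch` are hypothetical and not part of mistral-common or transformers, and the token pieces in the usage line are made up for illustration rather than real tokenizer output.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for "one sequence ended early"


def first_mismatch(a, b):
    """Return the index of the first differing position, or None if equal."""
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=_MISSING)):
        if x is _MISSING or y is _MISSING or x != y:
            return i
    return None


def describe_mismatch(a, b, context=2):
    """Summarise the first difference with a little leading context."""
    i = first_mismatch(a, b)
    if i is None:
        return "identical"
    lo = max(0, i - context)
    return f"first difference at index {i}: {a[lo:i + 1]!r} vs {b[lo:i + 1]!r}"


# Illustrative pieces only; real ids or pieces come from the tokenizers.
print(describe_mismatch(["<s>", "A", "B", "C"], ["<s>", "A", "X", "C"]))
```

In the test above, this could be called as `describe_mismatch(tokenized_mistral, tokenized_hf)` before the final assertion to see which token introduces the whitespace difference.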
We are calling the model through vLLM with a custom chat template. We don't actually add the
What I need here is confirmation from the Mistral team that the chat template we use is correct, because they are the only ones who can check what the real chat template is in their training pipelines.
It actually depends on the tokenizer. After some verification, it seems that almost all HF templates need to be rewritten (I'm working on it using
V3 temporary chat template:
V3 Chat template Output:
V2 Output:
V2 Temp chat template:
V2 Chat template Output:
V1 Output
V3 Temp chat template:
V3 Chat template output:
I will keep you updated; hopefully this helps a bit for now!
@keskival Following this, I've updated most of the chat templates on HF. The ones I provided previously are not 100% accurate, so I recommend using the ones I've added to the HF repos. They are still not perfect, as there seem to be some issues with the tokenizers themselves on some repos, but they are far better than before and should for most of them match with
when trying to address a similar issue, I still find it unclear which flags to use with HF.
Hello! I am trying to use https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 for fine-tuning, but I am not sure about the template. Will this work? formatter_text = f"
Thanks!
I believe there should be a space before [INST] here: mistral-common/src/mistral_common/tokens/tokenizers/sentencepiece.py, line 167 in fcf0316.
This is validated by testing. We cannot make the model emit its proper instruct template because it seems to be disinclined to emit it, probably because of training details.
However, the model is very keen on emitting spurious </s> if this space is omitted, especially after numbers such as phone numbers, address numbers and dates.
Notably, this page gives two different chat templates: one written out, which matches the one in this repository, and the pseudocode below it, which is missing the spaces after [INST] and after [/INST]: https://docs.mistral.ai/getting-started/open_weight_models/#chat-template
I believe both of these examples are actually incorrect.
Then there is this source: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
It shows the space between <s> and [INST], which I believe is correct. Adding that space to the template gives superior output quality based on qualitative tests.
However, the pseudocode given in that source matches the pseudocode from the Mistral documentation and differs from what is shown, and I believe it is incorrect as well.
Please look into this, and be clear what the chat template actually is; whitespace is very very important.
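To make the variants under discussion concrete, here is a minimal sketch that renders the same conversation with and without the disputed spaces. The function `render` and its parameters `space_before_inst` and `pad` are hypothetical names introduced for illustration. The sketch only builds strings; it makes no claim about which variant matches training, and how a tokenizer splits each string is a separate question.

```python
BOS, EOS = "<s>", "</s>"


def render(turns, space_before_inst, pad):
    """Render (user, assistant_or_None) turns as one prompt string.

    space_before_inst: put a space before every [INST] marker.
    pad: string placed around the user content and before the
         assistant reply (" " or "").
    """
    out = BOS
    for user, assistant in turns:
        out += (" " if space_before_inst else "") + "[INST]"
        out += pad + user + pad + "[/INST]"
        if assistant is not None:
            out += pad + assistant + EOS
    return out


turns = [("1", "2"), ("3", None)]
for space_before_inst, pad in [(True, " "), (False, " "), (False, "")]:
    # repr() makes every space visible
    print(repr(render(turns, space_before_inst, pad)))
```

Printing with `repr` keeps the whitespace unambiguous, which is the same reason the example message in this thread was embedded in Python code.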