[POC] Supporting standard tokenizers (HF, Mistral) #122

Draft
wants to merge 1 commit into main
Conversation

francoishernandez (Contributor)
Disclaimer

This PR is intended more as a POC to open some discussions than as something to be merged as-is.

Context

This work stems from the intent to start moving towards supporting multi-modal models. The first plan was to support the Llama 3.2 vision models, but the EU licensing issue pushed me towards having a look at Pixtral first.

Pixtral uses the new Mistral tokenizer, named “tekken”, which is based on tiktoken and has quite a lot of custom logic to it.
It also feels less and less like a good idea to keep relying on the OpenNMT tokenizer, which is neither really maintained nor easily extendable to such use cases.

The default behaviour of most tokenizers, HF and Mistral included, is to directly provide IDs (what we call numericalize in our stack) rather than to keep working on token strings.
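For illustration, a minimal sketch of that behaviour with the Hugging Face tokenizers API (gpt2 is just an example checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "Hello world"
# The default call path returns IDs directly ("numericalized")
ids = tok(text)["input_ids"]      # [15496, 995]
# The intermediate token-level view requires explicit calls
tokens = tok.tokenize(text)       # ['Hello', 'Ġworld']
assert tok.convert_tokens_to_ids(tokens) == ids
```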

Proposed solution

In order not to break our transforms paradigm, which relies on tokens, I gather we can probably enforce some kind of rule that “ID tokenization” must be the last transform in the pipe.

This would allow us to keep manipulating the text data in any custom way we want before applying the "official" tokenization logic. It would also probably simplify the support of "added" tokens and of the various templates.
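To make the idea concrete, a rough sketch of what enforcing that rule could look like (Transform, produces_ids, HFIdTokenizerTransform, and validate_pipe are hypothetical names for illustration, not the actual transform API):

```python
class Transform:
    """Base transform; operates on token-level examples."""
    produces_ids = False  # hypothetical flag marking an "ID tokenization" step

    def apply(self, example: dict) -> dict:
        return example

class HFIdTokenizerTransform(Transform):
    """Terminal transform: maps raw text straight to IDs."""
    produces_ids = True

    def __init__(self, hf_tokenizer):
        self.tokenizer = hf_tokenizer

    def apply(self, example: dict) -> dict:
        example["src_ids"] = self.tokenizer(example["src"])["input_ids"]
        return example

def validate_pipe(transforms: list[Transform]) -> None:
    # Proposed rule: an ID-producing transform, if present, must come last,
    # so every other transform can keep working on tokens/text.
    if any(t.produces_ids for t in transforms[:-1]):
        raise ValueError("ID tokenization must be the last transform in the pipe")
```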

What was tested

  • conversion of standard models (gpt2, llama2, llama3.x)
  • gpt2 inference + HellaSwag evaluation
  • llama2 inference
  • llama3.x inference and finetuning

Pending topics

  • supporting chat-style datasets and chat templates
  • making the tokenization more efficient (duplicate src/tgt calls are not needed)
  • making the eos logic clearer
  • supporting MistralTokenizer (see the sketch below)
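On the MistralTokenizer point, something along these lines is what I have in mind, assuming the mistral-common package and its tekken-based v3 entry point (exact API to be double-checked):

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Tekken-based tokenizer (tiktoken under the hood), as used by Pixtral
tokenizer = MistralTokenizer.v3(is_tekken=True)

tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Hello, world!")])
)
print(tokenized.tokens)  # IDs directly, consistent with the "last transform" rule
```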

Open questions

  • do we want to revisit the whitespace handling logic at this stage? Two topics: the \n to ((newline)) conversion (see the sketch below), and transforms working on lists of space-split tokens instead of full strings.
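To make the first topic concrete, the conversion roughly amounts to the following (the ((newline)) spelling is taken from the note above; the actual marker in the codebase may differ):

```python
NEWLINE_PLACEHOLDER = "((newline))"  # assumed spelling, per the note above

def encode_newlines(text: str) -> str:
    # Replace literal newlines so the example survives space-based splitting
    return text.replace("\n", NEWLINE_PLACEHOLDER)

def decode_newlines(text: str) -> str:
    return text.replace(NEWLINE_PLACEHOLDER, "\n")
```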
