[POC] Supporting standard tokenizers (HF, Mistral) #122
Disclaimer
This PR is intended more as a POC to open some discussions, rather than to be merged asap.
Context
This work stems from the intention to start moving towards supporting multi-modal models. The first idea was to support the Llama 3.2 vision models, but the EU-licensing topic pushed me towards having a look at Pixtral first.
Pixtral uses the new Mistral tokenizer, named “tekken”, which is based on tiktoken and carries quite a lot of custom logic.
Also, it feels less and less like a good idea to keep relying on the OpenNMT tokenizer, which is neither actively maintained nor easily extendable to such use cases.
The default behaviour of most tokenizers, HF and Mistral included, is to directly provide IDs (what we call numericalize in our stack) rather than to keep working on tokens.
Proposed solution
In order not to break our `transforms` paradigm, which relies on tokens, I gather we can probably enforce some kind of rule allowing “ID tokenization” only as the last transform in the pipe. This would let us keep manipulating the text data in any custom way we want prior to applying the "official" tokenization logic. It would also probably simplify the support of "added" tokens and of various templates.
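As a rough illustration of that rule, here is a minimal, hypothetical sketch (the transform and vocab names are made up for this example, not taken from the actual codebase): token-level transforms compose freely, and a terminal "numericalize" step is only accepted at the end of the pipe.

```python
# Hypothetical sketch: transforms operate on token lists, and the
# ID-producing ("numericalize") transform is enforced to come last.

def lowercase_transform(tokens):
    # an example token-level transform
    return [t.lower() for t in tokens]

def add_special_tokens(tokens):
    # e.g. prepend a BOS marker before numericalizing
    return ["<s>"] + tokens

class NumericalizeTransform:
    """Maps tokens to IDs; must be the final transform in the pipe."""
    is_terminal = True  # flag used to enforce the ordering rule

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, tokens):
        return [self.vocab.get(t, self.vocab["<unk>"]) for t in tokens]

def build_pipeline(transforms):
    # Enforce the rule: only the last transform may produce IDs.
    for t in transforms[:-1]:
        if getattr(t, "is_terminal", False):
            raise ValueError("ID tokenization must be the last transform")
    def run(tokens):
        for t in transforms:
            tokens = t(tokens)
        return tokens
    return run

vocab = {"<unk>": 0, "<s>": 1, "hello": 2, "world": 3}
pipe = build_pipeline(
    [lowercase_transform, add_special_tokens, NumericalizeTransform(vocab)]
)
print(pipe(["Hello", "world"]))  # [1, 2, 3]
```

The same check would reject a pipe where `NumericalizeTransform` appears anywhere but last, which is the invariant that keeps all earlier transforms working on tokens.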
What was tested
Pending topics
Open questions