[POC] Supporting standard tokenizers (HF, Mistral) #122

Draft
wants to merge 1 commit into main
Conversation

francoishernandez (Contributor)
Disclaimer

This PR is intended more as a POC to open some discussions than as something to be merged as-is.

Context

This work stems from the intent to start moving towards supporting multi-modal models. The first plan was to support the Llama 3.2 vision models, but the EU licensing issue pushed me towards having a look at Pixtral first.

Pixtral uses the new Mistral tokenizer, named “tekken”, which is based on tiktoken and has quite a lot of custom logic to it.
It also feels less and less like a good idea to keep relying on the OpenNMT tokenizer, which is neither really maintained nor easily extendable to such use cases.

The default behaviour of most tokenizers, HF and Mistral included, is to directly provide IDs (what we call numericalize in our stack) rather than to keep working on token strings.
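For illustration, a minimal sketch of that behaviour with the Hugging Face tokenizers API (gpt2 is just an example checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "Hello world"
# The default call path returns IDs directly ("numericalized")
ids = tok(text)["input_ids"]      # [15496, 995]
# The intermediate token-level view requires explicit calls
tokens = tok.tokenize(text)       # ['Hello', 'Ġworld']
assert tok.convert_tokens_to_ids(tokens) == ids
```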

Proposed solution

In order not to break our transforms paradigm, which relies on tokens, I gather we can probably enforce some kind of rule that “ID tokenization” must be the last transform in the pipe.

This would allow us to keep manipulating the text data in any custom way we want before applying the "official" tokenization logic. It would also probably simplify the support of "added" tokens and of the various templates.
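To make the idea concrete, a rough sketch of what enforcing that rule could look like (Transform, produces_ids, HFIdTokenizerTransform, and validate_pipe are hypothetical names for illustration, not the actual transform API):

```python
class Transform:
    """Base transform; operates on token-level examples."""
    produces_ids = False  # hypothetical flag marking an "ID tokenization" step

    def apply(self, example: dict) -> dict:
        return example

class HFIdTokenizerTransform(Transform):
    """Terminal transform: maps raw text straight to IDs."""
    produces_ids = True

    def __init__(self, hf_tokenizer):
        self.tokenizer = hf_tokenizer

    def apply(self, example: dict) -> dict:
        example["src_ids"] = self.tokenizer(example["src"])["input_ids"]
        return example

def validate_pipe(transforms: list[Transform]) -> None:
    # Proposed rule: an ID-producing transform, if present, must come last,
    # so every other transform can keep working on tokens/text.
    if any(t.produces_ids for t in transforms[:-1]):
        raise ValueError("ID tokenization must be the last transform in the pipe")
```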

What was tested

  • conversion of standard models (gpt2, llama2, llama3.x)
  • gpt2 inference + HellaSwag evaluation
  • llama2 inference
  • llama3.x inference and finetuning

Pending topics

  • supporting chat-style datasets and chat templates
  • making the tokenization more efficient (duplicate src/tgt calls are not needed)
  • making the eos logic clearer
  • supporting MistralTokenizer (see the sketch below)
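On the MistralTokenizer point, something along these lines is what I have in mind, assuming the mistral-common package and its tekken-based v3 entry point (exact API to be double-checked):

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Tekken-based tokenizer (tiktoken under the hood), as used by Pixtral
tokenizer = MistralTokenizer.v3(is_tekken=True)

tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Hello, world!")])
)
print(tokenized.tokens)  # IDs directly, consistent with the "last transform" rule
```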

Open questions

  • do we want to revisit the whitespace handling logic at this stage? Two topics: the \n to ((newline)) conversion (see the sketch below), and transforms working on lists of space-split tokens instead of full strings.
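To make the first topic concrete, the conversion roughly amounts to the following (the ((newline)) spelling is taken from the note above; the actual marker in the codebase may differ):

```python
NEWLINE_PLACEHOLDER = "((newline))"  # assumed spelling, per the note above

def encode_newlines(text: str) -> str:
    # Replace literal newlines so the example survives space-based splitting
    return text.replace("\n", NEWLINE_PLACEHOLDER)

def decode_newlines(text: str) -> str:
    return text.replace(NEWLINE_PLACEHOLDER, "\n")
```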
