
Rework handling of special tokens #45
Merged: 12 commits merged into main on Sep 20, 2024

Conversation

francoishernandez (Contributor) commented on Jun 28, 2024:

This PR is an attempt at making the configuration of special tokens easier, and at handling some model-specific particularities.

Two main changes (see the sketch after this list):

  • addition of {bos,eos,unk,pad}_token fields in BaseVocabConfig, which are then stored in a "specials" key in the vocab object;
  • addition of optional_eos in PredictConfig to handle cases where we might need several EOS tokens, e.g. Llama3 with <|end_of_text|> and <|eot_id|>.
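
A rough sketch of the intent, in plain Python dicts (the field names {bos,eos,unk,pad}_token, the "specials" key, and optional_eos come from this description; the exact config shapes and example token strings are illustrative assumptions, not the actual eole API):

```python
# Vocab-level special tokens could be declared on the vocab config...
vocab_config = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|end_of_text|>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
}

# ...and end up grouped under a "specials" key in the built vocab object:
vocabs = {
    "specials": {
        "bos_token": "<|begin_of_text|>",
        "eos_token": "<|end_of_text|>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
    },
    # ... regular token/id tables ...
}

# At prediction time, extra stop tokens can be accepted on top of eos_token,
# e.g. for Llama3:
predict_config = {"optional_eos": ["<|eot_id|>"]}
```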

Some open questions / TODOs:

francoishernandez added the "refactor" label (Some refactoring, aesthetic or cleanup code changes) on Jul 3, 2024.
Review threads on eole/predict/inference.py, eole/config/run.py, and eole/inputters/inputter.py were marked outdated/resolved.
vince62s (Contributor) commented on Sep 9, 2024:

I agree on deprecating default_specials.
Yes, convert_HF needs to be adapted. I also have some changes that have an impact on this: even when we have a "sentencepiece" model, I will from now on also read tokenizer.json to take the "added_tokens" into account and hence extend the vocab (not only from the sentencepiece model).
As per a similar discussion, we'll also need to be able to "replace" some tokens that need to be preserved, but I think that will be a separate PR.
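
A minimal sketch of the tokenizer.json part (assumed file layout; not the actual convert_HF code):

```python
import json

# Extend a vocab built from the sentencepiece model with the "added_tokens"
# entries found in a Hugging Face tokenizer.json, keeping them ordered by id.
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = ["<unk>", "<s>", "</s>"]  # placeholder: tokens read from the sentencepiece model
for entry in sorted(tok.get("added_tokens", []), key=lambda e: e["id"]):
    if entry["content"] not in vocab:
        vocab.append(entry["content"])
```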

francoishernandez changed the title from "[WIP] Rework handling of special tokens" to "Rework handling of special tokens" on Sep 16, 2024.
francoishernandez (Contributor, Author) commented:

I think we are mostly good on this one.

In the latest commits I:

  • removed the default_specials flag (instead, we grab the values from vocabs["specials"] when needed in build_vocab);
  • updated convert_HF to grab the actual special tokens from special_tokens_map.json (might need to extend this if alternatives exist on HF, not sure; see the sketch at the end of this comment);
  • (roughly) updated the FAQ entry.

@vince62s @l-k-11235 feedback welcome!

Note: I removed the <0x00> padding patch in convert_HF. I'm not sure whether we should (1) keep it, (2) adapt it (but in what way?), or (3) ignore it.

The next step should probably be to dive back into #42, and to automate the mapped_tokens/optional_eos/inference config setup as discussed here.
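
For the special_tokens_map.json point above, a minimal sketch of the idea (assumed file layout; the real convert_HF logic may differ; values in that file can be plain strings or dicts with a "content" field):

```python
import json
import os

def read_specials(model_dir: str) -> dict:
    """Collect bos/eos/unk/pad tokens from a HF special_tokens_map.json."""
    with open(os.path.join(model_dir, "special_tokens_map.json"), encoding="utf-8") as f:
        raw = json.load(f)
    specials = {}
    for key in ("bos_token", "eos_token", "unk_token", "pad_token"):
        value = raw.get(key)
        if isinstance(value, dict):  # some repos store {"content": "...", ...}
            value = value.get("content")
        if value is not None:
            specials[key] = value
    return specials
```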

francoishernandez marked this pull request as ready for review on September 16, 2024, 14:32.
vince62s (Contributor) commented on Sep 16, 2024:

I think our code requires a padding token.
In gpt2_pretok / BPE models, the pad token will be handled here: https://github.com/eole-nlp/eole/pull/45/files#diff-fe182c94492e3a828a680a52f0723ef2d6f75f4cd563efed58397f7d43a0364bR963
In sentencepiece models, when there is no explicit padding token (look at Mistral, for instance), we need to map it to something, otherwise our code will break.
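
One possible way to handle the missing-pad case, as a hedged sketch (function and default names are illustrative, not the actual eole code):

```python
def ensure_pad_token(specials: dict, vocab: list, default_pad: str = "<blank>") -> str:
    """Make sure there is a usable pad token: reuse an existing special or add one."""
    pad = specials.get("pad_token")
    if pad is None:
        # Map padding onto an existing special, or fall back to a dedicated token.
        pad = specials.get("unk_token") or default_pad
        specials["pad_token"] = pad
    if pad not in vocab:
        vocab.append(pad)
    return pad
```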

francoishernandez marked this pull request as draft on September 16, 2024, 19:23.
francoishernandez (Contributor, Author) commented on Sep 16, 2024:

Indeed, situations where no padding token is explicitly set are quite dubious. I'm not sure it would actually break, because in many cases we fall back to DefaultTokens.PAD or, at the very least, to id 0, but that's relatively unclear.
Also, we still had a lot of cases where the DefaultTokens were used instead of the configured specials. The latest commit fixes some of them, but things like CT2 will require deeper work (if we recover CT2 support at some point).
We'll probably need to test this a bit more in depth, and do another round of cleanup/factorization before merging.
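
The pattern being aimed at could look roughly like this (illustrative only, not the actual eole code; the fallback values merely stand in for DefaultTokens-style constants):

```python
FALLBACKS = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<blank>"}

def get_special(vocabs: dict, name: str) -> str:
    """Prefer the configured specials; fall back to a default only if unset."""
    return vocabs.get("specials", {}).get(name) or FALLBACKS[name]
```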

vince62s marked this pull request as ready for review on September 20, 2024, 17:36.
francoishernandez (Contributor, Author) commented:

Let's merge this to move forward.

francoishernandez merged commit b849e18 into main on Sep 20, 2024.
4 checks passed