
Rework handling of special tokens #45
Merged: 12 commits merged into main on Sep 20, 2024

Conversation

francoishernandez (Contributor) commented on Jun 28, 2024:

This PR is an attempt at making the configuration of special tokens easier, and at handling some model-specific particularities.

Two main changes (see the sketch after this list):

  • addition of {bos,eos,unk,pad}_token fields in BaseVocabConfig, which are then stored in a "specials" key in the vocab object;
  • addition of optional_eos in PredictConfig to handle cases where we might need several EOS tokens, e.g. Llama3 with <|end_of_text|> and <|eot_id|>.
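
A rough sketch of the intent, in plain Python dicts (the field names {bos,eos,unk,pad}_token, the "specials" key, and optional_eos come from this description; the exact config shapes and example token strings are illustrative assumptions, not the actual eole API):

```python
# Vocab-level special tokens could be declared on the vocab config...
vocab_config = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|end_of_text|>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
}

# ...and end up grouped under a "specials" key in the built vocab object:
vocabs = {
    "specials": {
        "bos_token": "<|begin_of_text|>",
        "eos_token": "<|end_of_text|>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
    },
    # ... regular token/id tables ...
}

# At prediction time, extra stop tokens can be accepted on top of eos_token,
# e.g. for Llama3:
predict_config = {"optional_eos": ["<|eot_id|>"]}
```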

Some open questions / TODOs:

francoishernandez added the "refactor" label (Some refactoring, aesthetic or cleanup code changes) on Jul 3, 2024.
Review threads on eole/predict/inference.py, eole/config/run.py, and eole/inputters/inputter.py were marked outdated/resolved.
vince62s (Contributor) commented on Sep 9, 2024:

I agree on deprecating default_specials.
Yes, convert_HF needs to be adapted. I also have some changes that have an impact on this: even when we have a "sentencepiece" model, I will from now on also read tokenizer.json to take the "added_tokens" into account and hence extend the vocab (not only from the sentencepiece model).
As per a similar discussion, we'll also need to be able to "replace" some tokens that need to be preserved, but I think that will be a separate PR.
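
A minimal sketch of the tokenizer.json part (assumed file layout; not the actual convert_HF code):

```python
import json

# Extend a vocab built from the sentencepiece model with the "added_tokens"
# entries found in a Hugging Face tokenizer.json, keeping them ordered by id.
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = ["<unk>", "<s>", "</s>"]  # placeholder: tokens read from the sentencepiece model
for entry in sorted(tok.get("added_tokens", []), key=lambda e: e["id"]):
    if entry["content"] not in vocab:
        vocab.append(entry["content"])
```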

francoishernandez changed the title from "[WIP] Rework handling of special tokens" to "Rework handling of special tokens" on Sep 16, 2024.
francoishernandez (Contributor, Author) commented:

I think we are mostly good on this one.

In the latest commits I:

  • removed the default_specials flag (instead, we grab the values from vocabs["specials"] when needed in build_vocab);
  • updated convert_HF to grab the actual special tokens from special_tokens_map.json (might need to extend this if alternatives exist on HF, not sure; see the sketch at the end of this comment);
  • (roughly) updated the FAQ entry.

@vince62s @l-k-11235 feedback welcome!

Note: I removed the <0x00> padding patch in convert_HF. I'm not sure whether we should (1) keep it, (2) adapt it (but in what way?), or (3) ignore it.

The next step should probably be to dive back into #42, and to automate the mapped_tokens/optional_eos/inference config setup as discussed here.
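
For the special_tokens_map.json point above, a minimal sketch of the idea (assumed file layout; the real convert_HF logic may differ; values in that file can be plain strings or dicts with a "content" field):

```python
import json
import os

def read_specials(model_dir: str) -> dict:
    """Collect bos/eos/unk/pad tokens from a HF special_tokens_map.json."""
    with open(os.path.join(model_dir, "special_tokens_map.json"), encoding="utf-8") as f:
        raw = json.load(f)
    specials = {}
    for key in ("bos_token", "eos_token", "unk_token", "pad_token"):
        value = raw.get(key)
        if isinstance(value, dict):  # some repos store {"content": "...", ...}
            value = value.get("content")
        if value is not None:
            specials[key] = value
    return specials
```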

francoishernandez marked this pull request as ready for review on September 16, 2024, 14:32.
vince62s (Contributor) commented on Sep 16, 2024:

I think our code requires a padding token.
In gpt2_pretok / BPE models, the pad token will be handled here: https://github.com/eole-nlp/eole/pull/45/files#diff-fe182c94492e3a828a680a52f0723ef2d6f75f4cd563efed58397f7d43a0364bR963
In sentencepiece models, when there is no explicit padding token (look at Mistral, for instance), we need to map it to something, otherwise our code will break.
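
One possible way to handle the missing-pad case, as a hedged sketch (function and default names are illustrative, not the actual eole code):

```python
def ensure_pad_token(specials: dict, vocab: list, default_pad: str = "<blank>") -> str:
    """Make sure there is a usable pad token: reuse an existing special or add one."""
    pad = specials.get("pad_token")
    if pad is None:
        # Map padding onto an existing special, or fall back to a dedicated token.
        pad = specials.get("unk_token") or default_pad
        specials["pad_token"] = pad
    if pad not in vocab:
        vocab.append(pad)
    return pad
```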

francoishernandez marked this pull request as draft on September 16, 2024, 19:23.
francoishernandez (Contributor, Author) commented on Sep 16, 2024:

Indeed, situations where no padding token is explicitly set are quite dubious. I'm not sure it would actually break, because in many cases we fall back to DefaultTokens.PAD or, at the very least, to id 0, but that's relatively unclear.
Also, we still had a lot of cases where the DefaultTokens were used instead of the configured specials. The latest commit fixes some of them, but things like CT2 will require deeper work (if we recover CT2 support at some point).
We'll probably need to test this a bit more in depth, and do another round of cleanup/factorization before merging.
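
The pattern being aimed at could look roughly like this (illustrative only, not the actual eole code; the fallback values merely stand in for DefaultTokens-style constants):

```python
FALLBACKS = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<blank>"}

def get_special(vocabs: dict, name: str) -> str:
    """Prefer the configured specials; fall back to a default only if unset."""
    return vocabs.get("specials", {}).get(name) or FALLBACKS[name]
```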

vince62s marked this pull request as ready for review on September 20, 2024, 17:36.
francoishernandez (Contributor, Author) commented:

Let's merge this to move forward.

francoishernandez merged commit b849e18 into main on Sep 20, 2024.
4 checks passed