Should -1 marker (as special token) be counted in vocab_size? #123

Open
mw66 opened this issue Sep 19, 2023 · 1 comment

Comments

mw66 commented Sep 19, 2023

y[:ndigit*2-1] = -1 # we will only train in the output locations. -1 will mask loss to zero

return 10 # digits 0..9
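
For context, a minimal sketch of how that -1 marker would get consumed downstream, assuming the loss is computed with PyTorch's F.cross_entropy and ignore_index=-1 (which is what "mask loss to zero" suggests); the shapes and values below are made up:

import torch
import torch.nn.functional as F

vocab_size = 10                               # digits 0..9, matching the return above
logits = torch.randn(1, 5, vocab_size)        # (batch, seq, vocab) -- dummy model output
targets = torch.tensor([[-1, -1, 7, 3, 9]])   # -1 marks positions we don't train on

# ignore_index=-1 drops the masked positions from the loss entirely, so -1
# never has to be a real token id and needs no slot in vocab_size.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1), ignore_index=-1)
print(loss.item())

On that reading, -1 is purely a loss mask and never indexes the embedding, so it would not be counted in vocab_size.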

@VatsaDev

To my understanding, we don't add negative values to the tokenizer; we just extend the vocab, like this:

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>","<bot>","<human>","<system>"})
decode = lambda l: enc.decode(l)

This just adds 4 extra special tokens to the existing ~50,000-token GPT-2 vocab.
You probably could have a negative tokenizer value (a [-1] token), but you would have to customize tiktoken for that, and adding negative values to the tokenizer means you now have to account for a larger fixed-size integer range, which I think would make it slower.
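
As far as I know, the stock gpt2 encoding only ships with <|endoftext|> as a special token, so one way to actually register those 4 strings is to build an extended Encoding on top of it. A rough sketch following the custom-encoding pattern tiktoken documents (the encoding name and ids here are illustrative):

import tiktoken

base = tiktoken.get_encoding("gpt2")

# Reuse the GPT-2 merges/regex and append 4 special tokens at the end of the
# id space, so all ids stay non-negative.
enc = tiktoken.Encoding(
    name="gpt2_chat",                      # illustrative name
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,            # keeps <|endoftext|> = 50256
        "<endOfText>": base.n_vocab,       # 50257
        "<bot>": base.n_vocab + 1,
        "<human>": base.n_vocab + 2,
        "<system>": base.n_vocab + 3,
    },
)

print(enc.encode("<human> hi <bot>", allowed_special="all"))
print(enc.n_vocab)                         # grew by 4; the model's vocab_size has to grow with it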

tl;dr: it's possible, but people don't really need negative tokens; it's just extra work and makes things slower.
