Should -1 marker (as special token) be counted in vocab_size? #123

Open
mw66 opened this issue Sep 19, 2023 · 1 comment

Comments

mw66 commented Sep 19, 2023

y[:ndigit*2-1] = -1 # we will only train in the output locations. -1 will mask loss to zero

return 10 # digits 0..9
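
For context, a minimal sketch of how that -1 marker would get consumed downstream, assuming the loss is computed with PyTorch's F.cross_entropy and ignore_index=-1 (which is what "mask loss to zero" suggests); the shapes and values below are made up:

import torch
import torch.nn.functional as F

vocab_size = 10                               # digits 0..9, matching the return above
logits = torch.randn(1, 5, vocab_size)        # (batch, seq, vocab) -- dummy model output
targets = torch.tensor([[-1, -1, 7, 3, 9]])   # -1 marks positions we don't train on

# ignore_index=-1 drops the masked positions from the loss entirely, so -1
# never has to be a real token id and needs no slot in vocab_size.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1), ignore_index=-1)
print(loss.item())

On that reading, -1 is purely a loss mask and never indexes the embedding, so it would not be counted in vocab_size.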

@VatsaDev

To my understanding, we don't add negative values to the tokenizer; we just extend the vocab, like this:

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>","<bot>","<human>","<system>"})
decode = lambda l: enc.decode(l)

This just adds 4 extra special tokens to the existing ~50,000-token GPT-2 vocab.
You probably could have a negative tokenizer value (a [-1] token), but you would have to customize tiktoken for that, and adding negative values to the tokenizer means you now have to account for a larger fixed-size integer range, which I think would make it slower.
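
As far as I know, the stock gpt2 encoding only ships with <|endoftext|> as a special token, so one way to actually register those 4 strings is to build an extended Encoding on top of it. A rough sketch following the custom-encoding pattern tiktoken documents (the encoding name and ids here are illustrative):

import tiktoken

base = tiktoken.get_encoding("gpt2")

# Reuse the GPT-2 merges/regex and append 4 special tokens at the end of the
# id space, so all ids stay non-negative.
enc = tiktoken.Encoding(
    name="gpt2_chat",                      # illustrative name
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,            # keeps <|endoftext|> = 50256
        "<endOfText>": base.n_vocab,       # 50257
        "<bot>": base.n_vocab + 1,
        "<human>": base.n_vocab + 2,
        "<system>": base.n_vocab + 3,
    },
)

print(enc.encode("<human> hi <bot>", allowed_special="all"))
print(enc.n_vocab)                         # grew by 4; the model's vocab_size has to grow with it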

tl;dr: it's possible, but people don't really need negative tokens; it's just extra work and makes things slower.
