add lexicographic ordering for breaking ties to make the tokenizer deterministic #90

dapopov-st · 2024-09-21T15:35:59Z

I could be wrong, but I think lexicographic tie breaking is currently not implemented. For example, training on two texts that contain the same characters in a different order merges equally frequent pairs in a different order:
custom_text = 'bbbaaaddddcccc'
extra_tokens = 4  # number of merges to learn on top of the 256 byte tokens
tokenizer = RegexTokenizerAK(pattern=GPT2_SPLIT_PATTERN)
tokenizer.train(custom_text, vocab_size=256 + extra_tokens, verbose=True)
assert tokenizer.decode(tokenizer.encode(custom_text)) == custom_text, 'enc/dec mismatch'
Yields
merge 1/4: (100, 100) -> 256 (b'dd') had 3 occurrences
merge 2/4: (99, 99) -> 257 (b'cc') had 3 occurrences
merge 3/4: (98, 98) -> 258 (b'bb') had 2 occurrences
merge 4/4: (97, 97) -> 259 (b'aa') had 2 occurrences

Whereas
custom_text = 'aaabbbccccdddd'
extra_tokens = 4  # number of merges to learn on top of the 256 byte tokens
tokenizer = RegexTokenizerAK(pattern=GPT2_SPLIT_PATTERN)
tokenizer.train(custom_text, vocab_size=256 + extra_tokens, verbose=True)
assert tokenizer.decode(tokenizer.encode(custom_text)) == custom_text, 'enc/dec mismatch'
Yields
merge 1/4: (99, 99) -> 256 (b'cc') had 3 occurrences
merge 2/4: (100, 100) -> 257 (b'dd') had 3 occurrences
merge 3/4: (97, 97) -> 258 (b'aa') had 2 occurrences
merge 4/4: (98, 98) -> 259 (b'bb') had 2 occurrences
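
The order dependence in the two logs above is what one would expect if the most frequent pair is chosen with `max(stats, key=stats.get)` over a dict filled in scan order. That is an assumption, since the tokenizer's selection code is not shown here. Python's `max` returns the first maximal item it encounters, and dicts preserve insertion order, so ties go to whichever pair appears first in the text. A minimal sketch (the helper name `first_seen_max` is hypothetical):

```python
def first_seen_max(text):
    """Return the most frequent adjacent byte pair; ties go to the pair seen first."""
    ids = list(text.encode("utf-8"))
    stats = {}
    for pair in zip(ids, ids[1:]):
        stats[pair] = stats.get(pair, 0) + 1
    # max() keeps the first maximal key in dict insertion order (assumed selection rule)
    return max(stats, key=stats.get)

print(first_seen_max("bbbaaaddddcccc"))  # dd is seen before cc
print(first_seen_max("aaabbbccccdddd"))  # cc is seen before dd
```

Both pairs occur 3 times in each text, so the winner depends only on which one is encountered first.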

With the suggested tie breaking, both inputs yield the same merges:
merge 1/4: (99, 99) -> 256 (b'cc') had 3 occurrences
merge 2/4: (100, 100) -> 257 (b'dd') had 3 occurrences
merge 3/4: (97, 97) -> 258 (b'aa') had 2 occurrences
merge 4/4: (98, 98) -> 259 (b'bb') had 2 occurrences
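
One way to get this behaviour is to rank candidate pairs by count first and by the pair itself second. The sketch below is not the diff from this PR. It is a stand-alone byte-level BPE loop under the assumption that ties go to the lexicographically smallest pair, which is consistent with the log above. All names (`pair_counts`, `merge`, `pick_pair`, `train_merges`) are hypothetical, and the regex pre-splitting step is left out for brevity.

```python
from collections import Counter

def pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def pick_pair(stats):
    """Highest count wins; among equal counts, the smallest pair wins (assumed rule)."""
    return min(stats, key=lambda p: (-stats[p], p))

def train_merges(text, num_merges):
    """Return the list of merged pairs, in the order they were chosen."""
    ids = list(text.encode("utf-8"))
    merges = []
    for step in range(num_merges):
        stats = pair_counts(ids)
        if not stats:
            break  # nothing left to merge
        pair = pick_pair(stats)
        ids = merge(ids, pair, 256 + step)
        merges.append(pair)
    return merges

print(train_merges("aaabbbccccdddd", 4))
print(train_merges("bbbaaaddddcccc", 4))
```

Because the sort key no longer depends on dict insertion order, reordering the characters of the training text does not change which of two equally frequent pairs is merged first.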

dapopov-st added 2 commits September 21, 2024 11:07

5380099 add lexicographic ordering for breaking ties to make the tokenizer deterministic

11ddc69 make the algorithm deterministic by breaking ties lexicographically
