add lexicographic ordering for breaking ties to make the tokenizer deterministic #90
I could be wrong, but I think lexicographic tie breaking is currently not implemented. For example, the two runs below train on the same characters in a different order, and pairs with equal counts get merged in a different order:
custom_text = 'bbbaaaddddcccc'
extra_tokens = 4
tokenizer = RegexTokenizerAK(pattern=GPT2_SPLIT_PATTERN)
tokenizer.train(custom_text, vocab_size=256 + extra_tokens, verbose=True)
assert tokenizer.decode(tokenizer.encode(custom_text)) == custom_text, 'enc/dec mismatch'
Yields
merge 1/4: (100, 100) -> 256 (b'dd') had 3 occurrences
merge 2/4: (99, 99) -> 257 (b'cc') had 3 occurrences
merge 3/4: (98, 98) -> 258 (b'bb') had 2 occurrences
merge 4/4: (97, 97) -> 259 (b'aa') had 2 occurrences
Whereas
custom_text = 'aaabbbccccdddd'
extra_tokens = 4
tokenizer = RegexTokenizerAK(pattern=GPT2_SPLIT_PATTERN)
tokenizer.train(custom_text, vocab_size=256 + extra_tokens, verbose=True)
assert tokenizer.decode(tokenizer.encode(custom_text)) == custom_text, 'enc/dec mismatch'
Yields
merge 1/4: (99, 99) -> 256 (b'cc') had 3 occurrences
merge 2/4: (100, 100) -> 257 (b'dd') had 3 occurrences
merge 3/4: (97, 97) -> 258 (b'aa') had 2 occurrences
merge 4/4: (98, 98) -> 259 (b'bb') had 2 occurrences
With the suggested tie breaking, both inputs yield the same merges:
merge 1/4: (99, 99) -> 256 (b'cc') had 3 occurrences
merge 2/4: (100, 100) -> 257 (b'dd') had 3 occurrences
merge 3/4: (97, 97) -> 258 (b'aa') had 2 occurrences
merge 4/4: (98, 98) -> 259 (b'bb') had 2 occurrences
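For reference, here is a minimal standalone sketch of the kind of tie breaking I mean. It is not the actual diff in this PR: it skips the regex split step and works on raw bytes, and the helper names (`pair_counts`, `merge`, `train_merges`) are made up for illustration. The rule it assumes, inferred from the output above, is "highest count first, and among equal counts the lexicographically smallest pair".

```python
from collections import Counter


def pair_counts(ids):
    """Count adjacent pairs of token ids (overlapping pairs included)."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_merges(text, num_merges):
    """Return the list of merged pairs, chosen deterministically."""
    ids = list(text.encode("utf-8"))
    merges = []
    for step in range(num_merges):
        stats = pair_counts(ids)
        if not stats:
            break
        # Assumed rule: highest count wins; ties go to the smallest pair.
        # Sorting on (-count, pair) does not depend on dict insertion order.
        pair = min(stats, key=lambda p: (-stats[p], p))
        ids = merge(ids, pair, 256 + step)
        merges.append(pair)
    return merges


if __name__ == "__main__":
    for text in ("bbbaaaddddcccc", "aaabbbccccdddd"):
        print(text, train_merges(text, 4))
```

By contrast, something like `max(stats, key=stats.get)` returns the first maximal key in dict iteration order, which follows the order the pairs were first seen in the text. That would explain why the two inputs above disagree without the explicit tie break.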