Address the feedback on the tokenizer's library (dotnet#7024)
* Fix cache when calling EncodeToIds
* Make EnglishRoberta _mergeRanks thread safe
* Delete Trainer
* Remove the setters on the Bpe properties
* Remove Roberta and Tiktoken special casing in the Tokenizer and support the cases in the Model abstraction
* Support text-embedding-3-small/large embedding
* Remove redundant TokenToId abstraction and keep the one with the extra parameters
* Enable creating Tiktoken asynchronously or directly using the tokenizer data
* Add cancellationToken support in CreateAsync APIs
* Rename sequence to text and Tokenize to Encode
* Rename skipSpecialTokens to considerSpecialTokens
* Rename TokenizerResult to EncodingResult
* Make Token publicly immutable
* Change offset tuples from (Index, End) to (Index, Length)
* Rename NormalizedString method's parameters
* Rename Model's methods to start with a verb
* Convert Model.GetVocab() method to a Vocab property
* Rename some method parameters and variables
* Remove Vocab and VocabSize from the abstraction
* Clean up normalization support
* Minor Bpe cleanup
* Resolve rebase change
* Address the feedback