Let's build a GPT Tokenizer!
This repository is a simple implementation of the Byte Pair Encoding (BPE) algorithm used to tokenize text for LLMs.
tokenization.ipynb contains notes from Andrej Karpathy's tutorial, which covers the general BPE algorithm, GPT tokenization, Google's SentencePiece tokenizer (used by Llama 2 and Mistral), and papers including "Efficient Training of Language Models to Fill in the Middle" and "Language Models are Unsupervised Multitask Learners".
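For a quick sense of what the notebook covers, here is a minimal sketch of the core BPE training loop (an illustration of the general algorithm, not this repository's exact code):

```python
def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
merges = {}
for step in range(3):  # perform 3 merges for illustration
    counts = get_pair_counts(ids)
    top_pair = max(counts, key=counts.get)  # most frequent adjacent pair
    new_id = 256 + step                     # next unused token id
    ids = merge(ids, top_pair, new_id)
    merges[top_pair] = new_id
print(ids, merges)
```

Training repeats this greedy merge until the target vocabulary size is reached; encoding then replays the learned merges in order.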
A visualization of these tokenizers can be found at tiktokenizer.vercel.app.
The implementations of base.py, basic.py, and reg.py reference the minbpe repository. They contain, respectively, the generic tokenizer base class, the basic BPE tokenizer, and a regex-based tokenizer similar to what OpenAI's tiktoken uses for the GPT-2 and GPT-4 tokenizers.
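The key idea behind the regex tokenizer is that text is first split into chunks with a pattern, and BPE merges are applied within each chunk, so merges never cross word, number, or punctuation boundaries. A small sketch using GPT-2's published split pattern (it requires the third-party `regex` module, since the standard-library `re` does not support `\p{...}` classes):

```python
import regex

# GPT-2's split pattern, as published in OpenAI's gpt-2 repo and used in minbpe.
GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT2_SPLIT_PATTERN, "Hello world! It's 2024.")
print(chunks)  # ['Hello', ' world', '!', ' It', "'s", ' 2024', '.']
```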
The basic BPE tokenizer outputs the following result:
The GPT-2 tokenizer outputs the following result:
The GPT-4 tokenizer outputs the following result:
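To reproduce outputs like the above, the tokenizers can typically be driven through a train/encode/decode interface in minbpe's style. A hedged sketch (the import path, class name, and training text below are assumptions, not necessarily this repo's exact API):

```python
from basic import BasicTokenizer  # hypothetical: adjust to this repo's layout

tokenizer = BasicTokenizer()
tokenizer.train("your training text here ...", vocab_size=512)  # 256 raw bytes + 256 learned merges
ids = tokenizer.encode("hello world")
assert tokenizer.decode(ids) == "hello world"  # BPE encoding round-trips losslessly
print(ids)
```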