Let's build a GPT Tokenizer!
This repository is a simple implementation of the Byte Pair Encoding (BPE) algorithm used to tokenize text for LLMs.
tokenization.ipynb contains notes from Andrej Karpathy's tutorial, which covers the general BPE algorithm, GPT tokenization, Google's SentencePiece tokenizer (used by Llama 2 and Mistral), and papers including "Efficient Training of Language Models to Fill in the Middle" and "Language Models are Unsupervised Multitask Learners".
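For a quick sense of what the notebook covers, here is a minimal sketch of the core BPE training loop (an illustration of the general algorithm, not this repository's exact code):

```python
def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
merges = {}
for step in range(3):  # perform 3 merges for illustration
    counts = get_pair_counts(ids)
    top_pair = max(counts, key=counts.get)  # most frequent adjacent pair
    new_id = 256 + step                     # next unused token id
    ids = merge(ids, top_pair, new_id)
    merges[top_pair] = new_id
print(ids, merges)
```

Training repeats this greedy merge until the target vocabulary size is reached; encoding then replays the learned merges in order.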
A visualization of these tokenizers can be found at tiktokenizer.vercel.app.
The implementations of base.py, basic.py, and reg.py reference the minbpe repository. They contain, respectively, the generic tokenizer base class, the basic BPE tokenizer, and a regex-based tokenizer similar to what OpenAI's tiktoken uses for the GPT-2 and GPT-4 tokenizers.
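The key idea behind the regex tokenizer is that text is first split into chunks with a pattern, and BPE merges are applied within each chunk, so merges never cross word, number, or punctuation boundaries. A small sketch using GPT-2's published split pattern (it requires the third-party `regex` module, since the standard-library `re` does not support `\p{...}` classes):

```python
import regex

# GPT-2's split pattern, as published in OpenAI's gpt-2 repo and used in minbpe.
GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT2_SPLIT_PATTERN, "Hello world! It's 2024.")
print(chunks)  # ['Hello', ' world', '!', ' It', "'s", ' 2024', '.']
```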
The basic BPE tokenizer outputs the following result:
The GPT-2 tokenizer outputs the following result:
The GPT-4 tokenizer outputs the following result:
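To reproduce outputs like the above, the tokenizers can typically be driven through a train/encode/decode interface in minbpe's style. A hedged sketch (the import path, class name, and training text below are assumptions, not necessarily this repo's exact API):

```python
from basic import BasicTokenizer  # hypothetical: adjust to this repo's layout

tokenizer = BasicTokenizer()
tokenizer.train("your training text here ...", vocab_size=512)  # 256 raw bytes + 256 learned merges
ids = tokenizer.encode("hello world")
assert tokenizer.decode(ids) == "hello world"  # BPE encoding round-trips losslessly
print(ids)
```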