From 878f9ea96685beef5381f7ee24edd47e87b31208 Mon Sep 17 00:00:00 2001
From: Hk669
Date: Wed, 5 Jun 2024 01:00:16 +0530
Subject: [PATCH] feat: pretrained tokenizers

---
 README.md                                     | 5 ++++-
 wi17k_base.json => pretrained/wi17k_base.json | 0
 2 files changed, 4 insertions(+), 1 deletion(-)
 rename wi17k_base.json => pretrained/wi17k_base.json (100%)

diff --git a/README.md b/README.md
index e526f56..8e0bc80 100644
--- a/README.md
+++ b/README.md
@@ -22,14 +22,17 @@ Every LLM(LLama, Gemini, Mistral..) use their own Tokenizers trained on their ow
 
 - Compatible with Python 3.9 and above
 
-#### This repository has 2 different Tokenizers:
+#### This repository has 3 different Tokenizers:
 
 - `BPETokenizer`
 - `Tokenizer`
+- `PreTrained`
 
 1. [Tokenizer](bpetokenizer/base.py): This class contains `train`, `encode`, `decode` and functionalities to `save` and `load`. Also contains few helper functions `get_stats`, `merge`, `replace_control_characters`.. to perform the BPE algorithm for the tokenizer.
 
 2. [BPETokenizer](bpetokenizer/tokenizer.py): This class emphasizes the real power of the tokenizer(used in gpt4 tokenizer..[tiktoken](https://github.com/openai/tiktoken)), uses the `GPT4_SPLIT_PATTERN` to split the text as mentioned in the gpt4 tokenizer. also handles the `special_tokens` (refer [sample_bpetokenizer](sample/bpetokenizer/sample_bpetokenizer.py)). which inherits the `save` and `load` functionlities to save and load the tokenizer respectively.
 
+3. [PreTrained Tokenizer](pretrained/wi17k_base.json): The pretrained tokenizer `wi17k_base`, with a vocabulary of 17,316 tokens and 6 special_tokens, trained on the wikitext dataset (len: 1000000).
+
 
 ### Usage

diff --git a/wi17k_base.json b/pretrained/wi17k_base.json
similarity index 100%
rename from wi17k_base.json
rename to pretrained/wi17k_base.json
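
For context, a minimal usage sketch of the pretrained tokenizer introduced by this patch. It assumes `BPETokenizer` is importable as `from bpetokenizer import BPETokenizer` and that the inherited `load` accepts the path to the saved JSON vocabulary; both are assumptions, and the package's actual import path and `load` signature may differ.

```python
from bpetokenizer import BPETokenizer  # assumed import path

# Load the pretrained wi17k_base vocabulary added under pretrained/ in this patch
tokenizer = BPETokenizer()
tokenizer.load("pretrained/wi17k_base.json")  # assumed: load takes the JSON vocab path

# Round-trip some text through the ~17k-token vocabulary
ids = tokenizer.encode("Tokenizers map text to integer ids.")
print(ids)
print(tokenizer.decode(ids))
```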