From 878f9ea96685beef5381f7ee24edd47e87b31208 Mon Sep 17 00:00:00 2001
From: Hk669
Date: Wed, 5 Jun 2024 01:00:16 +0530
Subject: [PATCH] feat: pretrained tokenizers

---
 README.md                                     | 5 ++++-
 wi17k_base.json => pretrained/wi17k_base.json | 0
 2 files changed, 4 insertions(+), 1 deletion(-)
 rename wi17k_base.json => pretrained/wi17k_base.json (100%)

diff --git a/README.md b/README.md
index e526f56..8e0bc80 100644
--- a/README.md
+++ b/README.md
@@ -22,14 +22,17 @@ Every LLM(LLama, Gemini, Mistral..) use their own Tokenizers trained on their ow
 
 - Compatible with Python 3.9 and above
 
-#### This repository has 2 different Tokenizers:
+#### This repository has 3 different Tokenizers:
 
 - `BPETokenizer`
 - `Tokenizer`
+- `PreTrained`
 
 1. [Tokenizer](bpetokenizer/base.py): This class contains `train`, `encode`, `decode` and functionalities to `save` and `load`. Also contains few helper functions `get_stats`, `merge`, `replace_control_characters`.. to perform the BPE algorithm for the tokenizer.
 
 2. [BPETokenizer](bpetokenizer/tokenizer.py): This class emphasizes the real power of the tokenizer(used in gpt4 tokenizer..[tiktoken](https://github.com/openai/tiktoken)), uses the `GPT4_SPLIT_PATTERN` to split the text as mentioned in the gpt4 tokenizer. also handles the `special_tokens` (refer [sample_bpetokenizer](sample/bpetokenizer/sample_bpetokenizer.py)). which inherits the `save` and `load` functionlities to save and load the tokenizer respectively.
 
+3. [PreTrained Tokenizer](pretrained/wi17k_base.json): The pretrained tokenizer `wi17k_base`, with a vocabulary of 17,316 tokens and 6 special_tokens, trained on the wikitext dataset (len: 1000000).
+
 
 ### Usage

diff --git a/wi17k_base.json b/pretrained/wi17k_base.json
similarity index 100%
rename from wi17k_base.json
rename to pretrained/wi17k_base.json
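
For context, a minimal usage sketch of the pretrained tokenizer introduced by this patch. It assumes `BPETokenizer` is importable as `from bpetokenizer import BPETokenizer` and that the inherited `load` accepts the path to the saved JSON vocabulary; both are assumptions, and the package's actual import path and `load` signature may differ.

```python
from bpetokenizer import BPETokenizer  # assumed import path

# Load the pretrained wi17k_base vocabulary added under pretrained/ in this patch
tokenizer = BPETokenizer()
tokenizer.load("pretrained/wi17k_base.json")  # assumed: load takes the JSON vocab path

# Round-trip some text through the ~17k-token vocabulary
ids = tokenizer.encode("Tokenizers map text to integer ids.")
print(ids)
print(tokenizer.decode(ids))
```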