
Deduplication of text chunks with frequency count, training and encoding 5x speedup #82

Open
Majdoddin wants to merge 3 commits into master

Conversation

@Majdoddin commented Jun 8, 2024

In RegexTokenizer, the training text is first split into chunks, and all further processing is performed on individual chunks. This PR optimizes the process by retaining only the unique chunks together with their frequency counts. In practice this cuts the number of chunks to roughly 1/7th, resulting in a training speedup of at least 5x.
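
A minimal sketch of the idea (illustrative variable names and a toy split pattern, not the PR's exact code): each unique chunk is byte-encoded once, and its frequency is kept so that later pair statistics can be weighted by how often the chunk occurs.

    import collections
    import re  # minbpe uses the `regex` module with the GPT-4 split pattern; a toy pattern is used here

    text = "the cat sat on the mat because the cat liked the mat"
    toy_pattern = re.compile(r"\s*\S+")          # stand-in for the real split regex
    text_chunks = re.findall(toy_pattern, text)  # many chunks, several of them repeated

    # deduplicate: keep each distinct chunk once, together with its frequency
    chunk_counts = collections.Counter(text_chunks)
    ids = [list(chunk.encode("utf-8")) for chunk in chunk_counts]  # unique chunks only
    freqs = list(chunk_counts.values())          # multiplicity of each unique chunk

    # all further work (pair counting, merging) now loops over len(ids) unique chunks
    # instead of len(text_chunks) total chunks, with freqs weighting the counts
    print(len(text_chunks), len(ids))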

A similar optimization is applied to encode_ordinary(), where the tokenization of each chunk is cached, also giving a 5x speedup.
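
A rough sketch of the encoding-side cache (assuming minbpe's RegexTokenizer, whose encode_ordinary() splits the text with the regex and then BPE-encodes each chunk via an internal _encode_chunk helper; the cache handling here is illustrative, not necessarily the PR's exact code):

    def encode_ordinary(self, text):
        # split the text into chunks with the same regex as before
        text_chunks = re.findall(self.compiled_pattern, text)
        cache = {}  # chunk string -> already-computed token ids
        ids = []
        for chunk in text_chunks:
            if chunk not in cache:
                # BPE-encode each distinct chunk only once
                cache[chunk] = self._encode_chunk(chunk.encode("utf-8"))
            ids.extend(cache[chunk])
        return ids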

@Majdoddin changed the title from "Deduplication of text chunks with frequency count, 5x training speedup" to "Deduplication of text chunks with frequency count, training and encoding 5x speedup" on Jun 9, 2024

@ae99 left a comment

Independently arrived at this same change myself!

Took training of an 8192-size tokenizer on ~3M words from 8 hours down to 20 minutes. I'll likely now re-train on ~1B words, given that this makes training almost entirely independent of dataset size once the regex split is complete. Makes this repo production-viable!

@@ -41,17 +41,26 @@ def train(self, text, vocab_size, verbose=False):
text_chunks = re.findall(self.compiled_pattern, text)

Faster to do the counting ahead of converting to unicode

Suggested change:

        text_chunks = re.findall(self.compiled_pattern, text)
        chunks_counted = collections.Counter(text_chunks)
        text_chunks = [chunk for chunk, count in chunks_counted.items()]
        global_counts = [count for chunk, count in chunks_counted.items()]

Then further down we can just go:

            for chunk_ids, global_count in zip(ids, global_counts):
                # passing in stats will update it in place, adding up counts
                get_stats_n(chunk_ids, stats, global_count)
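
get_stats_n is not defined in the snippet above; a plausible version, extending minbpe's get_stats with a count multiplier, could look like this (an assumed shape, not necessarily the reviewer's actual helper):

    def get_stats_n(ids, stats=None, count=1):
        # like get_stats, but each consecutive pair is credited `count` times,
        # i.e. weighted by how often this chunk appears in the training text
        stats = {} if stats is None else stats
        for pair in zip(ids, ids[1:]):
            stats[pair] = stats.get(pair, 0) + count
        return stats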
