Replies: 9 comments 1 reply
-
Okay, I only understood a little bit of your code, but I think I understood enough to give a solution.
-
@zacharias1219 I understood points 1, 2, and 4, but I'm not very sure about point 3.
-
Sure.
-
Also, this is my first time contributing, so I hope it is up to standard.
-
Sure, would love to see a PR on this.
-
Okay, got your point: basically it's better to use `re.escape` rather than the pair function. Will work on it soon.
-
Using `re.escape`, and encoding the batch of chunks to byte-level IDs.
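For context, a minimal sketch of both steps, assuming special tokens are kept in a dict (the token names and values below are illustrative, not the repo's actual ones):

```python
import re

# Hypothetical special-token table; names and ids are illustrative.
special_tokens = {"<|endoftext|>": 100257, "<|pad|>": 100258}

# re.escape quotes regex metacharacters, so the "|" inside a token like
# <|endoftext|> is matched literally instead of being read as alternation.
special_pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"

text = "hello<|endoftext|>world"
# Splitting on a capturing group keeps the special tokens in the output.
chunks = re.split(special_pattern, text)
print(chunks)  # ['hello', '<|endoftext|>', 'world']

# Byte-level ids for ordinary chunks: each UTF-8 byte maps to an int 0..255.
ids = [list(chunk.encode("utf-8")) for chunk in chunks if chunk not in special_tokens]
print(ids)  # [[104, 101, 108, 108, 111], [119, 111, 114, 108, 100]]
```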
-
What if we use collections.Counter to improve the performance? Counter is a subclass of Python's dict specifically designed for counting hashable objects (like tuples, strings, or integers), and it's optimized for efficiently storing and updating counts for these objects.
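As a rough illustration of the suggestion, pair counting built on Counter; `get_stats` is a hypothetical helper name and may not match the repo's:

```python
from collections import Counter

def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs.
    zip(ids, ids[1:]) yields every consecutive pair as a tuple,
    and Counter tallies them in one pass."""
    return Counter(zip(ids, ids[1:]))

ids = [101, 32, 101, 32, 101]
stats = get_stats(ids)
print(stats.most_common(1))  # [((101, 32), 2)]
```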
-
Issue
When passed a very large dataset, the tokenizer's vocabulary-building process takes an excessively long time to complete.

Enhancement Proposal
To address this, add a `batch_size` parameter to the `train` method (at https://github.com/Hk669/bpetokenizer/blob/main/bpetokenizer/tokenizer.py) in order to split the texts into more manageable batches.

Tasks
- Add a `batch_size` parameter in the `train` method to split the docs into batches
- Include `batch_size` in the save
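A minimal sketch of what the batching could look like, under the assumption that pair counts can be accumulated per batch; `iter_batches` and `pair_stats_batched` are hypothetical names, and the actual `train` in bpetokenizer/tokenizer.py would fold this into its merge loop:

```python
from collections import Counter

def iter_batches(texts, batch_size):
    """Yield successive batch_size-sized slices of the corpus."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def pair_stats_batched(texts, batch_size=1000):
    """Accumulate adjacent byte-pair counts one batch at a time,
    so the whole corpus is never processed in a single pass."""
    stats = Counter()
    for batch in iter_batches(texts, batch_size):
        for text in batch:
            ids = list(text.encode("utf-8"))  # byte-level ids, 0..255
            stats.update(zip(ids, ids[1:]))
    return stats

# Usage: feed the accumulated counts into the usual merge selection.
texts = ["hello world", "hello there"] * 5000
stats = pair_stats_batched(texts, batch_size=512)
print(stats.most_common(3))
```

Saving `batch_size` alongside the rest of the trained state would then let a reload reproduce the same training configuration.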