Replies: 9 comments 1 reply
-
Okay, I only understood a little bit of your code, but I think I understood enough to give a solution.
-
@zacharias1219 I understood points 1, 2, and 4, but I'm not very sure about point 3.
-
Sure.
-
Also, this is my first time contributing, so I hope it is up to standard.
-
Sure, would love to see a PR on this.
-
Okay, got your point: basically it's better to use `re.escape` rather than the pair function. Will work on it soon.
-
Using `re.escape`, and encoding the batch of chunks to byte-level IDs.
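For context, a minimal sketch of both steps, assuming special tokens are kept in a dict (the token names and values below are illustrative, not the repo's actual ones):

```python
import re

# Hypothetical special-token table; names and ids are illustrative.
special_tokens = {"<|endoftext|>": 100257, "<|pad|>": 100258}

# re.escape quotes regex metacharacters, so the "|" inside a token like
# <|endoftext|> is matched literally instead of being read as alternation.
special_pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"

text = "hello<|endoftext|>world"
# Splitting on a capturing group keeps the special tokens in the output.
chunks = re.split(special_pattern, text)
print(chunks)  # ['hello', '<|endoftext|>', 'world']

# Byte-level ids for ordinary chunks: each UTF-8 byte maps to an int 0..255.
ids = [list(chunk.encode("utf-8")) for chunk in chunks if chunk not in special_tokens]
print(ids)  # [[104, 101, 108, 108, 111], [119, 111, 114, 108, 100]]
```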
-
What if we use collections.Counter to improve the performance? Counter is a subclass of Python's dict specifically designed for counting hashable objects (like tuples, strings, or integers), and it's optimized for efficiently storing and updating counts for these objects.
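As a rough illustration of the suggestion, pair counting built on Counter; `get_stats` is a hypothetical helper name and may not match the repo's:

```python
from collections import Counter

def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs.
    zip(ids, ids[1:]) yields every consecutive pair as a tuple,
    and Counter tallies them in one pass."""
    return Counter(zip(ids, ids[1:]))

ids = [101, 32, 101, 32, 101]
stats = get_stats(ids)
print(stats.most_common(1))  # [((101, 32), 2)]
```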
-
Issue
When passed a very large dataset, the tokenizer's vocabulary-building process takes an excessively long time to complete.

Enhancement Proposal
To address this, add a `batch_size` parameter to the `train` method (at https://github.com/Hk669/bpetokenizer/blob/main/bpetokenizer/tokenizer.py) in order to split the texts into more manageable batches.

Tasks
- Add a `batch_size` parameter in the `train` method to split the docs into batches
- Include `batch_size` in the save
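A minimal sketch of what the batching could look like, under the assumption that pair counts can be accumulated per batch; `iter_batches` and `pair_stats_batched` are hypothetical names, and the actual `train` in bpetokenizer/tokenizer.py would fold this into its merge loop:

```python
from collections import Counter

def iter_batches(texts, batch_size):
    """Yield successive batch_size-sized slices of the corpus."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def pair_stats_batched(texts, batch_size=1000):
    """Accumulate adjacent byte-pair counts one batch at a time,
    so the whole corpus is never processed in a single pass."""
    stats = Counter()
    for batch in iter_batches(texts, batch_size):
        for text in batch:
            ids = list(text.encode("utf-8"))  # byte-level ids, 0..255
            stats.update(zip(ids, ids[1:]))
    return stats

# Usage: feed the accumulated counts into the usual merge selection.
texts = ["hello world", "hello there"] * 5000
stats = pair_stats_batched(texts, batch_size=512)
print(stats.most_common(3))
```

Saving `batch_size` alongside the rest of the trained state would then let a reload reproduce the same training configuration.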