We just uploaded the raw data (tokenized, unpacked, unfiltered) and added download instructions to the README. We also added a reference to datatools, the codebase we used to process, filter, and pack the data. We'll add a README for it soon, but it should also be relatively easy to implement simple filtering/packing logic on top of our tokenized raw data.
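For anyone who wants a starting point before the datatools README lands, here is a minimal sketch of what simple filtering and packing over tokenized documents could look like. It is not the datatools implementation. It assumes each document is already a list of token IDs, and the function names, the minimum-length filter, and the `eos_id` separator are all hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, List


def filter_by_length(docs: Iterable[List[int]], min_tokens: int) -> Iterator[List[int]]:
    """Keep only documents with at least `min_tokens` tokens (hypothetical filter)."""
    for doc in docs:
        if len(doc) >= min_tokens:
            yield doc


def pack_sequences(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> Iterator[List[int]]:
    """Concatenate documents, appending `eos_id` after each one, and emit
    fixed-length chunks of `seq_len` tokens. A trailing partial chunk is dropped."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # assumed document separator
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


if __name__ == "__main__":
    # Toy token IDs standing in for the tokenized raw data.
    raw_docs = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
    kept = filter_by_length(raw_docs, min_tokens=2)
    for seq in pack_sequences(kept, seq_len=4, eos_id=0):
        print(seq)
```

With a different tokenizer, the raw text would need to be re-tokenized first; the packing step itself only depends on the token ID lists and the chosen sequence length.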
Hi authors, congrats on the great work!
Would it be possible to share your recipe for creating the training dataset? I am looking to create a similar dataset with a different tokenizer.
Thanks in advance :)