v0.6.1
🚀 Streaming v0.6.1
Streaming v0.6.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.6.1
💎 New Features
🚃 Merge metadata from a dataset's subdirectories to form one unified dataset. (#449)
- Added the merge_index() utility method to merge the index files of an MDS dataset's subdirectories. The subdirectories can be local paths or URL paths on any supported cloud provider.
- Check out dataset conversion and the Spark DataFrame to MDS Jupyter notebook for an example in action.
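To make the idea of merging subdirectory index files concrete, here is a minimal standard-library sketch for the local-filesystem case. It is not the library's merge_index() implementation; the function name merge_index_files is hypothetical, and the index layout it relies on (a top-level "shards" list whose entries carry "raw_data" and "zip_data" records with a "basename" field, plus a "version" number) is an assumption made for illustration.

```python
import json
import os


def merge_index_files(root: str) -> str:
    """Merge every <root>/<subdir>/index.json into a single <root>/index.json.

    Hypothetical helper: assumes each index.json holds {"version": ..., "shards": [...]}
    and that shard file names live under "raw_data"/"zip_data" -> "basename".
    """
    merged_shards = []
    # Sort the subdirectories so the merged shard order is deterministic.
    for subdir in sorted(os.listdir(root)):
        index_path = os.path.join(root, subdir, "index.json")
        if not os.path.isfile(index_path):
            continue
        with open(index_path) as f:
            index = json.load(f)
        for shard in index["shards"]:
            # Re-root shard file names so they resolve relative to the merged index.
            for key in ("raw_data", "zip_data"):
                if shard.get(key):
                    shard[key]["basename"] = f"{subdir}/{shard[key]['basename']}"
            merged_shards.append(shard)
    out_path = os.path.join(root, "index.json")
    with open(out_path, "w") as f:
        json.dump({"version": 2, "shards": merged_shards}, f)
    return out_path
```

Prefixing each shard's file name with its subdirectory is what lets a reader of the merged index locate shards without knowing that the dataset was written in separate parts.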
🔁 Retry uploading a file to a cloud provider path. (#448)
- Added upload retry logic with backoff and jitter during dataset conversion, controlled by the retry parameter of Writer.
from streaming import MDSWriter

with MDSWriter(
    ...,
    retry=3) as out:
    for sample in dataset:
        out.write(sample)
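For readers unfamiliar with the pattern, the following is a generic sketch of retrying with exponential backoff and jitter. It does not reproduce the Writer's internal logic: the helper name retry_with_backoff, the delay parameters, and the choice of "full jitter" (sleeping a random time between zero and the current backoff cap) are all assumptions, as is the reading of retries as the number of extra attempts after the first failure.

```python
import random
import time


def retry_with_backoff(func, retries=3, base_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Call func(); on failure, retry up to `retries` more times.

    Hypothetical helper: the backoff cap doubles after each failed attempt
    (bounded by max_delay), and the actual sleep is drawn uniformly from
    [0, cap] so that concurrent uploaders do not retry in lockstep.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                # Out of retries: surface the last error to the caller.
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The sleep function is injectable so the helper can be exercised without real waiting; in normal use the default time.sleep applies.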
🐛 Bug Fixes
- Validate Writer arguments and raise a ValueError exception if any argument is invalid. (#434)
- Terminate the main process if one of the upload threads receives an Exception during dataset conversion. (#448)
🔧 Improvements
- Better-balanced inter-node downloading for the py1e shuffling algorithm by varying shard sample ranges, which helps reduce throughput drops at scale. (#442)
What's Changed
- Validate writer arguments by @karan6181 in #434
- Bump pytest from 7.4.1 to 7.4.2 by @dependabot in #428
- Bump gitpython from 3.1.34 to 3.1.36 by @dependabot in #435
- Fix stylistic issues (mostly 100col, docstring conventions) by @knighton in #439
- Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #436
- py1e randomized by @snarayan21 in #442
- Bump gitpython from 3.1.36 to 3.1.37 by @dependabot in #446
- Fix BatchFeature of Transformers not handled by StreamingDataloader by @Hubert-Bonisseur in #450
- Add a retry logic with backoff and jitter by @karan6181 in #448
- Fix broken bibtext by @Skylion007 in #452
- Update integration test to include sample order comparison by @karan6181 in #456
- Bump pydantic from 2.3.0 to 2.4.2 by @dependabot in #455
- Update MCLI credential page for Databricks by @karan6181 in #466
- Add merge index file utility by @XiaohanZhangCMU in #449
- Add py1e warning when Shuffle block size is smaller than shard size by @snarayan21 in #463
- Fix doc strings by @XiaohanZhangCMU in #469
- Bump fastapi from 0.103.1 to 0.103.2 by @dependabot in #454
- Maintain order for merge_index_from_list by @XiaohanZhangCMU in #472
- Fixed codeql out of disk space issue by @karan6181 in #473
- Bump version to 0.6.1 by @karan6181 in #474
New Contributors
- @Hubert-Bonisseur made their first contribution in #450
Full Changelog: v0.6.0...v0.6.1