Releases: mosaicml/streaming
v0.2.4
🚀 Streaming v0.2.4
Streaming v0.2.4 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.4
What's Changed
- Fix Lossy JPEG reencoding for MDS format by @JJGO in #142
- Add message to size assert & change to KeyError by @samhavens in #146
- Synchronize prefix_int across all ranks to resolve hang issue by @karan6181 in #147
- Pin setuptools in build requirements by @dakinggg in #136
- Graphics. by @knighton in #150
- bump version to 0.2.4 by @karan6181 in #151
New Contributors
- @JJGO made their first contribution in #142
- @samhavens made their first contribution in #146
Full Changelog: v0.2.3...v0.2.4
v0.2.3
🚀 Streaming v0.2.3
Streaming v0.2.3 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.3
New Features
- Add scalar MDS encodings data types (#130)
- Support of WebVid-10M dataset (#132)
- Support of LAION-400M dataset (#87)
- Make StreamingDataset[sample_id] block to download the given sample's shard if it is not present, so that the dataset can be used lazily (#118)
- Support of a Streaming benchmarking script to get time taken by the individual component (#121)
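The lazy access behavior from #118 can be pictured with a toy dataset. This is a minimal sketch and not the library's implementation: `LazyShardedDataset`, `fetch_shard`, and the fixed `samples_per_shard` layout are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List


class LazyShardedDataset:
    """Toy dataset that fetches a shard only when one of its samples is read.

    Hypothetical illustration; not the StreamingDataset implementation.
    """

    def __init__(self, num_samples: int, samples_per_shard: int,
                 fetch_shard: Callable[[int], List[str]]) -> None:
        self.num_samples = num_samples
        self.samples_per_shard = samples_per_shard
        self.fetch_shard = fetch_shard  # stand-in for a blocking download
        self.local_shards: Dict[int, List[str]] = {}  # shards already present

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, sample_id: int) -> str:
        if not 0 <= sample_id < self.num_samples:
            raise IndexError(sample_id)
        shard_id, offset = divmod(sample_id, self.samples_per_shard)
        if shard_id not in self.local_shards:
            # Block until the shard that holds this sample is available.
            self.local_shards[shard_id] = self.fetch_shard(shard_id)
        return self.local_shards[shard_id][offset]


downloaded: List[int] = []  # records which shards were fetched


def fake_fetch(shard_id: int) -> List[str]:
    """Pretend download: returns the four samples stored in this shard."""
    downloaded.append(shard_id)
    return [f'sample-{shard_id * 4 + i}' for i in range(4)]


dataset = LazyShardedDataset(num_samples=12, samples_per_shard=4, fetch_shard=fake_fetch)
value = dataset[9]  # only the shard holding sample 9 is fetched
```

Reading another sample from the same shard afterwards does not trigger a second fetch, which is what makes random access cheap once a shard is local.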
Bug Fixes
- Nuke concat option in C4 dataset (#129)
- Fixed bug report markdown doc (#140)
- Fixed ADE20K dataset conversion script (#133)
What's Changed
- Make getitem block to download shard if not present. by @knighton in #118
- 2022 -> 2023. by @knighton in #119
- Benchmark generating the epoch. by @knighton in #121
- Move datasets dependency into .[dev]. by @knighton in #123
- Bump sphinxcontrib-katex from 0.9.3 to 0.9.4 by @dependabot in #113
- Bump sphinxext-opengraph from 0.7.4 to 0.7.5 by @dependabot in #114
- Bump pytest from 7.2.0 to 7.2.1 by @dependabot in #124
- Bump fastapi from 0.88.0 to 0.89.1 by @dependabot in #125
- Bump yamllint from 1.28.0 to 1.29.0 by @dependabot in #126
- Update paramiko requirement from <3,>=2.11.0 to >=2.11.0,<4 by @dependabot in #127
- Bump nbsphinx from 0.8.11 to 0.8.12 by @dependabot in #128
- Nuke concat option. by @knighton in #129
- Add scalar MDS encodings (data types). by @knighton in #130
- WebVid. by @knighton in #132
- LAION-400M processing by @knighton in #87
- Update isort version by @karan6181 in #135
- Update pre-commit requirement from <3,>=2.18.1 to >=2.18.1,<4 by @dependabot in #134
- Fixed bug report markdown by @karan6181 in #140
- Fix ade20k conversion script by @dblalock in #133
- bump version to 0.2.3 by @karan6181 in #141
Full Changelog: v0.2.2...v0.2.3
v0.2.2
🚀 Streaming v0.2.2
Streaming v0.2.2 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.2
New Features
Bug Fixes
- Get dataloader worker multiprocessing working with spawn, removing Mac OSX fork requirement (#97)
- Improve error messaging (#100)
- Fix CUDA OOM (#103)
- Fix broken source code links in docs (#104)
- Reference the shared memory object in a worker process when using spawn multiprocessing method (#106)
- Release all the StreamingDataset resources during job termination (#107)
What's Changed
- Lazily instantiate the worker barrier in iter (so it all pickles). by @knighton in #97
- linkcode -> viewcode by @dakinggg in #104
- Update writer.py by @sophiawisdom in #100
- Bump sphinxext-opengraph from 0.7.3 to 0.7.4 by @dependabot in #105
- Removed cuda memory allocation which was causing CUDA OOM by @karan6181 in #103
- Reference the shared memory object in a worker process when using spawn multiprocessing method by @karan6181 in #106
- Release all the StreamingDataset resources during job termination by @karan6181 in #107
- Bump gitpython from 3.1.29 to 3.1.30 by @dependabot in #109
- Bump nbsphinx from 0.8.10 to 0.8.11 by @dependabot in #111
- Visualize partitioning by @knighton in #108
- Command-line partitioning visualizer. by @knighton in #115
- Fix (sys.meta_path is None, Python is likely shutting down) by @knighton in #116
- Bump version. by @knighton in #117
New Contributors
- @dakinggg made their first contribution in #104
- @sophiawisdom made their first contribution in #100
Full Changelog: v0.2.1...v0.2.2
v0.2.1
🚀 Streaming v0.2.1
Streaming v0.2.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.1
Bug Fixes
- Make StreamingDataset smarter about when to init dist itself, fixing env var rendezvous problem (#94).
- Shorten shared memory names for Mac OSX (#95).
- Reduce memory usage in StreamingDataset, alleviating inscrutable worker OOMs with large datasets (#96).
- Better exception handling in downloading (#98).
- Hard require fork for dataloader multiprocessing in Mac OSX due to unpickleable objects (#101).
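The check described for #94 (only initialize the distributed process group when the rendezvous environment is actually configured) might look roughly like the following. This is a sketch under assumptions: `dist_env_is_set` is a hypothetical helper, and the variable list is the one used by PyTorch's `env://` initialization, which may differ from what the library inspects.

```python
import os
from typing import Mapping, Optional

# Variables read by torch.distributed's env:// rendezvous.
DIST_ENV_VARS = ('RANK', 'WORLD_SIZE', 'MASTER_ADDR', 'MASTER_PORT')


def dist_env_is_set(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if every rendezvous variable is present.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return all(name in env for name in DIST_ENV_VARS)


single_process = dist_env_is_set({})  # nothing configured: skip dist init
launched = dist_env_is_set({
    'RANK': '0',
    'WORLD_SIZE': '8',
    'MASTER_ADDR': '127.0.0.1',
    'MASTER_PORT': '29500',
})
```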
What's Changed
- Also check if dist env vars are set. If not set, don't init dist. by @knighton in #94
- Shorten the names of shared memory objects to make OSX happy. by @knighton in #95
- Just do the partitioning/shuffling in the local leader worker. by @knighton in #96
- propagate the actual exception and raise by @karan6181 in #98
- Set multiprocessing method as fork for Mac OS by @karan6181 in #101
- Bump version to 0.2.1 by @karan6181 in #102
Full Changelog: v0.2.0...v0.2.1
v0.2.0
🚀 Streaming v0.2.0
Streaming v0.2.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.0
New Features
- Elastic world size deterministic shuffle
  Shuffled or not, StreamingDataset now collectively traverses the samples in identical order across all the devices, given a seed and a canonical number of nodes. This ordering holds true even if you checkpoint and resume training of the same epoch on a different number of nodes.
- Instant Mid-Epoch Resumption
  Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.
- NEW StreamingDataLoader
  A StreamingDataLoader is a drop-in replacement for your PyTorch DataLoader with a Mid-Epoch Resumption functionality where it resumes from where you left off without spinning the dataloader.
- Support for Oracle Cloud Infrastructure (OCI) blob storage
  Streaming now supports OCI blob storage as a storage backend for streaming. One can pass the OCI blob storage as either oci://<bucket_name>@<namespace>/<folder_name>/<filename> or oci://<bucket_name>/<folder_name>/<filename> to a StreamingDataset class. For example:
      from streaming import StreamingDataset
      remote = 'oci://<bucket>@<namespace>/<path>'
      local = '/tmp/dataset/'
      train_dataset = StreamingDataset(local=local, remote=remote, split='train')
  Streaming expects the credentials to be present in ~/.oci/config path.
- Support for public AWS S3 buckets
  Streaming now supports AWS S3 buckets which are public resources that can be accessed without credentials, apart from the already supported private AWS S3 buckets. One can instantiate the StreamingDataset class with an AWS S3 bucket as follows:
      from streaming import StreamingDataset
      remote = 's3://<bucket>/<path>'
      local = '/tmp/dataset/'
      train_dataset = StreamingDataset(local=local, remote=remote, split='train')
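The idea behind the elastic deterministic shuffle can be sketched in a few lines. This is a simplified illustration, not the library's algorithm (which partitions over a canonical number of nodes): the function names are hypothetical, and the only point shown is that when the sample order depends on the seed and epoch alone, the collective order is the same on any number of devices, including after a mid-epoch resume.

```python
import random
from typing import List


def global_order(num_samples: int, seed: int, epoch: int) -> List[int]:
    """Sample order that depends only on the seed and epoch, not on the world size."""
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)
    return order


def rank_samples(order: List[int], rank: int, world_size: int,
                 resume_from: int = 0) -> List[int]:
    """Samples one rank reads, skipping the first `resume_from` already-seen samples."""
    return order[resume_from:][rank::world_size]


def collective_order(order: List[int], world_size: int,
                     resume_from: int = 0) -> List[int]:
    """Interleave the per-rank streams step by step to recover what the job saw overall."""
    streams = [rank_samples(order, r, world_size, resume_from) for r in range(world_size)]
    merged: List[int] = []
    for step in range(max(len(s) for s in streams)):
        for stream in streams:
            if step < len(stream):
                merged.append(stream[step])
    return merged


order = global_order(num_samples=16, seed=9176, epoch=0)
on_two = collective_order(order, world_size=2)
on_four = collective_order(order, world_size=4)
# Consume 8 samples on 2 devices, then resume the same epoch on 4 devices.
resumed = order[:8] + collective_order(order, world_size=4, resume_from=8)
```

Because resuming only needs the number of samples already consumed, no data has to be replayed to find the restart point, which is the intuition behind instant mid-epoch resumption.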
API changes
- The class Dataset has been renamed to StreamingDataset (#37).
  - Similarly, the built-in classes for the most popular datasets have also been renamed. For example:
    - C4 renamed as StreamingC4
    - EnWiki renamed as StreamingEnWiki
    - Pile renamed as StreamingPile
    - ADE20K renamed as StreamingADE20K
    - CIFAR10 renamed as StreamingCIFAR10
    - COCO renamed as StreamingCOCO
    - ImageNet renamed as StreamingImageNet
- The parameter prefetch in class Dataset has been renamed as predownload in class StreamingDataset (#37).
- The parameter retry in class Dataset has been renamed as download_retry in class StreamingDataset (#37).
- The parameter timeout in class Dataset has been renamed as download_timeout in class StreamingDataset (#37).
- The parameter hash in class Dataset has been renamed as validate_hash in class StreamingDataset (#37).
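When updating call sites, the parameter renames listed above can be applied mechanically. The helper below is a hypothetical convenience for illustration (it is not part of the library); the mapping itself is taken from the renames in this section.

```python
from typing import Any, Dict

# Old `Dataset` parameter name -> new `StreamingDataset` parameter name.
RENAMED_PARAMS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `old_kwargs` with renamed parameters translated.

    Hypothetical migration helper; unknown names pass through unchanged.
    """
    return {RENAMED_PARAMS.get(name, name): value for name, value in old_kwargs.items()}


new_kwargs = migrate_kwargs({'local': '/tmp/dataset/', 'prefetch': 100_000, 'retry': 2})
```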
What's Changed
- Bump nbsphinx from 0.8.9 to 0.8.10 by @dependabot in #73
- Bump sphinx-argparse from 0.3.2 to 0.4.0 by @dependabot in #74
- The Pile (conversion + streaming dataset) by @knighton in #71
- [Docs] Switch back to RTD search by @bandish-shah in #83
- make pyright precommit check actually run by @dblalock in #84
- Fixed stale URL references by @bandish-shah in #85
- Bump sphinx-copybutton from 0.5.0 to 0.5.1 by @dependabot in #78
- Bump pandoc from 2.2 to 2.3 by @dependabot in #79
- Bump sphinxcontrib-katex from 0.9.0 to 0.9.3 by @dependabot in #80
- Bump sphinxext-opengraph from 0.7.2 to 0.7.3 by @dependabot in #81
- Support for concat option in C4 Dataset by @karan6181 in #77
- Elastic world size deterministic shuffle with mid-epoch resumption by @knighton in #37
- Support for S3 public bucket by @karan6181 in #88
- Add OCI Cloud Storage support by @karan6181 in #86
- Make StreamingDataset state_dict() more flexible by @knighton in #90
- Bump version to 0.2.0 by @karan6181 in #92
Full Changelog: v0.1.2...v0.2.0
v0.1.2
🚀 Streaming v0.1.2
Streaming v0.1.2 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.1.2
What's Changed
- Fixed contributing page link by @karan6181 in #61
- Add Distributed test and supported multi device unittest by @karan6181 in #57
- Added template and adhere to standard coding practice by @karan6181 in #62
- Bump pytest from 7.1.3 to 7.2.0 by @dependabot in #63
- Bump pypandoc from 1.9 to 1.10 by @dependabot in #65
- Add code coverage report and moved scripts outside of src by @karan6181 in #66
- Bump sphinxext-opengraph from 0.6.3 to 0.7.2 by @dependabot in #67
- Add Google Cloud Storage support by @karan6181 in #68
- Create and push release branch as part of workflow by @karan6181 in #69
- Add test CI badge in README by @karan6181 in #70
- Add unit test for download, encodings, hashing, and others by @karan6181 in #72
- Bump version to 0.1.2 by @karan6181 in #75
Full Changelog: v0.1.1...v0.1.2
v0.1.1
🚀 Streaming v0.1.1
Streaming v0.1.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.1.1
What's Changed
- Streaming datasets V2 by @knighton in #2
- Initial Docs Site by @bandish-shah in #3
- Added a ADE20K and COCO2017 data conversion scripts by @karan6181 in #5
- Added pre-commit config by @karan6181 in #6
- Added pre-commit config for a License Header by @karan6181 in #7
- Convert relative imports to absolute imports by @karan6181 in #8
- C4 dataset by @knighton in #4
- Add a ADE20K streaming dataset class by @karan6181 in #9
- PyPi mods for setup.py by @bandish-shah in #10
- Disable local shard deletion by @knighton in #12
- Add a COCO streaming dataset class by @karan6181 in #13
- Add docstrings. by @knighton in #14
- Added unittest for Writer and Reader by @karan6181 in #16
- added new streaming logos by @ejyuen in #15
- Update package version code for unification by @karan6181 in #17
- Fix wait-for-unzip race by @knighton in #18
- Added algolia search to streaming docs site by @nqn in #19
- Add a pre-commit GitHub workflow by @karan6181 in #21
- Added pydocstyle and docformatter in pre-commit config by @karan6181 in #20
- Improve algorithmic complexity of sample-to-shard lookup from O(log N) to O(1) by @knighton in #22
- Add enwiki-20200101 streaming dataset by @knighton in #23
- Add submodules to api reference doc by @karan6181 in #24
- Initial Docs site content by @bandish-shah in #11
- Add unittest for compression by @karan6181 in #25
- Fix hang when compression is used but compressed files are not retained by @knighton in #26
- Add long_description for packaging by @bandish-shah in #29
- Update tutorial notebooks to have it run end-to-end by @karan6181 in #30
- Adjustment for last partition bug by @knighton in #27
- Fix preprocessing for English Wikipedia dataset by @knighton in #28
- Fix enwiki dataset by @dskhudia in #31
- Skip pre-commit check for enwiki convert skip to have code parity by @karan6181 in #32
- Update doc and fixed reference links by @karan6181 in #33
- Parallel tfrecord creation, validate sample counts vs MDS by @knighton in #34
- Bump up the version to 0.0.1b by @karan6181 in #35
- Add NLP synthetic dataset jupyter notebook tutorial by @karan6181 in #36
- Add README and CONTRIBUTING guide by @karan6181 in #38
- Typos + copy editing in README by @dblalock in #40
- Re-factor docs tutorials to top-level examples by @bandish-shah in #39
- Fixed typos and update documentation by @karan6181 in #42
- Add CodeQL security scanner and Dependabot workflow by @karan6181 in #43
- Bump gitpython from 3.1.28 to 3.1.29 by @dependabot in #46
- Bump myst-parser from 0.16.1 to 0.18.1 by @dependabot in #47
- Add bug report and feature request template by @karan6181 in #48
- mlperf enwiki conversion code mild cleanup by @knighton in #41
- Add Build publish to PyPI and create GitHub release workflow by @karan6181 in #50
- Added writer unittest and update existing test by @karan6181 in #52
- Bump version to 0.1.0 by @karan6181 in #53
- Fixed dead image link in pypi home page by @karan6181 in #54
- Add TorchVision VisionDataset inheritance. by @knighton in #55
- bump version to 0.1.1b0 by @karan6181 in #56
- Fixed rendering of pypi image by @karan6181 in #59
- Bump version to 0.1.1 by @karan6181 in #60
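For the sample-to-shard lookup change in #22, the difference between an O(log N) and an O(1) lookup can be illustrated as follows. How the PR achieves it is not described in these notes, so the precomputed table below is an assumption: it is one common way to trade memory for constant-time lookups, and the shard sizes are made up.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List

samples_per_shard = [3, 5, 2]  # hypothetical shard sizes
# Index of the first sample in each shard: [0, 3, 8]
shard_offsets = [0] + list(accumulate(samples_per_shard))[:-1]


def shard_of_bisect(sample_id: int) -> int:
    """O(log N): binary search over the shard starting offsets."""
    return bisect_right(shard_offsets, sample_id) - 1


# O(1) per lookup: precompute one table entry per sample.
shard_table: List[int] = [
    shard_id for shard_id, count in enumerate(samples_per_shard) for _ in range(count)
]


def shard_of_table(sample_id: int) -> int:
    """O(1): direct index into the precomputed table."""
    return shard_table[sample_id]
```

Both functions agree on every sample; the table version spends one integer of memory per sample to avoid the search.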
New Contributors
- @knighton made their first contribution in #2
- @bandish-shah made their first contribution in #3
- @karan6181 made their first contribution in #5
- @ejyuen made their first contribution in #15
- @nqn made their first contribution in #19
- @dskhudia made their first contribution in #31
- @dblalock made their first contribution in #40
- @dependabot made their first contribution in #46
Full Changelog: https://github.com/mosaicml/streaming/commits/v0.1.1