v0.17.0
What's New
1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)
Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this will shard a model within a node and replicate across nodes, but Composer will accept a tuple of process groups to specify custom shard/replicate sizes. This can be specified in the FSDP config.
```python
from composer import Trainer

composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    ...
)
```
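For custom shard/replicate sizes, a tuple of process groups can be passed instead of relying on the node-level default. Below is a hedged sketch of what that could look like; the exact `process_group` key and the `(shard, replicate)` tuple format are assumptions based on the description above, so consult the Composer FSDP docs and #2648 for the supported interface:

```python
import torch.distributed as dist

# Assumption for illustration: 16 ranks, sharding within blocks of 4
# and replicating across the 4 resulting blocks.
shard_size = 4
rank = dist.get_rank()
world_size = dist.get_world_size()

# new_group must be called on every rank for every group, so build them all.
shard_groups = [
    dist.new_group(list(range(start, start + shard_size)))
    for start in range(0, world_size, shard_size)
]
replicate_groups = [
    dist.new_group(list(range(offset, world_size, shard_size)))
    for offset in range(shard_size)
]

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
    # Assumption: a (shard, replicate) tuple of process groups.
    'process_group': (
        shard_groups[rank // shard_size],
        replicate_groups[rank % shard_size],
    ),
}
```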
`HYBRID_SHARD` will `FULL_SHARD` a model within the shard block, whereas `_HYBRID_SHARD_ZERO2` will `SHARD_GRAD_OP` within the shard block.
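Switching to the ZERO-2-style variant only changes the sharding strategy string; a minimal sketch:

```python
fsdp_config = {
    # SHARD_GRAD_OP (ZeRO-2 style) within each shard block, replicated across blocks.
    'sharding_strategy': '_HYBRID_SHARD_ZERO2',
}
```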
2. Train Loss NaN Monitor (#2704)
Composer has a new callback that raises a `ValueError` if your training loss becomes NaN. This is very useful to avoid wasting compute if your training run diverges or fails for numerical reasons.
```python
from composer import Trainer
from composer.callbacks import NaNMonitor

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    callbacks=NaNMonitor(),
    ...
)
```
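For intuition, here is a simplified, hypothetical version of the check such a callback performs; the real implementation lives in `composer.callbacks`, and note that #2712 in the changelog below extends it to handle dict losses:

```python
import torch

def loss_has_nan(loss) -> bool:
    """Return True if a loss tensor, or any entry in a dict of losses, is NaN."""
    if isinstance(loss, dict):
        return any(loss_has_nan(v) for v in loss.values())
    return bool(torch.isnan(loss).any())
```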
Bug Fixes
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
What's Changed
- Add partial state dict functionality for FSDP by @b-chu in #2637
- Update monai requirement from <1.3,>=0.9.1 to >=0.9.1,<1.4 by @dependabot in #2643
- Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #2645
- Remove checkpoint on close by @mvpatel2000 in #2646
- Update latest to 2.1 by @mvpatel2000 in #2650
- HSDP Support by @mvpatel2000 in #2648
- Log profile averages by @j316chuck in #2647
- Daily API key by @mvpatel2000 in #2655
- Add automatic remote uploader downloader for composer profiler by @j316chuck in #2653
- Update the AWS_OFI_NCCL version and add in the MPI HWLOC install by @willgleich in #2651
- Fix GCP tests by @mvpatel2000 in #2658
- Allow no eval_loader when eval is disabled by @b-chu in #2657
- Gate HSDP by torch 2.1.0 by @mvpatel2000 in #2656
- Fix FSDP arg default to match torch by @mvpatel2000 in #2660
- Bump pypandoc from 1.11 to 1.12 by @dependabot in #2664
- Bump vit-pytorch from 0.35.8 to 1.6.1 by @dependabot in #2662
- Upgrade to transformers 4.34.1 by @dakinggg in #2635
- Update docker readme by @mvpatel2000 in #2669
- Add script to validate remote object store paths by @irenedea in #2667
- Torch 2.1 Resumption Support by @mvpatel2000 in #2665
- Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in #2663
- Fix dist by @mvpatel2000 in #2670
- Add torch nightly for torch 2.2.0 10-24 by @j316chuck in #2671
- Adding Model Data Init and Training Progress to MosaicMLLogger by @jjanezhang in #2633
- Bump pytest from 7.4.2 to 7.4.3 by @dependabot in #2678
- Bump sphinxext-opengraph from 0.8.2 to 0.9.0 by @dependabot in #2677
- Bump traitlets from 5.10.0 to 5.12.0 by @dependabot in #2674
- Bump cryptography from 41.0.4 to 41.0.5 by @dependabot in #2675
- Secure Code Eval changes by @mvpatel2000 in #2679
- Lazy validation of code eval metric by @mvpatel2000 in #2681
- Upgrade transformers to 4.35 by @dakinggg in #2684
- Bump traitlets from 5.12.0 to 5.13.0 by @dependabot in #2687
- Bump ipykernel from 6.25.2 to 6.26.0 by @dependabot in #2686
- Add Kwargs to upload_object by @nik-mosaic in #2692
- Add version number to composer metadata logs by @j316chuck in #2565
- Add distributed barrier test fixture to ensure pytest cleans up resources properly by @j316chuck in #2694
- Properly handle empty metric_names passed to Trainer._filter_metrics by @irenedea in #2700
- Train loss NaN checking callback by @coryMosaicML in #2704
- Adding logging and force flushing for run events by @jjanezhang in #2703
- [daily-test fix] Add rank 0 gating to test_elastic_resumption state dict comparison by @eracah in #2705
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Update types to follow PEP 585 by @b-chu in #2697
- Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in #2708
- Update wandb requirement from <0.16,>=0.13.2 to >=0.13.2,<0.17 by @dependabot in #2709
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Fix NaN monitor for loss dicts. by @coryMosaicML in #2712
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
- Fetching arguments for FSDP by @mvpatel2000 in #2710
- Bump version to 0.17 by @mvpatel2000 in #2711
New Contributors
- @willgleich made their first contribution in #2651
- @jjanezhang made their first contribution in #2633
Full Changelog: v0.16.4...v0.17.0