v0.17.0
What's New
1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)
Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this will shard a model within a node and replicate across nodes, but Composer will accept a tuple of process groups to specify custom shard/replicate sizes. This can be specified in the FSDP config.
```python
from composer import Trainer

composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    ...
)
```
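For custom shard/replicate sizes, a tuple of process groups can be passed instead of relying on the node-level default. Below is a hedged sketch of what that could look like; the exact `process_group` key and the `(shard, replicate)` tuple format are assumptions based on the description above, so consult the Composer FSDP docs and #2648 for the supported interface:

```python
import torch.distributed as dist

# Assumption for illustration: 16 ranks, sharding within blocks of 4
# and replicating across the 4 resulting blocks.
shard_size = 4
rank = dist.get_rank()
world_size = dist.get_world_size()

# new_group must be called on every rank for every group, so build them all.
shard_groups = [
    dist.new_group(list(range(start, start + shard_size)))
    for start in range(0, world_size, shard_size)
]
replicate_groups = [
    dist.new_group(list(range(offset, world_size, shard_size)))
    for offset in range(shard_size)
]

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
    # Assumption: a (shard, replicate) tuple of process groups.
    'process_group': (
        shard_groups[rank // shard_size],
        replicate_groups[rank % shard_size],
    ),
}
```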
`HYBRID_SHARD` will `FULL_SHARD` a model within the shard block, whereas `_HYBRID_SHARD_ZERO2` will `SHARD_GRAD_OP` within the shard block.
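Switching to the ZERO-2-style variant only changes the sharding strategy string; a minimal sketch:

```python
fsdp_config = {
    # SHARD_GRAD_OP (ZeRO-2 style) within each shard block, replicated across blocks.
    'sharding_strategy': '_HYBRID_SHARD_ZERO2',
}
```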
2. Train Loss NaN Monitor (#2704)
Composer has a new callback that raises a `ValueError` if your training loss becomes NaN. This is very useful to avoid wasting compute if your training run diverges or fails for numerical reasons.
```python
from composer import Trainer
from composer.callbacks import NaNMonitor

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    callbacks=NaNMonitor(),
    ...
)
```
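For intuition, here is a simplified, hypothetical version of the check such a callback performs; the real implementation lives in `composer.callbacks`, and note that #2712 in the changelog below extends it to handle dict losses:

```python
import torch

def loss_has_nan(loss) -> bool:
    """Return True if a loss tensor, or any entry in a dict of losses, is NaN."""
    if isinstance(loss, dict):
        return any(loss_has_nan(v) for v in loss.values())
    return bool(torch.isnan(loss).any())
```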
Bug Fixes
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
What's Changed
- Add partial state dict functionality for FSDP by @b-chu in #2637
- Update monai requirement from <1.3,>=0.9.1 to >=0.9.1,<1.4 by @dependabot in #2643
- Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #2645
- Remove checkpoint on close by @mvpatel2000 in #2646
- Update latest to 2.1 by @mvpatel2000 in #2650
- HSDP Support by @mvpatel2000 in #2648
- Log profile averages by @j316chuck in #2647
- Daily API key by @mvpatel2000 in #2655
- Add automatic remote uploader downloader for composer profiler by @j316chuck in #2653
- Update the AWS_OFI_NCCL version and add in the MPI HWLOC install by @willgleich in #2651
- Fix GCP tests by @mvpatel2000 in #2658
- Allow no eval_loader when eval is disabled by @b-chu in #2657
- Gate HSDP by torch 2.1.0 by @mvpatel2000 in #2656
- Fix FSDP arg default to match torch by @mvpatel2000 in #2660
- Bump pypandoc from 1.11 to 1.12 by @dependabot in #2664
- Bump vit-pytorch from 0.35.8 to 1.6.1 by @dependabot in #2662
- Upgrade to transformers 4.34.1 by @dakinggg in #2635
- Update docker readme by @mvpatel2000 in #2669
- Add script to validate remote object store paths by @irenedea in #2667
- Torch 2.1 Resumption Support by @mvpatel2000 in #2665
- Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in #2663
- Fix dist by @mvpatel2000 in #2670
- Add torch nightly for torch 2.2.0 10-24 by @j316chuck in #2671
- Adding Model Data Init and Training Progress to MosaicMLLogger by @jjanezhang in #2633
- Bump pytest from 7.4.2 to 7.4.3 by @dependabot in #2678
- Bump sphinxext-opengraph from 0.8.2 to 0.9.0 by @dependabot in #2677
- Bump traitlets from 5.10.0 to 5.12.0 by @dependabot in #2674
- Bump cryptography from 41.0.4 to 41.0.5 by @dependabot in #2675
- Secure Code Eval changes by @mvpatel2000 in #2679
- Lazy validation of code eval metric by @mvpatel2000 in #2681
- Upgrade transformers to 4.35 by @dakinggg in #2684
- Bump traitlets from 5.12.0 to 5.13.0 by @dependabot in #2687
- Bump ipykernel from 6.25.2 to 6.26.0 by @dependabot in #2686
- Add Kwargs to upload_object by @nik-mosaic in #2692
- Add version number to composer metadata logs by @j316chuck in #2565
- Add distributed barrier test fixture to ensure pytest cleans up resources properly by @j316chuck in #2694
- Properly handle empty metric_names passed to Trainer._filter_metrics by @irenedea in #2700
- Train loss NaN checking callback by @coryMosaicML in #2704
- Adding logging and force flushing for run events by @jjanezhang in #2703
- [daily-test fix] Add rank 0 gating to test_elastic_resumption state dict comparison by @eracah in #2705
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Update types to follow PEP 585 by @b-chu in #2697
- Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in #2708
- Update wandb requirement from <0.16,>=0.13.2 to >=0.13.2,<0.17 by @dependabot in #2709
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Fix NaN monitor for loss dicts. by @coryMosaicML in #2712
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
- Fetching arguments for FSDP by @mvpatel2000 in #2710
- Bump version to 0.17 by @mvpatel2000 in #2711
New Contributors
- @willgleich made their first contribution in #2651
- @jjanezhang made their first contribution in #2633
Full Changelog: v0.16.4...v0.17.0