
v0.17.0

@mvpatel2000 released this 16 Nov 00:23

What's New

1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)

Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this shards a model within a node and replicates it across nodes, but Composer also accepts a tuple of process groups to specify custom shard/replicate sizes (see the sketch below the strategy notes). This is specified in the FSDP config.

  from composer import Trainer

  composer_model = MyComposerModel(n_layers=3)

  # HYBRID_SHARD shards the model within a node and replicates it across nodes by default
  fsdp_config = {
      'sharding_strategy': 'HYBRID_SHARD',
  }

  trainer = Trainer(
      model=composer_model,
      max_duration='4ba',
      fsdp_config=fsdp_config,
      ...
  )

Within each shard block, HYBRID_SHARD applies FULL_SHARD sharding, whereas _HYBRID_SHARD_ZERO2 applies SHARD_GRAD_OP sharding.
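For custom shard/replicate sizes, a tuple of process groups can be passed through the FSDP config. Below is a minimal sketch, assuming torch.distributed is already initialized and that the tuple is accepted under a 'process_group' key with (shard group, replicate group) ordering; the key name and tuple layout are assumptions, not a confirmed interface.

  # Hypothetical sketch: shard within blocks of 4 ranks, replicate across blocks.
  import torch.distributed as dist

  from composer import Trainer

  shard_size = 4                      # number of ranks per shard block
  world_size = dist.get_world_size()  # assumed divisible by shard_size
  rank = dist.get_rank()

  # Every rank must participate in every new_group call, so build all groups
  # everywhere and let each rank pick the pair it belongs to.
  shard_groups = [
      dist.new_group(list(range(i, i + shard_size)))
      for i in range(0, world_size, shard_size)
  ]
  replicate_groups = [
      dist.new_group(list(range(i, world_size, shard_size)))
      for i in range(shard_size)
  ]

  fsdp_config = {
      'sharding_strategy': 'HYBRID_SHARD',
      'process_group': (
          shard_groups[rank // shard_size],     # ranks this rank shards with
          replicate_groups[rank % shard_size],  # ranks holding replicas of this shard
      ),
  }

  trainer = Trainer(
      model=MyComposerModel(n_layers=3),  # placeholder model from the example above
      max_duration='4ba',
      fsdp_config=fsdp_config,
  )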

2. Train Loss NaN Monitor (#2704)

Composer has a new callback that raises a ValueError if your training loss becomes NaN. This avoids wasting compute when a run diverges or fails for numerical reasons.

  from composer import Trainer
  from composer.callbacks import NaNMonitor

  composer_model = MyComposerModel(n_layers=3)

  # NaNMonitor raises a ValueError as soon as the training loss becomes NaN
  trainer = Trainer(
      model=composer_model,
      max_duration='4ba',
      callbacks=NaNMonitor(),
      ...
  )

Bug Fixes

What's Changed

New Contributors

Full Changelog: v0.16.4...v0.17.0