Releases: mosaicml/composer

v0.27.0

14 Nov 19:35

What's New

1. Torch 2.5.1 Compatibility (#3701)

We've added support for torch 2.5.1, including checkpointing bug fixes from PyTorch.

2. Add batch/microbatch transforms (#3703)

We sped up device transformations by applying batch transforms on CPU and microbatch transforms on GPU.
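As a rough illustration of that split, here is a minimal single-process sketch. It is not Composer's implementation: the function run_transforms, its parameters, and the use of tuple() as a stand-in for moving data to an accelerator are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

Batch = List[int]
DeviceBatch = Tuple[int, ...]


def run_transforms(
    batch: Batch,
    microbatch_size: int,
    batch_transform: Callable[[Batch], Batch],
    microbatch_transform: Callable[[DeviceBatch], DeviceBatch],
    to_device: Callable[[Batch], DeviceBatch],
) -> List[DeviceBatch]:
    """Hypothetical helper: batch transform on host, microbatch transform on device."""
    # The batch-level transform runs once, while the data is still on the host.
    batch = batch_transform(batch)
    outputs = []
    for start in range(0, len(batch), microbatch_size):
        # Only one microbatch at a time is moved to the (simulated) device.
        microbatch = to_device(batch[start:start + microbatch_size])
        # The microbatch-level transform runs after the move.
        outputs.append(microbatch_transform(microbatch))
    return outputs


outputs = run_transforms(
    batch=[1, 2, 3, 4, 5],
    microbatch_size=2,
    batch_transform=lambda b: [2 * x for x in b],
    microbatch_transform=lambda m: tuple(x + 1 for x in m),
    to_device=tuple,  # stand-in for a real host-to-accelerator copy
)
print(outputs)
```

In a real training loop the two transforms would typically operate on tensors, and to_device would be an actual device copy.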

Deprecations and Breaking Changes

1. MLflow Metrics Deduplication (#3678)

We added a metric de-duplication feature for the MLflow logger in Composer. A metric whose value is unchanged since the last step is not logged again unless a condition is met; by default, that condition is reaching a multiple of 100 duplicated-metric steps. This reduces redundant entries and saves logging storage while still sampling unchanged metrics periodically.

Example:

MLFlowLogger(..., log_duplicated_metric_every_n_steps=100)
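To make the de-duplication rule concrete, the sketch below shows one possible reading of it: log a metric when its value changes, or when the step is a multiple of log_duplicated_metric_every_n_steps. The MetricDeduplicator class and its should_log method are hypothetical names for this illustration, and the exact condition Composer uses may differ.

```python
from typing import Dict


class MetricDeduplicator:
    """Hypothetical helper, not part of Composer."""

    def __init__(self, log_duplicated_metric_every_n_steps: int = 100) -> None:
        self.every_n = log_duplicated_metric_every_n_steps
        self._last_value: Dict[str, float] = {}  # metric name -> last value seen

    def should_log(self, name: str, value: float, step: int) -> bool:
        # A metric is "changed" if it is new or differs from the last value seen.
        changed = name not in self._last_value or self._last_value[name] != value
        self._last_value[name] = value
        # Assumption: unchanged metrics are still logged on every n-th step.
        return changed or step % self.every_n == 0


dedup = MetricDeduplicator(log_duplicated_metric_every_n_steps=100)
# A constant learning rate over 300 steps is logged only a handful of times.
logged_steps = [step for step in range(1, 301) if dedup.should_log("lr", 0.1, step)]
print(logged_steps)
```

With a constant metric, only the first step and every 100th step pass the check, so far fewer entries reach the tracking server.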

What's Changed

Full Changelog: v0.26.1...v0.27.0

v0.26.1

01 Nov 06:07

What's Changed

Full Changelog: v0.26.0...v0.26.1

v0.26.0

25 Oct 21:36

What's New

1. Torch 2.5.0 Compatibility (#3609)

We've added support for torch 2.5.0, including necessary patches to Torch.

Deprecations and Breaking Changes

1. FSDP Configuration Changes (#3681)

We no longer support passing fsdp_config and fsdp_auto_wrap directly to Trainer.

If you'd like to specify an fsdp config and configure fsdp auto wrapping, you should use parallelism_config.

from composer import Trainer

trainer = Trainer(
    parallelism_config={
        'fsdp': {
            'auto_wrap': True,
            # ... other FSDP options
        },
    },
)

2. Removal of PyTorch Legacy Sharded Checkpoint Support (#3631)

PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have removed support for this format. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.

What's Changed

Full Changelog: v0.25.0...v0.26.0

v0.25.0

24 Sep 20:56

What's New

1. Torch 2.4.1 Compatibility (#3609)

We've added support for torch 2.4.1, including necessary patches to Torch.

Deprecations and Breaking Changes

1. Microbatch device movement (#3567)

Instead of moving the entire batch to device at once, we now move each microbatch to device. This saves memory for large inputs, e.g. multimodal data, when training with many microbatches.

This change may affect callbacks that operate on the batch and require it to already be on an accelerator, such as the two changed in this PR. We expect few such callbacks, so this change should be relatively safe.

2. DeepSpeed deprecation version (#3634)

We have updated the Composer version in which we will remove DeepSpeed support to 0.27.0. Please reach out on GitHub if you have any concerns about this.

3. PyTorch legacy sharded checkpoint format

PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have continued to support loading legacy format checkpoints for a while, but we will likely be removing support for this format entirely in an upcoming release. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.

What's Changed

Full Changelog: v0.24.1...v0.25.0

v0.24.1

27 Aug 22:37
3c7fefb

Bug Fixes

1. Disallow passing device_mesh to FSDPConfig (#3580)

Explicitly errors if device_mesh is passed to FSDPConfig. This completes the deprecation from v0.24.0 and also addresses cases where a user specified a device mesh but it was ignored, leading to training with the incorrect parallelism style (e.g., using FSDP instead of HSDP).

What's Changed

Full Changelog: v0.24.0...v0.24.1

v0.24.0

26 Aug 14:48
020b0ef

What's New

1. Torch 2.4 Compatibility (#3542, #3549, #3553, #3552, #3565)

Composer now supports Torch 2.4! We are tracking a few checkpointing issues in the latest PyTorch that we have raised with the PyTorch team:

  • [PyTorch Issue] Distributed checkpointing using PyTorch DCP has issues with stateless optimizers, e.g. SGD. We recommend using composer.optim.DecoupledSGDW as a workaround.
  • [PyTorch Issue] Distributed checkpointing using PyTorch DCP broke backwards compatibility. We have patched this using the following planner, but this may break custom planner loading.

2. New checkpointing APIs (#3447, #3474, #3488, #3452)

We've added new checkpointing APIs to download, upload, and load / save, so that checkpointing is usable outside of a Trainer object. We will be fully migrating to these new APIs in the next minor release.

3. Improved Auto-microbatching (#3510, #3522)

We've fixed deadlocks with auto-microbatching with FSDP, bringing throughput in line with manually setting the microbatch size. This is achieved through enabling sync hooks wherever a training run might OOM to find the correct microbatch size, and disabling these hooks for the rest of training.

Bug Fixes

1. Fix checkpoint symlink uploads (#3376)

Ensures that checkpoint files are uploaded before the symlink file, fixing errors with missing or incomplete checkpoints.

2. Optimizer tracks same parameters after FSDP wrapping (#3502)

When only a subset of parameters should be tracked by the optimizer, FSDP wrapping no longer interferes.

What's Changed

v0.23.5

03 Jul 02:08
56ccc2e

What's New

1. Variable length dataloaders (#3416)

Adds support for dataloaders with rank-dependent lengths. The solution terminates iteration for dataloaders on all ranks when the first dataloader finishes.
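The termination rule can be pictured with a single-process simulation, in which each inner list plays the role of one rank's dataloader. This is only a sketch: the lockstep_batches function is a hypothetical name, and a real distributed run would coordinate the stop signal across ranks with a collective operation instead of a shared Python loop.

```python
from typing import Iterable, Iterator, List


def lockstep_batches(per_rank_loaders: List[Iterable[str]]) -> Iterator[List[str]]:
    """Hypothetical helper: yield one batch per rank until any rank runs out."""
    iterators = [iter(loader) for loader in per_rank_loaders]
    while True:
        step = []
        for it in iterators:
            try:
                step.append(next(it))
            except StopIteration:
                # The first exhausted loader ends iteration for every rank.
                return
        yield step


# Rank 0 has three batches, rank 1 only two.
steps = list(lockstep_batches([["a1", "a2", "a3"], ["b1", "b2"]]))
print(steps)
```

Rank 0's third batch is never trained on, which is the cost of keeping all ranks in step.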

Bug Fixes

1. Remove close flush for mosaicml logger (#3446)

Previously, the MosaicML Logger sporadically raised an error while the Python interpreter was shutting down, because it attempted to flush data on Event.CLOSE using futures, which cannot be scheduled at that time. Now, we only block on finishing existing data uploads on Event.CLOSE and avoid scheduling new futures.

What's Changed

Full Changelog: v0.23.4...v0.23.5

v0.23.4

21 Jun 15:09

Bug Fixes

1. Patch PyTorch 2.3.1 (#3419)

Fixes missing import when monkeypatching device mesh functions in PyTorch 2.3.1. This is necessary for MoE training.

Full Changelog: v0.23.3...v0.23.4

v0.23.3

21 Jun 00:18
7c7f6de

New Features

1. Update MLflow logger to use the new API with time-dimension to view images in MLflow (#3286)

We've enhanced the MLflow logger's log_image function to use the new API with time-dimension support, enabling images to be viewed in MLflow.

2. Add logging buffer time to MLflow logger (#3401)

We've added the logging_buffer_seconds argument to the MLflow logger, which specifies how many seconds to buffer before sending logs to the MLflow tracking server.

Bug Fixes

1. Only require databricks-sdk when on Databricks platform (#3389)

Previously, the MLflow logger always imported databricks-sdk. Now, we only require the SDK when running on the Databricks platform and using Databricks secrets to access managed MLflow.

2. Skip extra dataset state load during job resumption (#3393)

Previously, when loading a checkpoint with train_dataloader, the dataset_state would load first, and if train_dataloader was set again afterward, load_state_dict would be called with a None value. Now, we've added a check in the train_dataloader setter to skip this redundant load.

3. Fix auto-microbatching on CUDA 12.4 (#3400)

In CUDA 12.4, the out-of-memory error message changed to "CUDA error: out of memory". Previously, our logic for device_train_microbatch_size="auto" only checked for the hardcoded string "CUDA out of memory". Now, we check for both "CUDA out of memory" and "CUDA error: out of memory".
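A check of this kind can be sketched in a few lines. The two message strings come from the note above; the is_cuda_oom function and the _OOM_MARKERS constant are hypothetical names for this illustration and are not Composer's internal API.

```python
# Both spellings of the out-of-memory message described above.
_OOM_MARKERS = ("CUDA out of memory", "CUDA error: out of memory")


def is_cuda_oom(error: BaseException) -> bool:
    """Hypothetical helper: return True if the exception text looks like a CUDA OOM."""
    message = str(error)
    return any(marker in message for marker in _OOM_MARKERS)


old_style = RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")
new_style = RuntimeError("CUDA error: out of memory")
unrelated = RuntimeError("shape mismatch")
print(is_cuda_oom(old_style), is_cuda_oom(new_style), is_cuda_oom(unrelated))
```

Matching on substrings keeps the check tolerant of any extra detail appended to the message.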

4. Fix MLflow logging to Databricks workspace file paths that start with the /Shared/ prefix (#3410)

Previously, when logging to MLflow on the Databricks platform, we prepended /Users/ to every user-provided logging path that did not already specify it. This included paths starting with /Shared/, which was incorrect because /Shared/ indicates a shared workspace. Now, the /Users/ prefix is not added to paths starting with /Shared/.
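The corrected rule might look roughly like the sketch below. The resolve_experiment_path function and its username parameter are assumptions made for this example; the note above only states that /Users/ is prepended, so the per-user path layout shown here is illustrative.

```python
def resolve_experiment_path(experiment_name: str, username: str) -> str:
    """Hypothetical helper: choose the workspace path for an MLflow experiment."""
    # Paths that already name a workspace location are left untouched.
    if experiment_name.startswith(("/Shared/", "/Users/")):
        return experiment_name
    # Assumption: other names are placed under the user's own folder.
    return "/Users/" + username + "/" + experiment_name.strip("/")


print(resolve_experiment_path("/Shared/team-runs/exp1", "alice@example.com"))
print(resolve_experiment_path("my-exp", "alice@example.com"))
```

Checking both prefixes in one startswith call keeps the shared-workspace case and the already-qualified case on the same code path.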

What's Changed

Full Changelog: v0.23.2...v0.23.3

v0.23.2

08 Jun 03:11

Bug Fixes

  • Fix backward compatibility issue caused by missing eval metrics class

What's Changed

  • Fix backward compatibility issue caused by missing eval metrics class by @bigning in #3385

Full Changelog: v0.23.1...release/v0.23.2