Releases: mosaicml/composer
v0.27.0
What's New
1. Torch 2.5.1 Compatibility (#3701)
We've added support for torch 2.5.1, including checkpointing bug fixes from PyTorch.
2. Add batch/microbatch transforms (#3703)
Sped up data transformations by applying batch transforms on CPU and microbatch transforms on GPU.
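The split above can be sketched in plain Python. This is an illustrative sketch of the ordering only, not Composer's implementation; `process_batch` and `to_device` are hypothetical names, and the real code operates on tensors rather than lists.

```python
def process_batch(batch, microbatch_size, batch_transform, microbatch_transform, to_device):
    batch = batch_transform(batch)  # applied once to the whole batch, on CPU
    microbatches = [
        batch[i:i + microbatch_size] for i in range(0, len(batch), microbatch_size)
    ]
    # each microbatch is moved to the accelerator, then transformed there
    return [microbatch_transform(to_device(mb)) for mb in microbatches]

# toy usage: the "device move" is a no-op stand-in for a host-to-device copy
out = process_batch(
    list(range(8)),
    4,
    batch_transform=lambda b: [x * 2 for x in b],
    microbatch_transform=lambda mb: [x + 1 for x in mb],
    to_device=lambda mb: mb,
)
# out == [[1, 3, 5, 7], [9, 11, 13, 15]]
```

The cheap whole-batch work runs once on CPU, while per-microbatch work runs after each slice reaches the accelerator.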
Deprecations and Breaking Changes
1. MLFlow Metrics Deduplication (#3678)
We added a metric deduplication feature for the MLflow logger in Composer. A metric that is unchanged since the last logged step is skipped, unless the run of skipped duplicates reaches a configurable threshold (every 100th duplicated step by default), at which point it is logged anyway. This reduces logging storage by eliminating redundant entries while still sampling periodically.
Example:
MLFlowLogger(..., log_duplicated_metric_every_n_steps=100)
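The deduplication rule can be sketched as a small decision function. This is an illustrative sketch of the behavior described above, not Composer's actual code; `should_log` is a hypothetical name.

```python
def should_log(value, last_value, dup_count, every_n=100):
    # log when the value changed; otherwise count duplicates and
    # force a log every every_n consecutive duplicated steps
    if value != last_value:
        return True, 0
    dup_count += 1
    return dup_count % every_n == 0, dup_count

# a metric stuck at 1.0 for 250 steps is logged at steps 0, 100, and 200
logged_steps = []
last, dups = None, 0
for step in range(250):
    do_log, dups = should_log(1.0, last, dups)
    if do_log:
        logged_steps.append(step)
    last = 1.0
```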
What's Changed
- Metrics dedup for MLflow logger by @chenmoneygithub in #3678
- Bump databricks-sdk from 0.33.0 to 0.36.0 by @dependabot in #3686
- Update pillow requirement from <11,>=10.3.0 to >=10.3.0,<12 by @dependabot in #3684
- Lower min torchmetrics version by @mvpatel2000 in #3691
- Private link error handling by @nancyhung in #3689
- Update checkpoint tests to use new version 0.26.0 by @irenedea in #3683
- Bump coverage[toml] from 7.6.3 to 7.6.4 by @dependabot in #3694
- Pin checkpoint state dict flattening patch by @b-chu in #3700
- Torch bump to 2.5.1 by @mvpatel2000 in #3701
- Fix typo in trainer doc by @XiaohanZhangCMU in #3702
- Update packaging requirement from <24.2,>=21.3.0 to >=21.3.0,<24.3 by @dependabot in #3707
- Update torchmetrics requirement from <1.4.1,>=1.0 to >=1.0,<1.5.3 by @dependabot in #3706
- Add batch/microbatch transforms by @mvpatel2000 in #3703
- Bump version to 0.28.0.dev0 by @j316chuck in #3709
- Add torch 2.5.1 composer tests by @j316chuck in #3710
Full Changelog: v0.26.1...v0.27.0
v0.26.1
v0.26.0
What's New
1. Torch 2.5.0 Compatibility (#3609)
We've added support for torch 2.5.0, including necessary patches to Torch.
Deprecations and Breaking Changes
1. FSDP Configuration Changes (#3681)
We no longer support passing fsdp_config and fsdp_auto_wrap directly to Trainer. If you'd like to specify an FSDP config and configure FSDP auto wrapping, use parallelism_config instead:
trainer = Trainer(
    parallelism_config={
        'fsdp': {
            'auto_wrap': True,
            ...
        }
    }
)
2. Removal of Pytorch Legacy Sharded Checkpoint Support (#3631)
PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have removed support for this format. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.
What's Changed
- Add backward compatibility checkpoint tests for v0.25.0 by @dakinggg in #3635
- Don't use TP when tensor_parallel_degree is 1 by @eitanturok in #3636
- Update huggingface-hub requirement from <0.25,>=0.21.2 to >=0.21.2,<0.26 by @dependabot in #3637
- Update transformers requirement from !=4.34.0,<4.45,>=4.11 to >=4.11,!=4.34.0,<4.46 by @dependabot in #3638
- Bump databricks-sdk from 0.32.0 to 0.33.0 by @dependabot in #3639
- Remove Legacy Checkpointing by @mvpatel2000 in #3631
- Surface UC permission error by @b-chu in #3642
- Tensor Parallelism Tests by @eitanturok in #3620
- Switch to log.info for deterministic mode by @mvpatel2000 in #3643
- Update pre-commit requirement from <4,>=3.4.0 to >=3.4.0,<5 by @dependabot in #3645
- Update peft requirement from <0.13,>=0.10.0 to >=0.10.0,<0.14 by @dependabot in #3646
- Create callback to load checkpoint by @irenedea in #3641
- Bump jupyter from 1.0.0 to 1.1.1 by @dependabot in #3595
- Fix DB SDK Import by @mvpatel2000 in #3648
- Bump coverage[toml] from 7.6.0 to 7.6.3 by @dependabot in #3651
- Bump pypandoc from 1.13 to 1.14 by @dependabot in #3652
- Replace list with Sequence by @KuuCi in #3654
- Add better error handling for non-rank 0 during Monolithic Checkpoint Loading by @j316chuck in #3647
- Raising a better warning if train or eval did not process any data. by @ethantang-db in #3656
- Fix Logo by @XiaohanZhangCMU in #3659
- Update huggingface-hub requirement from <0.26,>=0.21.2 to >=0.21.2,<0.27 by @dependabot in #3668
- Bump cryptography from 42.0.8 to 43.0.3 by @dependabot in #3667
- Bump pytorch to 2.5.0 by @b-chu in #3663
- Don't overwrite sys.excepthook in mlflow logger by @dakinggg in #3675
- Fix pull request target by @b-chu in #3676
- Use a temp path to save local checkpoints for remote save path by @irenedea in #3673
- Loss gen tokens by @dakinggg in #3677
- Refactor maybe_create_object_store_from_uri by @irenedea in #3679
- Don't error if some batch slice has no loss generating tokens by @dakinggg in #3682
- Bump version to 0.27.0.dev0 by @irenedea in #3681
New Contributors
- @ethantang-db made their first contribution in #3656
Full Changelog: v0.25.0...v0.26.0
v0.25.0
What's New
1. Torch 2.4.1 Compatibility (#3609)
We've added support for torch 2.4.1, including necessary patches to Torch.
Deprecations and Breaking Changes
1. Microbatch device movement (#3567)
Instead of moving the entire batch to device at once, we now move each microbatch to device. This saves memory for large inputs, e.g. multimodal data, when training with many microbatches.
This change may affect certain callbacks that run operations on the batch and require it to be moved to an accelerator ahead of time, such as the two changed in this PR. There shouldn't be many such callbacks, so we anticipate this change will be relatively safe.
2. DeepSpeed deprecation version (#3634)
We have updated the Composer version in which we will remove support for DeepSpeed to 0.27.0. Please reach out on GitHub if you have any concerns about this.
3. PyTorch legacy sharded checkpoint format
PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have continued to support loading legacy format checkpoints for a while, but we will likely be removing support for this format entirely in an upcoming release. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.
What's Changed
- Set dev version back to 0.25.0.dev0 by @snarayan21 in #3582
- Microbatch Device Movement by @mvpatel2000 in #3567
- Init Dist Default None by @mvpatel2000 in #3585
- Explicit None Check in get_device by @mvpatel2000 in #3586
- Update protobuf requirement from <5.28 to <5.29 by @dependabot in #3591
- Bump databricks-sdk from 0.30.0 to 0.31.1 by @dependabot in #3592
- Update ci-testing to 0.2.2 by @dakinggg in #3590
- Bump Mellanox Tools by @mvpatel2000 in #3597
- Roll back ci-testing for daillies by @mvpatel2000 in #3598
- Revert driver changes by @mvpatel2000 in #3599
- Remove step in log_image for MLFlow by @mvpatel2000 in #3601
- Reduce system metrics logging frequency by @chenmoneygithub in #3604
- Bump databricks-sdk from 0.31.1 to 0.32.0 by @dependabot in #3608
- torch2.4.1 by @bigning in #3609
- Test with torch2.4.1 image by @bigning in #3610
- fix 2.4.1 test by @bigning in #3612
- Remove tensor option for _global_exception_occured by @irenedea in #3611
- Update error message for overwrite to be more user friendly by @mvpatel2000 in #3619
- Update wandb requirement from <0.18,>=0.13.2 to >=0.13.2,<0.19 by @dependabot in #3615
- Fix RNG key checking by @dakinggg in #3623
- Update datasets requirement from <3,>=2.4 to >=2.4,<4 by @dependabot in #3626
- Disable exceptions for MosaicML Logger by @mvpatel2000 in #3627
- Fix CPU dailies by @mvpatel2000 in #3628
- fix 2.4.1ckpt by @bigning in #3629
- More checkpoint debug logs by @mvpatel2000 in #3632
- Lower DeepSpeed deprecation version by @mvpatel2000 in #3634
- Bump version 25 by @dakinggg in #3633
Full Changelog: v0.24.1...v0.25.0
v0.24.1
Bug Fixes
1. Disallow passing device_mesh to FSDPConfig (#3580)
Explicitly errors if device_mesh is passed to FSDPConfig. This completes the deprecation from v0.24.0 and also addresses cases where a user specified a device mesh but it was ignored, leading to training with the incorrect parallelism style (e.g., using FSDP instead of HSDP).
What's Changed
- Bump main version to 0.25.0.dev0 by @snarayan21 in #3573
- update daily by @KevDevSha in #3572
- Bump pandoc from 2.3 to 2.4 by @dependabot in #3575
- Update transformers requirement from !=4.34.0,<4.44,>=4.11 to >=4.11,!=4.34.0,<4.45 by @dependabot in #3574
- Checkpoint backwards compatibility tests for v0.24.0 by @snarayan21 in #3579
- Error if device mesh specified in fsdp config by @snarayan21 in #3580
- Bump version to 0.24.1. by @snarayan21 in #3581
Full Changelog: v0.24.0...v0.24.1
v0.24.0
What's New
1. Torch 2.4 Compatibility (#3542, #3549, #3553, #3552, #3565)
Composer now supports Torch 2.4! We are tracking a few issues with the latest PyTorch we have raised with the PyTorch team related to checkpointing:
- [PyTorch Issue] Distributed checkpointing using PyTorch DCP has issues with stateless optimizers, e.g. SGD. We recommend using composer.optim.DecoupledSGDW as a workaround.
- [PyTorch Issue] Distributed checkpointing using PyTorch DCP broke backwards compatibility. We have patched this using the following planner, but this may break custom planner loading.
2. New checkpointing APIs (#3447, #3474, #3488, #3452)
We've added new checkpointing APIs to download, upload, and load / save, so that checkpointing is usable outside of a Trainer object. We will be fully migrating to these new APIs in the next minor release.
3. Improved Auto-microbatching (#3510, #3522)
We've fixed deadlocks in auto-microbatching with FSDP, bringing throughput in line with manually setting the microbatch size. This is achieved by enabling sync hooks wherever a training run might OOM while searching for the correct microbatch size, and disabling these hooks for the rest of training.
Bug Fixes
1. Fix checkpoint symlink uploads (#3376)
Ensures that checkpoint files are uploaded before the symlink file, fixing errors with missing or incomplete checkpoints.
2. Optimizer tracks same parameters after FSDP wrapping (#3502)
When only a subset of parameters should be tracked by the optimizer, FSDP wrapping will now not interfere.
What's Changed
- Bump ipykernel from 6.29.2 to 6.29.5 by @dependabot in #3459
- Update torchmetrics requirement from <1.3.3,>=0.10.0 to >=1.4.0.post0,<1.4.1 by @dependabot in #3460
- [Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload by @bigning in #3376
- Bump databricks-sdk from 0.28.0 to 0.29.0 by @dependabot in #3456
- Remove Log Exception by @jjanezhang in #3464
- Corrected docs for MFU in SpeedMonitor by @JackZ-db in #3469
- [checkpoint v2] Download api by @bigning in #3447
- Upload api by @bigning in #3474
- [Checkpoint V2] Upload API by @bigning in #3488
- Load api by @eracah in #3452
- Add helpful comment explaining HSDP initialization seeding by @mvpatel2000 in #3470
- Add fit start to mosaicmllogger by @ethanma-db in #3467
- Remove OOM-Driven FSDP Deadlocks and Increase Throughput of Automicrobatching by @JackZ-db in #3510
- Move hooks and fsdp modules onto state rather than trainer by @JackZ-db in #3522
- Bump coverage[toml] from 7.5.4 to 7.6.0 by @dependabot in #3471
- revert a wip PR by @bigning in #3475
- Change FP8 Eval to default to activation dtype by @j316chuck in #3454
- Get a shared file system safe signal file name by @dakinggg in #3485
- Bumping flash attention version to v2.6.2 by @ShashankMosaicML in #3489
- Bump to Pytorch 2.4 by @mvpatel2000 in #3542
- Add Torch 2.4 Tests by @mvpatel2000 in #3549
- Fix torch 2.4 images for tests by @snarayan21 in #3553
- Fix torch 2.4 tests by @mvpatel2000 in #3552
- Fix bug when subset of model parameters is passed into optimizer with FSDP by @sashaDoubov in #3502
- Correctly process parallelism_config['tp'] when it's a dict by @snarayan21 in #3434
- [torch2.4] Fix sharded checkpointing backward compatibility issue by @bigning in #3565
- [fix-daily] Use composer get_model_state_dict instead of torch's by @eracah in #3492
- Load Microbatches instead of Entire Batches to GPU by @JackZ-db in #3487
- Make Pytest log in color in Github Action by @eitanturok in #3505
- Revert "Load Microbatches instead of Entire Batches to GPU " by @JackZ-db in #3508
- Bump transformers version by @dakinggg in #3511
- Fix FSDP Config Validation by @mvpatel2000 in #3530
- Add FSDP input validation for use_orig_params and activation_cpu_offload flag by @j316chuck in #3515
- Fix checkpoint events by @b-chu in #3468
- Patch conf.py for readthedocs sphinx injection deprecation. by @mvpatel2000 in #3491
- save load path in state and pass to mosaicmllogger by @ethanma-db in #3506
- Disable gcs azure daily test by @bigning in #3514
- Update huggingface-hub requirement from <0.24,>=0.21.2 to >=0.21.2,<0.25 by @dependabot in #3481
- restore version on dev by @XiaohanZhangCMU in #3451
- Deprecate deepspeed by @dakinggg in #3512
- Update importlib-metadata requirement from <7,>=5.0.0 to >=5.0.0,<9 by @dependabot in #3519
- Update peft requirement from <0.12,>=0.10.0 to >=0.10.0,<0.13 by @dependabot in #3518
- Use gloo as part of DeviceGPU's process group backend by @snarayan21 in #3509
- Add a monitor of mlflow logger so that it sets run status as failed if main thread exits unexpectedly by @chenmoneygithub in #3449
- Revert "Use gloo as part of DeviceGPU's process group backend (#3509)" by @snarayan21 in #3523
- Fix autoresume docstring (save_overwrite) by @eracah in #3526
- Unpin pip by @dakinggg in #3524
- hasattr check for Wandb 0.17.6 by @mvpatel2000 in #3531
- Remove dev on github workflows by @mvpatel2000 in #3536
- Remove dev branch in GPU workflows by @mvpatel2000 in #3539
- restore google cloud object store test by @bigning in #3538
- Update moto[s3] requirement from <5,>=4.0.1 to >=4.0.1,<6 by @dependabot in #3516
- use s3 boto3 Adaptive retry as default retry mode by @bigning in #3543
- Use python 3.11 in GAs by @eitanturok in #3529
- Implement ruff rules enforcing pep 585 by @snarayan21 in #3551
- Update numpy requirement from <2.1.0,>=1.21.5 to >=1.21.5,<2.2.0 by @dependabot in #3556
- Bump databricks-sdk from 0.29.0 to 0.30.0 by @dependabot in #3559
- Update Optim to DecoupledSGD in Notebooks by @mvpatel2000 in #3554
- Remove lambda code eval testing by @mvpatel2000 in #3560
- Restore Azure Tests by @mvpatel2000 in #3561
- Remove tokens for to_next_epoch by @mvpatel2000 in #3562
- Change iteration timestamp for old checkpoints by @b-chu in #3563
- Fix typo in composer_collect_env by @dakinggg in #3566
- Add default value to get_device() by @coryMosaicML in #3568
- add ghcr and update build matrix generator by @KevDevSha in #3465
- Bump aws_ofi_nccl to 1.11.0 by @willgleich in #3569
- allow listed runners by @KevDevSha in #3486
- fix runner linux-ubuntu > ubuntu-latest by @KevDevSha in #3571
- Bump version to v0.24.0 + deprecations by @snarayan21 in https://github.co...
v0.23.5
What's New
1. Variable length dataloaders (#3416)
Adds support for dataloaders with rank-dependent lengths. The solution terminates iteration for dataloaders on all ranks when the first dataloader finishes.
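The termination rule can be sketched as follows. This is a single-process simulation of the behavior described above, not Composer's implementation; in practice the "still has data" signal is combined across ranks with a distributed collective, and `joint_iterate` is a hypothetical name.

```python
def joint_iterate(per_rank_batches):
    # simulate one dataloader per rank; stop everywhere once any rank runs out
    iters = [iter(b) for b in per_rank_batches]
    while True:
        step = []
        for it in iters:
            try:
                step.append(next(it))
            except StopIteration:
                return  # the first exhausted rank ends iteration for every rank
        yield step

# rank 1 has only two batches, so iteration stops after two joint steps
steps = list(joint_iterate([[1, 2, 3], [10, 20]]))
# steps == [[1, 10], [2, 20]]
```

This keeps all ranks in lockstep, so no rank blocks in a collective waiting for a peer that has already finished its data.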
Bug Fixes
1. Remove close flush for mosaicml logger (#3446)
Previously, the MosaicML Logger sporadically raised an error when the Python interpreter was shutting down as it attempted to flush data on Event.CLOSE using futures, which cannot be scheduled at that time. Instead, we now only block on finishing existing data uploads on Event.CLOSE, avoiding scheduling new futures.
What's Changed
- Update numpy requirement from <1.27.0,>=1.21.5 to >=1.21.5,<2.1.0 by @dependabot in #3406
- Restore dev version by @karan6181 in #3417
- Save checkpoint to disk for API with new save layout by @eracah in #3399
- Patch PyTorch 2.3.1 by @mvpatel2000 in #3419
- Fixes some typing issues by @dakinggg in #3418
- Fix style by @b-chu in #3420
- Bump coverage[toml] from 7.5.3 to 7.5.4 by @dependabot in #3422
- Update psutil requirement from <6,>=5.8.0 to >=5.8.0,<7 by @dependabot in #3424
- Add support for variable length dataloaders in DDP by @JAEarly in #3416
- Hsdp + MoE CI tests by @KuuCi in #3378
- Bumping MLflow version to 2.14.1 by @JackZ-db in #3425
- Skip HSDP + TP pytests that require torch 2.3 or above by @KuuCi in #3426
- Remove CodeQL workflow by @mvpatel2000 in #3429
- Remove save overwrite by @mvpatel2000 in #3431
- Fixes to TP Docs by @snarayan21 in #3430
- Lower the system metrics logging frequency to reduce MLflow server's load by @chenmoneygithub in #3436
- Update paramiko requirement from <3,>=2.11.0 to >=3.4.0,<4 by @dependabot in #3439
- Bump CI testing version by @mvpatel2000 in #3433
- Fix docstring for EVAL_AFTER_ALL/EVAL_BEFORE_ALL by @mvpatel2000 in #3445
- Remove close flush for mosaicml logger by @mvpatel2000 in #3446
- Remove MosaicMLLambdaEvalClient by @aspfohl in #3432
- Relax hf hub pin by @dakinggg in #3435
- Pytest skip 2 by @KuuCi in #3448
- bump version v0.23.5 by @XiaohanZhangCMU in #3450
Full Changelog: v0.23.4...v0.23.5
v0.23.4
Bug Fixes
1. Patch PyTorch 2.3.1 (#3419)
Fixes missing import when monkeypatching device mesh functions in PyTorch 2.3.1. This is necessary for MoE training.
Full Changelog: v0.23.3...v0.23.4
v0.23.3
New Features
1. Update mlflow logger to use the new API with time-dimension to view images in MLFlow (#3286)
We've enhanced the MLflow logger's log_image function to use the new API with time-dimension support, enabling images to be viewed in MLflow.
2. Add logging buffer time to MLflow logger (#3401)
We've added the logging_buffer_seconds argument to the MLflow logger, which specifies how many seconds to buffer logs before sending them to the MLflow tracking server.
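A hypothetical usage, mirroring the example style above (the exact logger signature may differ; the `...` stands for whatever other arguments your setup needs):

```python
# buffer metrics for up to 10 seconds before each send to the tracking server
logger = MLFlowLogger(..., logging_buffer_seconds=10)
```

Larger buffers mean fewer, bigger requests to the tracking server at the cost of less immediate metric visibility.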
Bug Fixes
1. Only require databricks-sdk when on the Databricks platform (#3389)
Previously, MLflow always imported the databricks-sdk. Now, we only require the SDK when on the Databricks platform and using Databricks secrets to access managed MLflow.
2. Skip extra dataset state load during job resumption (#3393)
Previously, when loading a checkpoint with train_dataloader, the dataset_state would load first, and if train_dataloader was set again afterward, load_state_dict would be called with a None value. Now, we've added a check in the train_dataloader setter to skip this redundant load.
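The guard can be sketched with a minimal setter. All names here (`State`, `load_dataset_state`, `loads`) are hypothetical stand-ins for illustration; the real check lives in Composer's train_dataloader setter.

```python
class State:
    """Toy model of the fix: only replay dataset state if one actually exists."""

    def __init__(self):
        self._train_dataloader = None
        self.dataset_state = None
        self.loads = 0  # counts how many times dataset state was (re)loaded

    def load_dataset_state(self, state):
        self.loads += 1

    @property
    def train_dataloader(self):
        return self._train_dataloader

    @train_dataloader.setter
    def train_dataloader(self, dl):
        self._train_dataloader = dl
        # the guard: skip the load entirely when there is no saved state,
        # instead of calling load with a None value
        if dl is not None and self.dataset_state is not None:
            self.load_dataset_state(self.dataset_state)
```

Setting the dataloader before any dataset state exists is now a no-op, while setting it after a checkpoint load replays the saved state exactly once.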
3. Fix auto-microbatching on CUDA 12.4 (#3400)
In CUDA 12.4, the out-of-memory error message has changed to CUDA error: out of memory. Previously, our logic hardcoded checks for CUDA out of memory when using device_train_microbatch_size="auto". Now, we check for both CUDA out of memory and CUDA error: out of memory.
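The broadened check amounts to matching either message substring. A minimal sketch, using the two strings from the release note (`is_cuda_oom` is a hypothetical name, not Composer's actual helper):

```python
# both spellings of the CUDA OOM message, pre- and post-CUDA 12.4
OOM_PATTERNS = ('CUDA out of memory', 'CUDA error: out of memory')

def is_cuda_oom(exc: RuntimeError) -> bool:
    """Return True if the exception looks like a CUDA out-of-memory error."""
    msg = str(exc)
    return any(pattern in msg for pattern in OOM_PATTERNS)
```

With device_train_microbatch_size="auto", a positive match triggers a retry at a smaller microbatch size instead of crashing the run.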
4. Fix MLflow logging to Databricks workspace file paths that start with the /Shared/ prefix (#3410)
Previously, for MLflow logging, we prepended the path /Users/ to all user-provided logging paths on the Databricks platform, if not specified, including paths starting with /Shared/, which was incorrect since /Shared/ indicates a shared workspace. Now, the /Users/ prepend is skipped for paths starting with /Shared/.
What's Changed
- Bump CI from 0.0.7 to 0.0.8 by @KuuCi in #3383
- Fix backward compatibility caused by missing eval metrics class by @bigning in #3385
- Bump version v0.23.2 by @bigning in #3386
- Restore dev version by @bigning in #3388
- Only requires databricks-sdk when inside the Databricks platform by @antoinebrl in #3389
- Update packaging requirement from <24.1,>=21.3.0 to >=21.3.0,<24.2 by @dependabot in #3392
- Bump cryptography from 42.0.6 to 42.0.8 by @dependabot in #3391
- Skip extra dataset state load by @mvpatel2000 in #3393
- Remove FSDP restriction from PyTorch 1.13 by @mvpatel2000 in #3395
- Check for 'CUDA error: out of memory' when auto-microbatching by @JAEarly in #3400
- Add tokens to iterations by @b-chu in #3374
- Busy wait utils in dist by @dakinggg in #3396
- Add buffering time to mlflow logger by @chenmoneygithub in #3401
- Add missing import for PyTorch 2.3.1 device mesh slicing by @mvpatel2000 in #3402
- Add pynvml to mlflow dep group by @dakinggg in #3404
- min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged by @JackZ-db in #3373
- Simplify launcher world size parsing by @mvpatel2000 in #3398
- Optionally use flash-attn's CE loss for metrics by @snarayan21 in #3394
- log image fix by @jessechancy in #3286
- [ckpt-rewr] Save state dict API by @eracah in #3372
- Revert "Optionally use flash-attn's CE loss for metrics (#3394)" by @snarayan21 in #3408
- CPU tests image fix by @snarayan21 in #3409
- Add setter for epoch in iteration by @b-chu in #3407
- Move pillow dep as required by @mvpatel2000 in #3412
- fixing mlflow logging to Databricks workspace file paths with /Shared/ prefix by @JackZ-db in #3410
- Bump version v0.23.3 by @karan6181 in #3414
Full Changelog: v0.23.2...v0.23.3