v0.23.3
New Features
1. Update MLflow logger to use the new API with time dimension to view images in MLflow (#3286)
We've enhanced the MLflow logger's `log_image` function to use the new API with time-dimension support, enabling images to be viewed in MLflow.
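Below is a minimal sketch of how images could be logged through Composer's MLflow integration so they appear in MLflow's image view; the callback, experiment name, and the exact `log_images` keywords are illustrative assumptions, not code from this PR.

```python
# A minimal sketch, not Composer's implementation: a callback that logs a batch
# of images at the end of each epoch. With the updated MLFlowLogger, images are
# logged with the training step as a time dimension, so they can be stepped
# through in the MLflow UI.
import numpy as np
from composer.core import Callback, State
from composer.loggers import Logger, MLFlowLogger


class LogSampleImages(Callback):

    def epoch_end(self, state: State, logger: Logger) -> None:
        images = np.random.rand(8, 32, 32, 3)  # random data, purely illustrative
        logger.log_images(images=images, name='sample_batch')


mlflow_logger = MLFlowLogger(experiment_name='image-logging-demo')  # illustrative name
# Pass `loggers=[mlflow_logger], callbacks=[LogSampleImages()]` to your Trainer.
```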
2. Add logging buffer time to MLflow logger (#3401)
We've added the `logging_buffer_seconds` argument to the MLflow logger, which specifies how many seconds to buffer before sending logs to the MLflow tracking server.
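For example, a sketch of enabling the buffer (the 10-second value and experiment name are illustrative):

```python
# A minimal sketch: buffer logs for 10 seconds before flushing them to the
# MLflow tracking server.
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(
    experiment_name='buffered-logging-demo',  # illustrative name
    logging_buffer_seconds=10,
)
# Pass `loggers=[mlflow_logger]` to your Trainer as usual.
```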
Bug Fixes
1. Only require databricks-sdk when on Databricks platform (#3389)
Previously, the MLflow logger always imported `databricks-sdk`. Now, the SDK is only required when running on the Databricks platform and using Databricks secrets to access managed MLflow.
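The general pattern looks roughly like the sketch below; the environment check and helper function are illustrative assumptions, not Composer's actual code.

```python
# Illustrative sketch of the lazy-import pattern: databricks-sdk is only
# imported when we are actually running on Databricks, so other users don't
# have to install it.
import os
from typing import Optional


def get_databricks_host_if_available() -> Optional[str]:
    # DATABRICKS_RUNTIME_VERSION is set inside Databricks runtimes; using it as
    # the platform check here is an assumption for this sketch.
    if os.environ.get('DATABRICKS_RUNTIME_VERSION') is None:
        return None
    # Deferred import: only required on the Databricks platform.
    from databricks.sdk import WorkspaceClient
    return WorkspaceClient().config.host
```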
2. Skip extra dataset state load during job resumption (#3393)
Previously, when loading a checkpoint with `train_dataloader`, the `dataset_state` would load first, and if `train_dataloader` was set again afterward, `load_state_dict` would be called with a `None` value. Now, we've added a check in the `train_dataloader` setter to skip this redundant load.
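A simplified, hypothetical sketch of such a setter guard (the class and attribute names are illustrative, not Composer's actual `State` implementation):

```python
# Hypothetical sketch: skip reapplying dataset state once it has already been
# consumed, so load_state_dict is never called with None.
class TrainingState:

    def __init__(self):
        self._train_dataloader = None
        self._pending_dataset_state = None  # state dict loaded from a checkpoint

    @property
    def train_dataloader(self):
        return self._train_dataloader

    @train_dataloader.setter
    def train_dataloader(self, dataloader):
        self._train_dataloader = dataloader
        # Only load dataset state if there is something left to apply; setting
        # the dataloader again later becomes a no-op for dataset state.
        if dataloader is not None and self._pending_dataset_state is not None:
            dataloader.dataset.load_state_dict(self._pending_dataset_state)
            self._pending_dataset_state = None
```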
3. Fix auto-microbatching on CUDA 12.4 (#3400)
In CUDA 12.4, the out-of-memory error message has changed to `CUDA error: out of memory`. Previously, our logic hardcoded checks for `CUDA out of memory` when using `device_train_microbatch_size="auto"`. Now, we check for both `CUDA out of memory` and `CUDA error: out of memory`.
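A minimal sketch of that kind of check (not the exact implementation) when deciding whether to retry with a smaller microbatch:

```python
# Treat either message as a CUDA out-of-memory error during auto-microbatching.
OOM_MESSAGES = ('CUDA out of memory', 'CUDA error: out of memory')


def is_cuda_oom(exception: RuntimeError) -> bool:
    message = str(exception)
    return any(oom in message for oom in OOM_MESSAGES)
```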
4. Fix MLflow logging to Databricks workspace file paths that start with the `/Shared/` prefix (#3410)
Previously, for MLflow logging on the Databricks platform, we prepended `/Users/` to any user-provided logging path that did not already specify it, including paths starting with `/Shared/`. This was incorrect, since `/Shared/` indicates a shared workspace. Now, the `/Users/` prepend is skipped for paths starting with `/Shared/`.
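An illustrative sketch of the path handling (the function name and exact rules are assumptions for this sketch):

```python
def normalize_databricks_path(path: str) -> str:
    # Leave already-anchored workspace paths alone; /Shared/ is a shared
    # workspace, so it should not be nested under /Users/.
    if path.startswith('/Users/') or path.startswith('/Shared/'):
        return path
    return '/Users/' + path.lstrip('/')


assert normalize_databricks_path('/Shared/team-runs/run1') == '/Shared/team-runs/run1'
assert normalize_databricks_path('alice@example.com/run1') == '/Users/alice@example.com/run1'
```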
What's Changed
- Bump CI from 0.0.7 to 0.0.8 by @KuuCi in #3383
- Fix backward compatibility caused by missing eval metrics class by @bigning in #3385
- Bump version v0.23.2 by @bigning in #3386
- Restore dev version by @bigning in #3388
- Only requires `databricks-sdk` when inside the Databricks platform by @antoinebrl in #3389
- Update packaging requirement from <24.1,>=21.3.0 to >=21.3.0,<24.2 by @dependabot in #3392
- Bump cryptography from 42.0.6 to 42.0.8 by @dependabot in #3391
- Skip extra dataset state load by @mvpatel2000 in #3393
- Remove FSDP restriction from PyTorch 1.13 by @mvpatel2000 in #3395
- Check for 'CUDA error: out of memory' when auto-microbatching by @JAEarly in #3400
- Add tokens to iterations by @b-chu in #3374
- Busy wait utils in dist by @dakinggg in #3396
- Add buffering time to mlflow logger by @chenmoneygithub in #3401
- Add missing import for PyTorch 2.3.1 device mesh slicing by @mvpatel2000 in #3402
- Add pynvml to mlflow dep group by @dakinggg in #3404
- min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged by @JackZ-db in #3373
- Simplify launcher world size parsing by @mvpatel2000 in #3398
- Optionally use `flash-attn`'s CE loss for metrics by @snarayan21 in #3394
- log image fix by @jessechancy in #3286
- [ckpt-rewr] Save state dict API by @eracah in #3372
- Revert "Optionally use `flash-attn`'s CE loss for metrics (#3394)" by @snarayan21 in #3408
- CPU tests image fix by @snarayan21 in #3409
- Add setter for epoch in iteration by @b-chu in #3407
- Move pillow dep as required by @mvpatel2000 in #3412
- fixing mlflow logging to Databricks workspace file paths with /Shared/ prefix by @JackZ-db in #3410
- Bump version v0.23.3 by @karan6181 in #3414
New Contributors
Full Changelog: v0.23.2...v0.23.3