v0.23.3
New Features
1. Update MLflow logger to use the new API with time dimension to view images in MLflow (#3286)
We've enhanced the MLflow logger's `log_image` function to use the new API with time-dimension support, enabling images to be viewed in MLflow.
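Below is a minimal sketch of how images could be logged through Composer's MLflow integration so they appear in MLflow's image view; the callback, experiment name, and the exact `log_images` keywords are illustrative assumptions, not code from this PR.

```python
# A minimal sketch, not Composer's implementation: a callback that logs a batch
# of images at the end of each epoch. With the updated MLFlowLogger, images are
# logged with the training step as a time dimension, so they can be stepped
# through in the MLflow UI.
import numpy as np
from composer.core import Callback, State
from composer.loggers import Logger, MLFlowLogger


class LogSampleImages(Callback):

    def epoch_end(self, state: State, logger: Logger) -> None:
        images = np.random.rand(8, 32, 32, 3)  # random data, purely illustrative
        logger.log_images(images=images, name='sample_batch')


mlflow_logger = MLFlowLogger(experiment_name='image-logging-demo')  # illustrative name
# Pass `loggers=[mlflow_logger], callbacks=[LogSampleImages()]` to your Trainer.
```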
2. Add logging buffer time to MLflow logger (#3401)
We've added the `logging_buffer_seconds` argument to the MLflow logger, which specifies how many seconds to buffer before sending logs to the MLflow tracking server.
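For example, a sketch of enabling the buffer (the 10-second value and experiment name are illustrative):

```python
# A minimal sketch: buffer logs for 10 seconds before flushing them to the
# MLflow tracking server.
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(
    experiment_name='buffered-logging-demo',  # illustrative name
    logging_buffer_seconds=10,
)
# Pass `loggers=[mlflow_logger]` to your Trainer as usual.
```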
Bug Fixes
1. Only require databricks-sdk when on Databricks platform (#3389)
Previously, the MLflow logger always imported `databricks-sdk`. Now, the SDK is only required when running on the Databricks platform and using Databricks secrets to access managed MLflow.
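The general pattern looks roughly like the sketch below; the environment check and helper function are illustrative assumptions, not Composer's actual code.

```python
# Illustrative sketch of the lazy-import pattern: databricks-sdk is only
# imported when we are actually running on Databricks, so other users don't
# have to install it.
import os
from typing import Optional


def get_databricks_host_if_available() -> Optional[str]:
    # DATABRICKS_RUNTIME_VERSION is set inside Databricks runtimes; using it as
    # the platform check here is an assumption for this sketch.
    if os.environ.get('DATABRICKS_RUNTIME_VERSION') is None:
        return None
    # Deferred import: only required on the Databricks platform.
    from databricks.sdk import WorkspaceClient
    return WorkspaceClient().config.host
```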
2. Skip extra dataset state load during job resumption (#3393)
Previously, when loading a checkpoint with `train_dataloader`, the `dataset_state` would load first, and if `train_dataloader` was set again afterward, `load_state_dict` would be called with a `None` value. Now, we've added a check in the `train_dataloader` setter to skip this redundant load.
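A simplified, hypothetical sketch of such a setter guard (the class and attribute names are illustrative, not Composer's actual `State` implementation):

```python
# Hypothetical sketch: skip reapplying dataset state once it has already been
# consumed, so load_state_dict is never called with None.
class TrainingState:

    def __init__(self):
        self._train_dataloader = None
        self._pending_dataset_state = None  # state dict loaded from a checkpoint

    @property
    def train_dataloader(self):
        return self._train_dataloader

    @train_dataloader.setter
    def train_dataloader(self, dataloader):
        self._train_dataloader = dataloader
        # Only load dataset state if there is something left to apply; setting
        # the dataloader again later becomes a no-op for dataset state.
        if dataloader is not None and self._pending_dataset_state is not None:
            dataloader.dataset.load_state_dict(self._pending_dataset_state)
            self._pending_dataset_state = None
```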
3. Fix auto-microbatching on CUDA 12.4 (#3400)
In CUDA 12.4, the out-of-memory error message has changed to `CUDA error: out of memory`. Previously, our logic hardcoded checks for `CUDA out of memory` when using `device_train_microbatch_size="auto"`. Now, we check for both `CUDA out of memory` and `CUDA error: out of memory`.
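A minimal sketch of that kind of check (not the exact implementation) when deciding whether to retry with a smaller microbatch:

```python
# Treat either message as a CUDA out-of-memory error during auto-microbatching.
OOM_MESSAGES = ('CUDA out of memory', 'CUDA error: out of memory')


def is_cuda_oom(exception: RuntimeError) -> bool:
    message = str(exception)
    return any(oom in message for oom in OOM_MESSAGES)
```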
4. Fix MLflow logging to Databricks workspace file paths that start with the `/Shared/` prefix (#3410)
Previously, for MLflow logging on the Databricks platform, we prepended `/Users/` to any user-provided logging path that did not already specify it, including paths starting with `/Shared/`. This was incorrect, since `/Shared/` indicates a shared workspace. Now, the `/Users/` prepend is skipped for paths starting with `/Shared/`.
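An illustrative sketch of the path handling (the function name and exact rules are assumptions for this sketch):

```python
def normalize_databricks_path(path: str) -> str:
    # Leave already-anchored workspace paths alone; /Shared/ is a shared
    # workspace, so it should not be nested under /Users/.
    if path.startswith('/Users/') or path.startswith('/Shared/'):
        return path
    return '/Users/' + path.lstrip('/')


assert normalize_databricks_path('/Shared/team-runs/run1') == '/Shared/team-runs/run1'
assert normalize_databricks_path('alice@example.com/run1') == '/Users/alice@example.com/run1'
```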
What's Changed
- Bump CI from 0.0.7 to 0.0.8 by @KuuCi in #3383
- Fix backward compatibility caused by missing eval metrics class by @bigning in #3385
- Bump version v0.23.2 by @bigning in #3386
- Restore dev version by @bigning in #3388
- Only requires `databricks-sdk` when inside the Databricks platform by @antoinebrl in #3389
- Update packaging requirement from <24.1,>=21.3.0 to >=21.3.0,<24.2 by @dependabot in #3392
- Bump cryptography from 42.0.6 to 42.0.8 by @dependabot in #3391
- Skip extra dataset state load by @mvpatel2000 in #3393
- Remove FSDP restriction from PyTorch 1.13 by @mvpatel2000 in #3395
- Check for 'CUDA error: out of memory' when auto-microbatching by @JAEarly in #3400
- Add tokens to iterations by @b-chu in #3374
- Busy wait utils in dist by @dakinggg in #3396
- Add buffering time to mlflow logger by @chenmoneygithub in #3401
- Add missing import for PyTorch 2.3.1 device mesh slicing by @mvpatel2000 in #3402
- Add pynvml to mlflow dep group by @dakinggg in #3404
- min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged by @JackZ-db in #3373
- Simplify launcher world size parsing by @mvpatel2000 in #3398
- Optionally use `flash-attn`'s CE loss for metrics by @snarayan21 in #3394
- log image fix by @jessechancy in #3286
- [ckpt-rewr] Save state dict API by @eracah in #3372
- Revert "Optionally use `flash-attn`'s CE loss for metrics (#3394)" by @snarayan21 in #3408
- CPU tests image fix by @snarayan21 in #3409
- Add setter for epoch in iteration by @b-chu in #3407
- Move pillow dep as required by @mvpatel2000 in #3412
- fixing mlflow logging to Databricks workspace file paths with /Shared/ prefix by @JackZ-db in #3410
- Bump version v0.23.3 by @karan6181 in #3414
New Contributors
Full Changelog: v0.23.2...v0.23.3