Releases: mosaicml/composer
v0.19.0
What's New
1. Improved DTensor Support
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
2. Checkpoint Saving and Loading from Databricks MLFlow
Composer now supports saving checkpoints to and loading them from Databricks-managed MLFlow.
```python
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    logger=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
```
3. Better Communication Computation Overlap in FSDP
Composer now has improved communication/computation overlap in our FSDP code, which should improve MFU across several architectures.
4. Python 3.11 + Torch 2.2 Support
Composer now has initial support for Python 3.11 and Torch 2.2.
5. PEFT LoRA
PEFT LoRA is now supported in the HuggingFaceModel class.
6. Refactored Evaluation
`in_context_learning_evaluation.py` has a new design with cleaner abstractions and easier interfaces to work with.
7. Azure Checkpointing
Composer now supports saving your model in Azure.
8. MLFlow Checkpointing
Composer now supports saving your model in MLFlow.
Bug Fixes
- Fix MLFlowLogger test by @ngcgarcia in #2912
- Fix bug with CoT early stopping and LLama2 tokenizer by @bmosaicml in #2902
- Fix split_batch bug with empty generation_kwargs by @maxisawesome in #2913
- Only load RNG keys that exist by @mvpatel2000 in #2901
- Fix daily tests by @mvpatel2000 in #2891
- Fix seed for FSDP wrap by @mvpatel2000 in #2833
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Fix import for daily test by @snarayan21 in #2826
- Fix how single value tensors are logged by @aspfohl in #2831
- Fix torch bump by @j316chuck in #2855
- Fix MPS with sequence loss by @JAEarly in #2834
What's Changed
- Bump transformers version by @dakinggg in #2781
- Bump sphinxext-opengraph from 0.9.0 to 0.9.1 by @dependabot in #2784
- Bump coverage[toml] from 7.3.0 to 7.3.3 by @dependabot in #2783
- Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 by @dependabot in #2785
- [UCVolumes] Rely on databricks-sdk auth for the right requirements by @panchalhp-db in #2789
- Enable system metrics in mosaic mlflow logger by @chenmoneygithub in #2775
- Update parse_uri by @irenedea in #2787
- default to no torch profiler memory timeline by @cli99 in #2790
- Add eot token to ICL generate kwargs by @bmosaicml in #2782
- Add nightly image for torch 2.2.0-12-20-23 by @j316chuck in #2791
- Add torch nightly 12-13 by @j316chuck in #2792
- Add process group as arg to FSDP by @mvpatel2000 in #2794
- Bump coverage[toml] from 7.3.3 to 7.3.4 by @dependabot in #2798
- Bump ipykernel from 6.26.0 to 6.28.0 by @dependabot in #2806
- Bump junitparser from 3.1.0 to 3.1.1 by @dependabot in #2805
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #2807
- Avoid futures on close for MosaicML logger by @mvpatel2000 in #2804
- Require sync module states with HSDP by @mvpatel2000 in #2812
- Better communication computation overlap by @snarayan21 in #2811
- Improve error message for speed monitor by @mvpatel2000 in #2801
- Bump torch version -- DO NOT RELEASE by @mvpatel2000 in #2814
- Bump torchvision for nightly by @mvpatel2000 in #2815
- Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. by @snarayan21 in #2817
- Bump traitlets from 5.13.0 to 5.14.1 by @dependabot in #2822
- All unshard streams wait on computation every step by @snarayan21 in #2823
- Add encoding=utf-8 by @dakinggg in #2824
- [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore by @jerrychen109 in #2802
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
- checkpoint saver tracks all checkpoints/intervals in state by @aspfohl in #2819
- code-quality timeout update by @aspfohl in #2830
- Adds DTensor Support by @mvpatel2000 in #2821
- Remove duplicate checkpoint verifications by @eracah in #2828
- Remove fsdp patch for comm overlap by @mvpatel2000 in #2836
- Allow hsdp by @mvpatel2000 in #2838
- Bump torch 2.1.2 by @mvpatel2000 in #2840
- Upgrade pyright to 1.1.310 by @b-chu in #2841
- [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow by @jerrychen109 in #2810
- update nightly to torch 2.3 by @j316chuck in #2842
- Pin sphinxcontrib applehelp by @mvpatel2000 in #2854
- Torch 2.3 patch by @dakinggg in #2849
- Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 by @dependabot in #2866
- Rewrite to use individual state functions by @mvpatel2000 in #2860
- Add custom stopping criteria to ICL generate tasks by @bmosaicml in #2800
- Add save_ignore_keys by @mvpatel2000 in #2868
- Remome log debug by @mvpatel2000 in #2871
- Update monkeypatch to put barrier in optim load by @mvpatel2000 in #2874
- Remove toml by @b-chu in #2872
- Update license by @b-chu in #2875
- Add ignore_metrics field to the MLflow logger by @ngcgarcia in #2869
- Convert print to log.info by @mvpatel2000 in #2876
- Bump version to 0.18.0 by @irenedea in #2877
- Removed commented-out unshard streams patching. by @snarayan21 in #2873
- Make code quality workflow reusable by @b-chu in #2878
- Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in #2885
- Bump torchmetrics by @mvpatel2000 in #2890
- Bump transformers to 4.37 by @dakinggg in #2894
- Azure checkpointing support by @mvpatel2000 in #2893
- Pass PG into checkpoint load and load rng with state_dict by @mvpatel2000 in #2897
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Bump version to 0.18.1 by @b-chu in #2905
- Refactor in_context_learning_evaluation.py by @maxisawesome in #2713
- Fix FP8 checkpoint resumption with onnx export flag by @j316chuck in #2907
- Add Python 3.11 + FA 2.5.0 + Torch 2.3.0 Image by @KuuCi in #2898
- Add yamllint to pre commit by @b-chu in #2909
- Add ignore_hyperparameters to MLFlowLogger by @ngcgarcia in #2908
- Bump coverage[toml] from 7.3.4 to 7.4.1 by @dependabot in #2915
- Add checkpoint test for 0.18.1 by @b-chu in #2906
- Integrate PEFT LoRA with HuggingFaceModel by @dakinggg in #2829
New Contributors
- @jerrychen109 made their first contribution in #2802
- @JAEarly made their first contribution in https://github.com/mosa...
v0.18.2
Bug Fixes
- Fix lp layernorm weight by @snarayan21 in #2954
What's Changed
- Fix lp layernorm weight by @snarayan21 in #2954
- Bump version to 0.18.2 by @b-chu
Full Changelog: v0.18.1...v0.18.2
v0.18.1
Bug Fixes
- Fix MPS with sequence loss by @JAEarly in #2834
- Fix daily tests by @mvpatel2000 in #2891
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Only load RNG keys that exist by @mvpatel2000 in #2901
What's Changed
- Bump version to 0.18.0 by @irenedea in #2877
- Removed commented-out unshard streams patching. by @snarayan21 in #2873
- Make code quality workflow reusable by @b-chu in #2878
- Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in #2885
- Fix MPS with sequence loss by @JAEarly in #2834
- Bump torchmetrics by @mvpatel2000 in #2890
- Fix daily tests by @mvpatel2000 in #2891
- Bump transformers to 4.37 by @dakinggg in #2894
- Azure checkpointing support by @mvpatel2000 in #2893
- Pass PG into checkpoint load and load rng with state_dict by @mvpatel2000 in #2897
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Only load RNG keys that exist by @mvpatel2000 in #2901
- Bump version to 0.18.1 by @b-chu in #2905
Full Changelog: v0.18.0...v0.18.1
v0.18.0
This release has been yanked; please skip directly to Composer v0.18.1.
New Features
1. Improved DTensor Support
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
2. Checkpoint Saving and Loading from Databricks MLFlow
Composer now supports saving checkpoints to and loading them from Databricks-managed MLFlow.
```python
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    logger=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
```
Bug Fixes
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Fix import for daily test by @snarayan21 in #2826
- [S] Fix how single value tensors are logged by @aspfohl in #2831
Deprecations
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
What's Changed
- Bump transformers version by @dakinggg in #2781
- Bump sphinxext-opengraph from 0.9.0 to 0.9.1 by @dependabot in #2784
- Bump coverage[toml] from 7.3.0 to 7.3.3 by @dependabot in #2783
- Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 by @dependabot in #2785
- [UCVolumes] Rely on databricks-sdk auth for the right requirements by @panchalhp-db in #2789
- Enable system metrics in mosaic mlflow logger by @chenmoneygithub in #2775
- Update parse_uri by @irenedea in #2787
- default to no torch profiler memory timeline by @cli99 in #2790
- Add eot token to ICL generate kwargs by @bmosaicml in #2782
- Add nightly image for torch 2.2.0-12-20-23 by @j316chuck in #2791
- Add torch nightly 12-13 by @j316chuck in #2792
- Add process group as arg to FSDP by @mvpatel2000 in #2794
- Bump coverage[toml] from 7.3.3 to 7.3.4 by @dependabot in #2798
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Bump ipykernel from 6.26.0 to 6.28.0 by @dependabot in #2806
- Bump junitparser from 3.1.0 to 3.1.1 by @dependabot in #2805
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #2807
- Avoid futures on close for MosaicML logger by @mvpatel2000 in #2804
- Require sync module states with HSDP by @mvpatel2000 in #2812
- Better communication computation overlap by @snarayan21 in #2811
- Improve error message for speed monitor by @mvpatel2000 in #2801
- Bump torch version -- DO NOT RELEASE by @mvpatel2000 in #2814
- Bump torchvision for nightly by @mvpatel2000 in #2815
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. by @snarayan21 in #2817
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Bump traitlets from 5.13.0 to 5.14.1 by @dependabot in #2822
- All unshard streams wait on computation every step by @snarayan21 in #2823
- Add encoding=utf-8 by @dakinggg in #2824
- Fix import for daily test by @snarayan21 in #2826
- [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore by @jerrychen109 in #2802
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
- checkpoint saver tracks all checkpoints/intervals in state by @aspfohl in #2819
- code-quality timeout update by @aspfohl in #2830
- [S] Fix how single value tensors are logged by @aspfohl in #2831
- Adds DTensor Support by @mvpatel2000 in #2821
- Remove duplicate checkpoint verifications by @eracah in #2828
- Fix seed for FSDP wrap by @mvpatel2000 in #2833
- Remove fsdp patch for comm overlap by @mvpatel2000 in #2836
- Allow hsdp by @mvpatel2000 in #2838
- Bump torch 2.1.2 by @mvpatel2000 in #2840
- Upgrade pyright to 1.1.310 by @b-chu in #2841
- [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow by @jerrychen109 in #2810
- update nightly to torch 2.3 by @j316chuck in #2842
- Pin sphinxcontrib applehelp by @mvpatel2000 in #2854
- Fix torch bump by @j316chuck in #2855
- Torch 2.3 patch by @dakinggg in #2849
- Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 by @dependabot in #2866
- Rewrite to use individual state functions by @mvpatel2000 in #2860
- Add custom stopping criteria to ICL generate tasks by @bmosaicml in #2800
- Add save_ignore_keys by @mvpatel2000 in #2868
- Remome log debug by @mvpatel2000 in #2871
- Update monkeypatch to put barrier in optim load by @mvpatel2000 in #2874
- Remove toml by @b-chu in #2872
- Update license by @b-chu in #2875
- Add ignore_metrics field to the MLflow logger by @ngcgarcia in #2869
- Convert print to log.info by @mvpatel2000 in #2876
New Contributors
- @jerrychen109 made their first contribution in #2802
Full Changelog: v0.17.2...v0.18.0
v0.17.2
New Features
1. Torch 2.1.1 Support
Composer now supports torch 2.1.1! This new release primarily fixes several small bugs that we had previously monkeypatched in Composer.
2. Faster OCI Upload/Download
Composer now supports multi-part upload/download to OCI, which should speed up object store transfers.
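Multi-part transfer conceptually splits an object into fixed-size parts that can move in parallel and be reassembled in order. A stdlib-only sketch of the idea (not Composer's OCI implementation; the part numbering and sizing are illustrative):

```python
def iter_parts(data: bytes, part_size: int):
    """Yield (part_number, chunk) pairs for a multi-part transfer.

    Parts can be uploaded or downloaded concurrently and reassembled
    in part-number order to reconstruct the original object.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    for offset in range(0, len(data), part_size):
        yield offset // part_size, data[offset:offset + part_size]
```

Because each part is independent, a pool of workers can transfer them concurrently, which is where the speedup comes from.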
3. Memory Profiling
We've expanded the torch profiler integration to support memory profiling. Now, when the profiler is enabled, you will get a trace showing how memory utilization is broken down across various components on your GPUs.
Bug Fixes
1. FSDP Initialization with Meta
Previously, our FSDP integration had a bug when initializing weights with `device=meta`, which resulted in an additional scaling factor. This has now been fixed, so the choice of device and distributed strategy should no longer affect weight initialization.
What's Changed
- Override NVIDIA environment variable for CUDA 12.1 images by @bandish-shah in #2742
- Add NVIDIA_REQUIRE_CUDA_OVERRIDE env variable to Composer and Torch nightly Docker images by @bandish-shah in #2744
- Remove duplicated for loop in lr_monitor.py by @priba in #2738
- Fix console logger for small datasets. by @mvpatel2000 in #2746
- Add metadata logging for wandb by @jjanezhang in #2747
- Ignore load ignore keys by @mvpatel2000 in #2748
- Bump torch to 2.1.1 version by @j316chuck in #2717
- Add more info when run doesnt complete by @aspfohl in #2751
- Lower sequence generation length on code gen to be dependent on max canonical solution length by @bmosaicml in #2682
- Remove flatten params by @mvpatel2000 in #2761
- Fix GPU tests by @mvpatel2000 in #2767
- Fix GPU v2 by @mvpatel2000 in #2768
- Use time.tokens for speedmonitor instead of dataset length by @mvpatel2000 in #2762
- Remove BreakEpochException by @mvpatel2000 in #2759
- time to clean up time parsing 😉 by @aspfohl in #2770
- Upgrade RunConfig compute specification by @aspfohl in #2772
- Use async logging in MLflowLogger by @chenmoneygithub in #2693
- Fix FSDP _param_init_fn to not reinit parameters multiple times by @dakinggg in #2765
- Gate FSDP param init test on torch 2.1 by @dakinggg in #2774
- Parallelize OCI multipart download by @coryMosaicML in #2750
- [UCVolumes] Add support for list API by @panchalhp-db in #2769
- Add the memory timeline profiling support through the PyTorch profiler. by @cli99 in #2771
- Improve torch memory profiling arguments processing by @cli99 in #2777
- Bump aws of nccl version and enable aws platform support by @willgleich in #2776
- Extend checkpoint loading to accept a validation function by @irenedea in #2726
- Fix checkpoint validation tests for torch 1.13 by @irenedea in #2779
- Bump version to 0.17.2 by @mvpatel2000 in #2780
New Contributors
- @chenmoneygithub made their first contribution in #2693
Full Changelog: v0.17.1...v0.17.2
v0.17.1
Bug Fixes
1. MosaicML Logger Robustness (#2728)
We've improved the MosaicML logger to be more robust to faulty serialization.
What's Changed
- Add train finished run event by @jjanezhang in #2714
- Override nvidia env var for 11.8 by @dakinggg in #2722
- Update file exists checkpointing error messages to be more helpful by @irenedea in #2668
- [S] Add tag support to MLFlowLogger by @aspfohl in #2716
- Use `raise ... from e` to preserve stack trace by @irenedea in #2725
- add 0.17 to bcompat tests by @eracah in #2723
- Add support for canned ACL environment variable by @nik-mosaic in #2729
- Check serialization for JSON in mosaicml logger by @mvpatel2000 in #2728
- Fix profiler issue by @j316chuck in #2735
- Fix activation cpu offloading by @cli99 in #2724
- Bump version 0.17.1 by @mvpatel2000 in #2741
Full Changelog: v0.17.0...v0.17.1
v0.17.0
What's New
1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)
Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this will shard a model within a node and replicate across nodes, but Composer will accept a tuple of process groups to specify custom shard/replicate sizes. This can be specified in the FSDP config.
```python
composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    ...
)
```
`HYBRID_SHARD` will `FULL_SHARD` a model whereas `_HYBRID_SHARD_ZERO2` will `SHARD_GRAD_OP` within the shard block.
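To make the shard/replicate layout concrete, here is a toy calculation (not Composer or PyTorch code) of which ranks land in the same shard group versus the same replicate group, assuming contiguous shard groups of a fixed size, as in the default within-node sharding:

```python
def hsdp_groups(world_size: int, shard_size: int):
    """Compute HSDP rank groupings.

    Each shard group holds one full copy of the model, sharded across its
    members (e.g. the GPUs within a node); ranks at the same position in
    different shard groups hold identical shards and form a replicate group.
    """
    if world_size % shard_size != 0:
        raise ValueError("world_size must be divisible by shard_size")
    shard_groups = [list(range(start, start + shard_size))
                    for start in range(0, world_size, shard_size)]
    replicate_groups = [list(range(pos, world_size, shard_size))
                        for pos in range(shard_size)]
    return shard_groups, replicate_groups
```

For 16 ranks with 8-way sharding (two 8-GPU nodes), ranks 0-7 shard one copy of the model, and each rank i replicates its shard with rank i + 8.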
2. Train Loss NaN Monitor (#2704)
Composer has a new callback which will raise a value error if your loss NaNs out. This is very useful to avoid wasting compute if your training run diverges or fails for numerical reasons.
```python
from composer.callbacks import NaNMonitor

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    callbacks=NaNMonitor(),
    ...
)
```
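Conceptually, the callback's check amounts to inspecting the loss after each batch. A minimal sketch of the idea (a toy reimplementation, not Composer's actual callback, which also handles tensor losses on device):

```python
import math

def assert_loss_finite(loss):
    """Raise ValueError if any loss value is NaN, mirroring what a
    NaN-monitoring callback does after each training batch."""
    values = loss.values() if isinstance(loss, dict) else [loss]
    for value in values:
        if math.isnan(float(value)):
            raise ValueError("Train loss contains a NaN.")
```

Raising immediately halts training at the first diverged batch, rather than burning compute on a run that can no longer recover.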
Bug Fixes
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
What's Changed
- Add partial state dict functionality for FSDP by @b-chu in #2637
- Update monai requirement from <1.3,>=0.9.1 to >=0.9.1,<1.4 by @dependabot in #2643
- Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #2645
- Remove checkpoint on close by @mvpatel2000 in #2646
- Update latest to 2.1 by @mvpatel2000 in #2650
- HSDP Support by @mvpatel2000 in #2648
- Log profile averages by @j316chuck in #2647
- Daily API key by @mvpatel2000 in #2655
- Add automatic remote uploader downloader for composer profiler by @j316chuck in #2653
- Update the AWS_OFI_NCCL version and add in the MPI HWLOC install by @willgleich in #2651
- Fix GCP tests by @mvpatel2000 in #2658
- Allow no eval_loader when eval is disabled by @b-chu in #2657
- Gate HSDP by torch 2.1.0 by @mvpatel2000 in #2656
- Fix FSDP arg default to match torch by @mvpatel2000 in #2660
- Bump pypandoc from 1.11 to 1.12 by @dependabot in #2664
- Bump vit-pytorch from 0.35.8 to 1.6.1 by @dependabot in #2662
- Upgrade to transformers 4.34.1 by @dakinggg in #2635
- Update docker readme by @mvpatel2000 in #2669
- Add script to validate remote object store paths by @irenedea in #2667
- Torch 2.1 Resumption Support by @mvpatel2000 in #2665
- Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in #2663
- Fix dist by @mvpatel2000 in #2670
- Add torch nightly for torch 2.2.0 10-24 by @j316chuck in #2671
- Adding Model Data Init and Training Progress to MosaicMLLogger by @jjanezhang in #2633
- Bump pytest from 7.4.2 to 7.4.3 by @dependabot in #2678
- Bump sphinxext-opengraph from 0.8.2 to 0.9.0 by @dependabot in #2677
- Bump traitlets from 5.10.0 to 5.12.0 by @dependabot in #2674
- Bump cryptography from 41.0.4 to 41.0.5 by @dependabot in #2675
- Secure Code Eval changes by @mvpatel2000 in #2679
- Lazy validation of code eval metric by @mvpatel2000 in #2681
- Upgrade transformers to 4.35 by @dakinggg in #2684
- Bump traitlets from 5.12.0 to 5.13.0 by @dependabot in #2687
- Bump ipykernel from 6.25.2 to 6.26.0 by @dependabot in #2686
- Add Kwargs to upload_object by @nik-mosaic in #2692
- Add version number to composer metadata logs by @j316chuck in #2565
- Add distributed barrier test fixture to ensure pytest cleans up resources properly by @j316chuck in #2694
- Properly handle empty metric_names passed to Trainer._filter_metrics by @irenedea in #2700
- Train loss NaN checking callback by @coryMosaicML in #2704
- Adding logging and force flushing for run events by @jjanezhang in #2703
- [daily-test fix] Add rank 0 gating to test_elastic_resumption state dict comparison by @eracah in #2705
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Update types to follow PEP 585 by @b-chu in #2697
- Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in #2708
- Update wandb requirement from <0.16,>=0.13.2 to >=0.13.2,<0.17 by @dependabot in #2709
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Fix NaN monitor for loss dicts. by @coryMosaicML in #2712
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
- Fetching arguments for FSDP by @mvpatel2000 in #2710
- Bump version to 0.17 by @mvpatel2000 in #2711
New Contributors
- @willgleich made their first contribution in #2651
- @jjanezhang made their first contribution in #2633
Full Changelog: v0.16.4...v0.17.0
v0.16.4
What's New
1. Torch 2.1 Support
Composer officially supports PyTorch 2.1! We support several new features from 2.1, including `CustomPolicy`, which enables granular wrapping with FSDP.
What's Changed
- Add 0.16 checkpoint to backwards compatibility tests by @eracah in #2567
- Updating FSDP monkeypatch by @mvpatel2000 in #2571
- Add Databricks UC Volume Object Store by @panchalhp-db in #2548
- Fix pytest disk space OOM issue by adding tmp_path_retention_policy=None by @j316chuck in #2583
- Change daily nightly test version by @j316chuck in #2596
- Add save and register wrappers to mlflow logger by @dakinggg in #2579
- Missing () for or in auto microbatching gate by @mvpatel2000 in #2574
- Simplify FSDP Gradient Clipping by @mvpatel2000 in #2586
- Use FSDP CustomPolicy to support custom kwargs passed to different wrapped modules by @cli99 in #2585
- Free outputs callback by @mvpatel2000 in #2598
- Merge branch 'dev' into spr/dev/458c4e36 by @b-chu in #2595
- Fix a bug when batch type is dict and one of the values is the list by @mvpatel2000 in #2599
- Readme update by @ejyuen in #2581
- Add chain of thought eval by @bmosaicml in #2466
- Add torch 2.1.0 by @mvpatel2000 in #2602
- Change pr cpu and pr gpu test docker images by @j316chuck in #2611
- Change the tokenizer json file to read binary by @dakinggg in #2608
- [Docs] MLflow casing by @aspfohl in #2609
- Call generate callback at end of training by @aspfohl in #2607
- Refactor save interval and eval interval to share code by @dakinggg in #2600
- Deprecate many datasets and models by @mvpatel2000 in #2605
- Clean up gpu tests by @mvpatel2000 in #2612
- Remove apex test by @j316chuck in #2616
- Patch default precision by @mvpatel2000 in #2628
- Add logging for generate callbacks by @aspfohl in #2630
- Expose input_names and output_names when exporting to ONNX by @antoinebrl in #2601
- Bump version to 0.16.4 by @mvpatel2000 in #2627
New Contributors
- @panchalhp-db made their first contribution in #2548
- @cli99 made their first contribution in #2585
Full Changelog: v0.16.3...v0.16.4
v0.16.3
What's New
1. Add pass@k for HumanEval
HumanEval now supports pass@k. We also support first-class integration with the MosaicML platform for secure code evaluation.
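pass@k is usually computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): given n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k). A sketch of that formula (Composer's exact implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, passes the tests."""
    if n - c < k:
        # fewer than k failures exist, so any k-sample draw must
        # include a correct generation
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this estimate over all problems in the benchmark gives the reported pass@k score.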
2. `log_model` with MLFlow
The MLFlow integration now supports `log_model` at the end of the run.
What's Changed
- Update checkpoint.py by @b-chu in #2540
- Add log image to mlflow by @eracah in #2416
- Log runtime estimator units by @mvpatel2000 in #2542
- Bump traitlets from 5.9.0 to 5.10.0 by @dependabot in #2547
- Bump gitpython from 3.1.35 to 3.1.36 by @dependabot in #2546
- Bump ipykernel from 6.25.1 to 6.25.2 by @dependabot in #2544
- Add providers param to ONNX Session in tests by @nik-mosaic in #2553
- Bump flash attn by @mvpatel2000 in #2551
- Remove pin by @mvpatel2000 in #2554
- Change filter to include pull_request_target by @mvpatel2000 in #2557
- Downgrade nightly to previous version by @mvpatel2000 in #2556
- MCLI Code Eval by @rishab-partha in #2479
- Bump cryptography from 41.0.3 to 41.0.4 by @dependabot in #2559
- Bump gitpython from 3.1.36 to 3.1.37 by @dependabot in #2560
- Update numpy requirement from <1.26.0,>=1.21.5 to >=1.21.5,<1.27.0 by @dependabot in #2561
- Update support for HumanEval by @mcarbin in #2550
- Add log_model to MLFlowLogger by @dakinggg in #2541
- Bump version to 0.16.3 by @mvpatel2000 in #2566
Full Changelog: v0.16.2...v0.16.3
v0.16.2
What's New
1. PyTorch Nightly Support
Composer now supports PyTorch Nightly and CUDA 12! Along with new docker images based on nightly PyTorch versions and release candidates, we've updated our PyTorch monkeypatches to support the latest version of PyTorch. These monkeypatches add additional functionality for finer-grained FSDP wrapping and patch bugs related to sharded checkpoints. We are in the process of upstreaming these changes into PyTorch.
Bug Fixes
1. MosaicML Logger Robustness
The MosaicML logger is now robust to platform timeouts and other errors. Additionally, it can be disabled by setting the environment variable MOSAICML_PLATFORM to 'False' when training on the MosaicML platform.
2. GCS Integration
GCS authentication is now supported with HMAC keys, patching a bug in the previous implementation.
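With HMAC keys, authentication is driven by environment variables rather than a service-account file. A minimal sketch, assuming the GCS_KEY / GCS_SECRET variable names used by Composer's GCSObjectStore for HMAC credentials (check your Composer version's docs for the exact names; the values below are hypothetical):

```python
import os

# Hypothetical HMAC credential values, for illustration only.
os.environ["GCS_KEY"] = "my-hmac-access-id"
os.environ["GCS_SECRET"] = "my-hmac-secret"

# An object store configured this way can select HMAC auth when both
# variables are present, and fall back to service-account credentials
# otherwise.
hmac_configured = bool(os.environ.get("GCS_KEY")) and bool(os.environ.get("GCS_SECRET"))
```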
3. Optimizer Monitor Norm Calculation (#2531)
Previously, the optimizer monitor incorrectly reduced norms across GPUs. It now correctly computes norms in a distributed setting.
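The correct distributed reduction for an L2 gradient norm combines per-rank shard norms by summing their squares, not by summing the norms themselves. A toy illustration of that combination (not Composer's code):

```python
import math

def global_l2_norm(shard_norms):
    """Combine per-rank L2 norms of disjoint gradient shards into the
    global norm: sqrt of the sum of squared shard norms."""
    return math.sqrt(sum(n * n for n in shard_norms))
```

Summing the shard norms directly would overstate the true norm (e.g. shards with norms 3 and 4 have a global norm of 5, not 7), which is the kind of discrepancy this fix addresses.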
What's Changed
- fix: when there is no train_metrics, do not checkpoint by @furkanbiten in #2502
- Remove metric saving by @mvpatel2000 in #2514
- Fix daily tests by removing gpu marker by @j316chuck in #2515
- Refactor mosaic_fsdp.py by @b-chu in #2506
- Disable slack notifications for PRs by @mvpatel2000 in #2517
- Add custom sharding to ChunkShardingSpec by @b-chu in #2507
- Update nightly docker image to torch nightly 09-03-23 by @j316chuck in #2518
- Update pre-commit in setup.py by @b-chu in #2522
- Add FSDP custom wrap with torch 2.1 by @mvpatel2000 in #2460
- Fix GCSObjectStore bug where hmac keys auth doesn't work by @eracah in #2519
- Bump gitpython from 3.1.34 to 3.1.35 by @dependabot in #2525
- Bump pytest from 7.4.0 to 7.4.2 by @dependabot in #2523
- Upgrade to MLFlow version 2.5.0 by @ngcgarcia in #2528
- Disable cifar daily test by @mvpatel2000 in #2527
- Mosaicml logger robustness improvements by @mvpatel2000 in #2530
- Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation by @m1kol in #2531
- Fix github actions for GCS integration testing by @mvpatel2000 in #2532
- Fix GCS tests by @mvpatel2000 in #2535
- Change cast for mosaicml logger by @mvpatel2000 in #2538
- Bump Version to 0.16.2 by @mvpatel2000 in #2537
- Bump transformers version by @dakinggg in #2539
New Contributors
- @ngcgarcia made their first contribution in #2528
- @m1kol made their first contribution in #2531
Full Changelog: v0.16.1...v0.16.2