Releases: mosaicml/composer
v0.19.0
What's New
1. Improved DTensor Support
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
2. Checkpoint Saving and Loading from Databricks MLFlow
Composer now supports saving checkpoints to and loading them from Databricks-managed MLFlow.
```python
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    logger=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
```
3. Better Communication Computation Overlap in FSDP
Composer now has improved communication/computation overlap in our FSDP code, which should improve MFU across several architectures.
4. Python 3.11 + Torch 2.2 Support
Composer now has initial support for Python 3.11 and Torch 2.2.
5. PEFT LoRA
PEFT LoRA is now supported in the HuggingFaceModel class.
6. Refactored Evaluation
`in_context_learning_evaluation.py` has a new design with cleaner abstractions and easier interfaces to work with.
7. Azure Checkpointing
Composer now supports saving your model in Azure.
8. MLFlow Checkpointing
Composer now supports saving your model in MLFlow.
Bug Fixes
- Fix MLFlowLogger test by @ngcgarcia in #2912
- Fix bug with CoT early stopping and LLama2 tokenizer by @bmosaicml in #2902
- Fix split_batch bug with empty generation_kwargs by @maxisawesome in #2913
- Only load RNG keys that exist by @mvpatel2000 in #2901
- Fix daily tests by @mvpatel2000 in #2891
- Fix seed for FSDP wrap by @mvpatel2000 in #2833
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Fix import for daily test by @snarayan21 in #2826
- Fix how single value tensors are logged by @aspfohl in #2831
- Fix torch bump by @j316chuck in #2855
- Fix MPS with sequence loss by @JAEarly in #2834
What's Changed
- Bump transformers version by @dakinggg in #2781
- Bump sphinxext-opengraph from 0.9.0 to 0.9.1 by @dependabot in #2784
- Bump coverage[toml] from 7.3.0 to 7.3.3 by @dependabot in #2783
- Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 by @dependabot in #2785
- [UCVolumes] Rely on databricks-sdk auth for the right requirements by @panchalhp-db in #2789
- Enable system metrics in mosaic mlflow logger by @chenmoneygithub in #2775
- Update parse_uri by @irenedea in #2787
- default to no torch profiler memory timeline by @cli99 in #2790
- Add eot token to ICL generate kwargs by @bmosaicml in #2782
- Add nightly image for torch 2.2.0-12-20-23 by @j316chuck in #2791
- Add torch nightly 12-13 by @j316chuck in #2792
- Add process group as arg to FSDP by @mvpatel2000 in #2794
- Bump coverage[toml] from 7.3.3 to 7.3.4 by @dependabot in #2798
- Bump ipykernel from 6.26.0 to 6.28.0 by @dependabot in #2806
- Bump junitparser from 3.1.0 to 3.1.1 by @dependabot in #2805
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #2807
- Avoid futures on close for MosaicML logger by @mvpatel2000 in #2804
- Require sync module states with HSDP by @mvpatel2000 in #2812
- Better communication computation overlap by @snarayan21 in #2811
- Improve error message for speed monitor by @mvpatel2000 in #2801
- Bump torch version -- DO NOT RELEASE by @mvpatel2000 in #2814
- Bump torchvision for nightly by @mvpatel2000 in #2815
- Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. by @snarayan21 in #2817
- Bump traitlets from 5.13.0 to 5.14.1 by @dependabot in #2822
- All unshard streams wait on computation every step by @snarayan21 in #2823
- Add encoding=utf-8 by @dakinggg in #2824
- [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore by @jerrychen109 in #2802
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
- checkpoint saver tracks all checkpoints/intervals in state by @aspfohl in #2819
- code-quality timeout update by @aspfohl in #2830
- Adds DTensor Support by @mvpatel2000 in #2821
- Remove duplicate checkpoint verifications by @eracah in #2828
- Remove fsdp patch for comm overlap by @mvpatel2000 in #2836
- Allow hsdp by @mvpatel2000 in #2838
- Bump torch 2.1.2 by @mvpatel2000 in #2840
- Upgrade pyright to 1.1.310 by @b-chu in #2841
- [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow by @jerrychen109 in #2810
- update nightly to torch 2.3 by @j316chuck in #2842
- Pin sphinxcontrib applehelp by @mvpatel2000 in #2854
- Torch 2.3 patch by @dakinggg in #2849
- Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 by @dependabot in #2866
- Rewrite to use individual state functions by @mvpatel2000 in #2860
- Add custom stopping criteria to ICL generate tasks by @bmosaicml in #2800
- Add save_ignore_keys by @mvpatel2000 in #2868
- Remome log debug by @mvpatel2000 in #2871
- Update monkeypatch to put barrier in optim load by @mvpatel2000 in #2874
- Remove toml by @b-chu in #2872
- Update license by @b-chu in #2875
- Add ignore_metrics field to the MLflow logger by @ngcgarcia in #2869
- Convert print to log.info by @mvpatel2000 in #2876
- Bump version to 0.18.0 by @irenedea in #2877
- Removed commented-out unshard streams patching. by @snarayan21 in #2873
- Make code quality workflow reusable by @b-chu in #2878
- Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in #2885
- Bump torchmetrics by @mvpatel2000 in #2890
- Bump transformers to 4.37 by @dakinggg in #2894
- Azure checkpointing support by @mvpatel2000 in #2893
- Pass PG into checkpoint load and load rng with state_dict by @mvpatel2000 in #2897
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Bump version to 0.18.1 by @b-chu in #2905
- Refactor in_context_learning_evaluation.py by @maxisawesome in #2713
- Fix FP8 checkpoint resumption with onnx export flag by @j316chuck in #2907
- Add Python 3.11 + FA 2.5.0 + Torch 2.3.0 Image by @KuuCi in #2898
- Add yamllint to pre commit by @b-chu in #2909
- Add ignore_hyperparameters to MLFlowLogger by @ngcgarcia in #2908
- Bump coverage[toml] from 7.3.4 to 7.4.1 by @dependabot in #2915
- Add checkpoint test for 0.18.1 by @b-chu in #2906
- Integrate PEFT LoRA with HuggingFaceModel by @dakinggg in #2829
New Contributors
- @jerrychen109 made their first contribution in #2802
- @JAEarly made their first contribution in https://github.com/mosa...
v0.18.2
Bug Fixes
- Fix lp layernorm weight by @snarayan21 in #2954
What's Changed
- Fix lp layernorm weight by @snarayan21 in #2954
- Bump version to 0.18.2 by @b-chu
Full Changelog: v0.18.1...v0.18.2
v0.18.1
Bug Fixes
- Fix MPS with sequence loss by @JAEarly in #2834
- Fix daily tests by @mvpatel2000 in #2891
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Only load RNG keys that exist by @mvpatel2000 in #2901
What's Changed
- Bump version to 0.18.0 by @irenedea in #2877
- Removed commented-out unshard streams patching. by @snarayan21 in #2873
- Make code quality workflow reusable by @b-chu in #2878
- Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in #2885
- Fix MPS with sequence loss by @JAEarly in #2834
- Bump torchmetrics by @mvpatel2000 in #2890
- Fix daily tests by @mvpatel2000 in #2891
- Bump transformers to 4.37 by @dakinggg in #2894
- Azure checkpointing support by @mvpatel2000 in #2893
- Pass PG into checkpoint load and load rng with state_dict by @mvpatel2000 in #2897
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Only load RNG keys that exist by @mvpatel2000 in #2901
- Bump version to 0.18.1 by @b-chu in #2905
Full Changelog: v0.18.0...v0.18.1
v0.18.0
This release has been yanked; please skip directly to Composer v0.18.1.
New Features
1. Improved DTensor Support
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
2. Checkpoint Saving and Loading from Databricks MLFlow
Composer now supports saving checkpoints to and loading them from Databricks-managed MLFlow.
```python
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    logger=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
```
Bug Fixes
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Fix import for daily test by @snarayan21 in #2826
- [S] Fix how single value tensors are logged by @aspfohl in #2831
Deprecations
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
What's Changed
- Bump transformers version by @dakinggg in #2781
- Bump sphinxext-opengraph from 0.9.0 to 0.9.1 by @dependabot in #2784
- Bump coverage[toml] from 7.3.0 to 7.3.3 by @dependabot in #2783
- Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 by @dependabot in #2785
- [UCVolumes] Rely on databricks-sdk auth for the right requirements by @panchalhp-db in #2789
- Enable system metrics in mosaic mlflow logger by @chenmoneygithub in #2775
- Update parse_uri by @irenedea in #2787
- default to no torch profiler memory timeline by @cli99 in #2790
- Add eot token to ICL generate kwargs by @bmosaicml in #2782
- Add nightly image for torch 2.2.0-12-20-23 by @j316chuck in #2791
- Add torch nightly 12-13 by @j316chuck in #2792
- Add process group as arg to FSDP by @mvpatel2000 in #2794
- Bump coverage[toml] from 7.3.3 to 7.3.4 by @dependabot in #2798
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Bump ipykernel from 6.26.0 to 6.28.0 by @dependabot in #2806
- Bump junitparser from 3.1.0 to 3.1.1 by @dependabot in #2805
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #2807
- Avoid futures on close for MosaicML logger by @mvpatel2000 in #2804
- Require sync module states with HSDP by @mvpatel2000 in #2812
- Better communication computation overlap by @snarayan21 in #2811
- Improve error message for speed monitor by @mvpatel2000 in #2801
- Bump torch version -- DO NOT RELEASE by @mvpatel2000 in #2814
- Bump torchvision for nightly by @mvpatel2000 in #2815
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. by @snarayan21 in #2817
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Bump traitlets from 5.13.0 to 5.14.1 by @dependabot in #2822
- All unshard streams wait on computation every step by @snarayan21 in #2823
- Add encoding=utf-8 by @dakinggg in #2824
- Fix import for daily test by @snarayan21 in #2826
- [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore by @jerrychen109 in #2802
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
- checkpoint saver tracks all checkpoints/intervals in state by @aspfohl in #2819
- code-quality timeout update by @aspfohl in #2830
- [S] Fix how single value tensors are logged by @aspfohl in #2831
- Adds DTensor Support by @mvpatel2000 in #2821
- Remove duplicate checkpoint verifications by @eracah in #2828
- Fix seed for FSDP wrap by @mvpatel2000 in #2833
- Remove fsdp patch for comm overlap by @mvpatel2000 in #2836
- Allow hsdp by @mvpatel2000 in #2838
- Bump torch 2.1.2 by @mvpatel2000 in #2840
- Upgrade pyright to 1.1.310 by @b-chu in #2841
- [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow by @jerrychen109 in #2810
- update nightly to torch 2.3 by @j316chuck in #2842
- Pin sphinxcontrib applehelp by @mvpatel2000 in #2854
- Fix torch bump by @j316chuck in #2855
- Torch 2.3 patch by @dakinggg in #2849
- Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 by @dependabot in #2866
- Rewrite to use individual state functions by @mvpatel2000 in #2860
- Add custom stopping criteria to ICL generate tasks by @bmosaicml in #2800
- Add save_ignore_keys by @mvpatel2000 in #2868
- Remome log debug by @mvpatel2000 in #2871
- Update monkeypatch to put barrier in optim load by @mvpatel2000 in #2874
- Remove toml by @b-chu in #2872
- Update license by @b-chu in #2875
- Add ignore_metrics field to the MLflow logger by @ngcgarcia in #2869
- Convert print to log.info by @mvpatel2000 in #2876
New Contributors
- @jerrychen109 made their first contribution in #2802
Full Changelog: v0.17.2...v0.18.0
v0.17.2
New Features
1. Torch 2.1.1 Support
Composer now supports torch 2.1.1! This new release primarily fixes several small bugs that we had previously monkeypatched in Composer.
2. Faster OCI Upload/Download
Composer now supports multi-part upload/download to OCI, which should speed up object store transfers.
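Multi-part transfer conceptually splits an object into fixed-size parts that can move in parallel and be reassembled in order. A stdlib-only sketch of the idea (not Composer's OCI implementation; the part numbering and sizing are illustrative):

```python
def iter_parts(data: bytes, part_size: int):
    """Yield (part_number, chunk) pairs for a multi-part transfer.

    Parts can be uploaded or downloaded concurrently and reassembled
    in part-number order to reconstruct the original object.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    for offset in range(0, len(data), part_size):
        yield offset // part_size, data[offset:offset + part_size]
```

Because each part is independent, a pool of workers can transfer them concurrently, which is where the speedup comes from.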
3. Memory Profiling
We've expanded the torch profiler integration to support memory profiling. Now, when the profiler is enabled, you will get a trace showing how memory utilization is broken down across various components on your GPUs.
Bug Fixes
1. FSDP Initialization with Meta
Previously, our FSDP integration had a bug when initializing weights with `device=meta`, which resulted in an additional scaling factor. This has now been fixed, so the choice of device and distributed strategy should no longer affect weight initialization.
What's Changed
- Override NVIDIA environment variable for CUDA 12.1 images by @bandish-shah in #2742
- Add NVIDIA_REQUIRE_CUDA_OVERRIDE env variable to Composer and Torch nightly Docker images by @bandish-shah in #2744
- Remove duplicated for loop in lr_monitor.py by @priba in #2738
- Fix console logger for small datasets. by @mvpatel2000 in #2746
- Add metadata logging for wandb by @jjanezhang in #2747
- Ignore load ignore keys by @mvpatel2000 in #2748
- Bump torch to 2.1.1 version by @j316chuck in #2717
- Add more info when run doesnt complete by @aspfohl in #2751
- Lower sequence generation length on code gen to be dependent on max canonical solution length by @bmosaicml in #2682
- Remove flatten params by @mvpatel2000 in #2761
- Fix GPU tests by @mvpatel2000 in #2767
- Fix GPU v2 by @mvpatel2000 in #2768
- Use time.tokens for speedmonitor instead of dataset length by @mvpatel2000 in #2762
- Remove BreakEpochException by @mvpatel2000 in #2759
- time to clean up time parsing 😉 by @aspfohl in #2770
- Upgrade RunConfig compute specification by @aspfohl in #2772
- Use async logging in MLflowLogger by @chenmoneygithub in #2693
- Fix FSDP _param_init_fn to not reinit parameters multiple times by @dakinggg in #2765
- Gate FSDP param init test on torch 2.1 by @dakinggg in #2774
- Parallelize OCI multipart download by @coryMosaicML in #2750
- [UCVolumes] Add support for list API by @panchalhp-db in #2769
- Add the memory timeline profiling support through the PyTorch profiler. by @cli99 in #2771
- Improve torch memory profiling arguments processing by @cli99 in #2777
- Bump aws of nccl version and enable aws platform support by @willgleich in #2776
- Extend checkpoint loading to accept a validation function by @irenedea in #2726
- Fix checkpoint validation tests for torch 1.13 by @irenedea in #2779
- Bump version to 0.17.2 by @mvpatel2000 in #2780
New Contributors
- @chenmoneygithub made their first contribution in #2693
Full Changelog: v0.17.1...v0.17.2
v0.17.1
Bug Fixes
1. MosaicML Logger Robustness (#2728)
We've improved the MosaicML logger to be more robust to faulty serialization.
What's Changed
- Add train finished run event by @jjanezhang in #2714
- Override nvidia env var for 11.8 by @dakinggg in #2722
- Update file exists checkpointing error messages to be more helpful by @irenedea in #2668
- [S] Add tag support to MLFlowLogger by @aspfohl in #2716
- Use `raise ... from e` to preserve stack trace by @irenedea in #2725
- add 0.17 to bcompat tests by @eracah in #2723
- Add support for canned ACL environment variable by @nik-mosaic in #2729
- Check serialization for JSON in mosaicml logger by @mvpatel2000 in #2728
- Fix profiler issue by @j316chuck in #2735
- Fix activation cpu offloading by @cli99 in #2724
- Bump version 0.17.1 by @mvpatel2000 in #2741
Full Changelog: v0.17.0...v0.17.1
v0.17.0
What's New
1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)
Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this will shard a model within a node and replicate across nodes, but Composer will accept a tuple of process groups to specify custom shard/replicate sizes. This can be specified in the FSDP config.
```python
composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    ...
)
```
`HYBRID_SHARD` will `FULL_SHARD` a model whereas `_HYBRID_SHARD_ZERO2` will `SHARD_GRAD_OP` within the shard block.
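To make the shard/replicate layout concrete, here is a toy calculation (not Composer or PyTorch code) of which ranks land in the same shard group versus the same replicate group, assuming contiguous shard groups of a fixed size, as in the default within-node sharding:

```python
def hsdp_groups(world_size: int, shard_size: int):
    """Compute HSDP rank groupings.

    Each shard group holds one full copy of the model, sharded across its
    members (e.g. the GPUs within a node); ranks at the same position in
    different shard groups hold identical shards and form a replicate group.
    """
    if world_size % shard_size != 0:
        raise ValueError("world_size must be divisible by shard_size")
    shard_groups = [list(range(start, start + shard_size))
                    for start in range(0, world_size, shard_size)]
    replicate_groups = [list(range(pos, world_size, shard_size))
                        for pos in range(shard_size)]
    return shard_groups, replicate_groups
```

For 16 ranks with 8-way sharding (two 8-GPU nodes), ranks 0-7 shard one copy of the model, and each rank i replicates its shard with rank i + 8.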
2. Train Loss NaN Monitor (#2704)
Composer has a new callback which will raise a value error if your loss NaNs out. This is very useful to avoid wasting compute if your training run diverges or fails for numerical reasons.
```python
from composer.callbacks import NaNMonitor

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    callbacks=NaNMonitor(),
    ...
)
```
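Conceptually, the callback's check amounts to inspecting the loss after each batch. A minimal sketch of the idea (a toy reimplementation, not Composer's actual callback, which also handles tensor losses on device):

```python
import math

def assert_loss_finite(loss):
    """Raise ValueError if any loss value is NaN, mirroring what a
    NaN-monitoring callback does after each training batch."""
    values = loss.values() if isinstance(loss, dict) else [loss]
    for value in values:
        if math.isnan(float(value)):
            raise ValueError("Train loss contains a NaN.")
```

Raising immediately halts training at the first diverged batch, rather than burning compute on a run that can no longer recover.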
Bug Fixes
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
What's Changed
- Add partial state dict functionality for FSDP by @b-chu in #2637
- Update monai requirement from <1.3,>=0.9.1 to >=0.9.1,<1.4 by @dependabot in #2643
- Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #2645
- Remove checkpoint on close by @mvpatel2000 in #2646
- Update latest to 2.1 by @mvpatel2000 in #2650
- HSDP Support by @mvpatel2000 in #2648
- Log profile averages by @j316chuck in #2647
- Daily API key by @mvpatel2000 in #2655
- Add automatic remote uploader downloader for composer profiler by @j316chuck in #2653
- Update the AWS_OFI_NCCL version and add in the MPI HWLOC install by @willgleich in #2651
- Fix GCP tests by @mvpatel2000 in #2658
- Allow no eval_loader when eval is disabled by @b-chu in #2657
- Gate HSDP by torch 2.1.0 by @mvpatel2000 in #2656
- Fix FSDP arg default to match torch by @mvpatel2000 in #2660
- Bump pypandoc from 1.11 to 1.12 by @dependabot in #2664
- Bump vit-pytorch from 0.35.8 to 1.6.1 by @dependabot in #2662
- Upgrade to transformers 4.34.1 by @dakinggg in #2635
- Update docker readme by @mvpatel2000 in #2669
- Add script to validate remote object store paths by @irenedea in #2667
- Torch 2.1 Resumption Support by @mvpatel2000 in #2665
- Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in #2663
- Fix dist by @mvpatel2000 in #2670
- Add torch nightly for torch 2.2.0 10-24 by @j316chuck in #2671
- Adding Model Data Init and Training Progress to MosaicMLLogger by @jjanezhang in #2633
- Bump pytest from 7.4.2 to 7.4.3 by @dependabot in #2678
- Bump sphinxext-opengraph from 0.8.2 to 0.9.0 by @dependabot in #2677
- Bump traitlets from 5.10.0 to 5.12.0 by @dependabot in #2674
- Bump cryptography from 41.0.4 to 41.0.5 by @dependabot in #2675
- Secure Code Eval changes by @mvpatel2000 in #2679
- Lazy validation of code eval metric by @mvpatel2000 in #2681
- Upgrade transformers to 4.35 by @dakinggg in #2684
- Bump traitlets from 5.12.0 to 5.13.0 by @dependabot in #2687
- Bump ipykernel from 6.25.2 to 6.26.0 by @dependabot in #2686
- Add Kwargs to upload_object by @nik-mosaic in #2692
- Add version number to composer metadata logs by @j316chuck in #2565
- Add distributed barrier test fixture to ensure pytest cleans up resources properly by @j316chuck in #2694
- Properly handle empty metric_names passed to Trainer._filter_metrics by @irenedea in #2700
- Train loss NaN checking callback by @coryMosaicML in #2704
- Adding logging and force flushing for run events by @jjanezhang in #2703
- [daily-test fix] Add rank 0 gating to test_elastic_resumption state dict comparison by @eracah in #2705
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Update types to follow PEP 585 by @b-chu in #2697
- Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in #2708
- Update wandb requirement from <0.16,>=0.13.2 to >=0.13.2,<0.17 by @dependabot in #2709
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Fix NaN monitor for loss dicts. by @coryMosaicML in #2712
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
- Fetching arguments for FSDP by @mvpatel2000 in #2710
- Bump version to 0.17 by @mvpatel2000 in #2711
New Contributors
- @willgleich made their first contribution in #2651
- @jjanezhang made their first contribution in #2633
Full Changelog: v0.16.4...v0.17.0
v0.16.4
What's New
1. Torch 2.1 Support
Composer officially supports PyTorch 2.1! We support several new features from 2.1, including `CustomPolicy`, which enables granular wrapping with FSDP.
What's Changed
- Add 0.16 checkpoint to backwards compatibility tests by @eracah in #2567
- Updating FSDP monkeypatch by @mvpatel2000 in #2571
- Add Databricks UC Volume Object Store by @panchalhp-db in #2548
- Fix pytest disk space OOM issue by adding tmp_path_retention_policy=None by @j316chuck in #2583
- Change daily nightly test version by @j316chuck in #2596
- Add save and register wrappers to mlflow logger by @dakinggg in #2579
- Missing () for or in auto microbatching gate by @mvpatel2000 in #2574
- Simplify FSDP Gradient Clipping by @mvpatel2000 in #2586
- Use FSDP CustomPolicy to support custom kwargs passed to different wrapped modules by @cli99 in #2585
- Free outputs callback by @mvpatel2000 in #2598
- Merge branch 'dev' into spr/dev/458c4e36 by @b-chu in #2595
- Fix a bug when batch type is dict and one of the values is the list by @mvpatel2000 in #2599
- Readme update by @ejyuen in #2581
- Add chain of thought eval by @bmosaicml in #2466
- Add torch 2.1.0 by @mvpatel2000 in #2602
- Change pr cpu and pr gpu test docker images by @j316chuck in #2611
- Change the tokenizer json file to read binary by @dakinggg in #2608
- [Docs] MLflow casing by @aspfohl in #2609
- Call generate callback at end of training by @aspfohl in #2607
- Refactor save interval and eval interval to share code by @dakinggg in #2600
- Deprecate many datasets and models by @mvpatel2000 in #2605
- Clean up gpu tests by @mvpatel2000 in #2612
- Remove apex test by @j316chuck in #2616
- Patch default precision by @mvpatel2000 in #2628
- Add logging for generate callbacks by @aspfohl in #2630
- Expose input_names and output_names when exporting to ONNX by @antoinebrl in #2601
- Bump version to 0.16.4 by @mvpatel2000 in #2627
New Contributors
- @panchalhp-db made their first contribution in #2548
- @cli99 made their first contribution in #2585
Full Changelog: v0.16.3...v0.16.4
v0.16.3
What's New
1. Add pass@k for HumanEval
HumanEval now supports pass@k. We also support first-class integration with the MosaicML platform for secure code evaluation.
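pass@k is usually computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): given n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k). A sketch of that formula (Composer's exact implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, passes the tests."""
    if n - c < k:
        # fewer than k failures exist, so any k-sample draw must
        # include a correct generation
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this estimate over all problems in the benchmark gives the reported pass@k score.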
2. `log_model` with MLFlow
The MLFlow integration now supports `log_model` at the end of the run.
What's Changed
- Update checkpoint.py by @b-chu in #2540
- Add log image to mlflow by @eracah in #2416
- Log runtime estimator units by @mvpatel2000 in #2542
- Bump traitlets from 5.9.0 to 5.10.0 by @dependabot in #2547
- Bump gitpython from 3.1.35 to 3.1.36 by @dependabot in #2546
- Bump ipykernel from 6.25.1 to 6.25.2 by @dependabot in #2544
- Add providers param to ONNX Session in tests by @nik-mosaic in #2553
- Bump flash attn by @mvpatel2000 in #2551
- Remove pin by @mvpatel2000 in #2554
- Change filter to include pull_request_target by @mvpatel2000 in #2557
- Downgrade nightly to previous version by @mvpatel2000 in #2556
- MCLI Code Eval by @rishab-partha in #2479
- Bump cryptography from 41.0.3 to 41.0.4 by @dependabot in #2559
- Bump gitpython from 3.1.36 to 3.1.37 by @dependabot in #2560
- Update numpy requirement from <1.26.0,>=1.21.5 to >=1.21.5,<1.27.0 by @dependabot in #2561
- Update support for HumanEval by @mcarbin in #2550
- Add log_model to MLFlowLogger by @dakinggg in #2541
- Bump version to 0.16.3 by @mvpatel2000 in #2566
Full Changelog: v0.16.2...v0.16.3
v0.16.2
What's New
1. PyTorch Nightly Support
Composer now supports PyTorch Nightly and CUDA 12! Along with new docker images based on nightly PyTorch versions and release candidates, we've updated our PyTorch monkeypatches to support the latest version of PyTorch. These monkeypatches add additional functionality for finer-grained FSDP wrapping and patch bugs related to sharded checkpoints. We are in the process of upstreaming these changes into PyTorch.
Bug Fixes
1. MosaicML Logger Robustness
The MosaicML logger is now robust to platform timeouts and other errors. Additionally, it can be disabled by setting the environment variable MOSAICML_PLATFORM to 'False' when training on the MosaicML platform.
2. GCS Integration
GCS authentication is now supported with HMAC keys, patching a bug in the previous implementation.
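With HMAC keys, authentication is driven by environment variables rather than a service-account file. A minimal sketch, assuming the GCS_KEY / GCS_SECRET variable names used by Composer's GCSObjectStore for HMAC credentials (check your Composer version's docs for the exact names; the values below are hypothetical):

```python
import os

# Hypothetical HMAC credential values, for illustration only.
os.environ["GCS_KEY"] = "my-hmac-access-id"
os.environ["GCS_SECRET"] = "my-hmac-secret"

# An object store configured this way can select HMAC auth when both
# variables are present, and fall back to service-account credentials
# otherwise.
hmac_configured = bool(os.environ.get("GCS_KEY")) and bool(os.environ.get("GCS_SECRET"))
```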
3. Optimizer Monitor Norm Calculation (#2531)
Previously, the optimizer monitor incorrectly reduced norms across GPUs. It now correctly computes norms in a distributed setting.
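The correct distributed reduction for an L2 gradient norm combines per-rank shard norms by summing their squares, not by summing the norms themselves. A toy illustration of that combination (not Composer's code):

```python
import math

def global_l2_norm(shard_norms):
    """Combine per-rank L2 norms of disjoint gradient shards into the
    global norm: sqrt of the sum of squared shard norms."""
    return math.sqrt(sum(n * n for n in shard_norms))
```

Summing the shard norms directly would overstate the true norm (e.g. shards with norms 3 and 4 have a global norm of 5, not 7), which is the kind of discrepancy this fix addresses.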
What's Changed
- fix: when there is no train_metrics, do not checkpoint by @furkanbiten in #2502
- Remove metric saving by @mvpatel2000 in #2514
- Fix daily tests by removing gpu marker by @j316chuck in #2515
- Refactor mosaic_fsdp.py by @b-chu in #2506
- Disable slack notifications for PRs by @mvpatel2000 in #2517
- Add custom sharding to ChunkShardingSpec by @b-chu in #2507
- Update nightly docker image to torch nightly 09-03-23 by @j316chuck in #2518
- Update pre-commit in setup.py by @b-chu in #2522
- Add FSDP custom wrap with torch 2.1 by @mvpatel2000 in #2460
- Fix GCSObjectStore bug where hmac keys auth doesn't work by @eracah in #2519
- Bump gitpython from 3.1.34 to 3.1.35 by @dependabot in #2525
- Bump pytest from 7.4.0 to 7.4.2 by @dependabot in #2523
- Upgrade to MLFlow version 2.5.0 by @ngcgarcia in #2528
- Disable cifar daily test by @mvpatel2000 in #2527
- Mosaicml logger robustness improvements by @mvpatel2000 in #2530
- Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation by @m1kol in #2531
- Fix github actions for GCS integration testing by @mvpatel2000 in #2532
- Fix GCS tests by @mvpatel2000 in #2535
- Change cast for mosaicml logger by @mvpatel2000 in #2538
- Bump Version to 0.16.2 by @mvpatel2000 in #2537
- Bump transformers version by @dakinggg in #2539
New Contributors
- @ngcgarcia made their first contribution in #2528
- @m1kol made their first contribution in #2531
Full Changelog: v0.16.1...v0.16.2