v0.14.2 Patch release
What's Changed
- Update version.txt after 0.14.1 release by @mrwyattii in #5413
- Remove dtype(fp16) condition check for residual_add unit test by @raza-sikander in #5329
- [XPU] Use non_daemonic_proc by default on XPU device by @ys950902 in #5412
- Fix a convergence issue in TP topology caused by incorrect grad_norm by @inkcherry in #5411
- Update 'create-pr' action in release workflow to latest by @loadams in #5415
- Update engine.py to avoid torch warning by @etiennebonnafoux in #5408
- Update _sidebar.scss by @fasterinnerlooper in #5293
- Add more tests into XPU CI by @Liangliang-Ma in #5427
- [CPU] Support SHM based inference_all_reduce in TorchBackend by @delock in #5391
- Add required paths to trigger AMD tests on PRs by @loadams in #5406
- Bug fix in `split_index` method by @bm-synth in #5292
- Parallel map step for `DistributedDataAnalyzer` map-reduce by @bm-synth in #5291
- Selective dequantization by @RezaYazdaniAminabadi in #5375
- Fix sorting of shard optimizer states files for universal checkpoint by @tohtana in #5395
- Add device config env for the accelerator by @shiyuan680 in #5396
- 64bit indexing fused adam by @garrett4wade in #5187
- Improve parallel process of universal checkpoint conversion by @tohtana in #5343
- Set the default to use `set_to_none` for clearing gradients in the BF16 optimizer by @inkcherry in #5434 (see the sketch after this list)
- OptimizedLinear implementation by @jeffra in #5355
- Update README.md by @Jhonso7393 in #5453
- Update PyTest torch version to match PyTorch latest official (2.3.0) by @loadams in #5454
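For context on the `set_to_none` default change above: in PyTorch, clearing gradients with `set_to_none=True` replaces each parameter's `.grad` with `None` instead of zero-filling the tensor, which saves memory and skips a kernel launch. A minimal sketch of the two behaviors in plain PyTorch (this illustrates the semantics only, not DeepSpeed's BF16 optimizer internals):

```python
import torch

# Toy model and optimizer; stands in for the optimizer's parameter groups.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()

# set_to_none=False: grads become zero-filled tensors (memory stays allocated).
opt.zero_grad(set_to_none=False)
print(model.weight.grad)  # tensor of zeros

loss = model(torch.randn(8, 4)).sum()
loss.backward()

# set_to_none=True: grads are dropped entirely; the next backward()
# allocates fresh gradient tensors.
opt.zero_grad(set_to_none=True)
print(model.weight.grad)  # None
```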
New Contributors
- @etiennebonnafoux made their first contribution in #5408
- @fasterinnerlooper made their first contribution in #5293
- @shiyuan680 made their first contribution in #5396
- @garrett4wade made their first contribution in #5187
- @Jhonso7393 made their first contribution in #5453
Full Changelog: v0.14.1...v0.14.2