v0.3.2
What's Changed
- Support for bfloat16 (a usage sketch follows this list)
- Optimizations for top_k > 1
- Support for fully-sharded data parallelism
- Support for tensor model parallelism when expert_parallel_world_size > num_experts
- Optimizations for activation memory
- Support for activation quantization (thanks @dblalock!)
- Optimizations for SM90 (Hopper)
- Lots of bug fixes, cleanup and small optimizations
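
As a quick illustration of the bfloat16 and top_k items above, here is a minimal sketch of constructing a dropless MoE layer. The Arguments fields and the dMoE constructor shown here (moe_num_experts, moe_top_k, bf16) are assumptions about the megablocks.layers API, and the handling of the forward return value is likewise assumed rather than confirmed.

```python
# Minimal sketch: a bfloat16 dMoE layer with top-2 routing.
# The Arguments fields and dMoE constructor named here are assumptions;
# exact names and the forward return signature may differ from the release.
import torch

from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=8,
    moe_top_k=2,   # route each token to its top-2 experts
    bf16=True,     # run expert computation in bfloat16
    fp16=False,
)

layer = dMoE(args).cuda()
x = torch.randn(1, 2048, 1024, dtype=torch.bfloat16, device="cuda")
out = layer(x)  # return-value handling is an assumption; see the MoE layer source
```

Once a process group is initialized, the resulting module can be wrapped with PyTorch FSDP like any other torch.nn.Module, which is the path the fully-sharded data parallelism item refers to.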
New Contributors
- @vchiley made their first contribution in #9
- @deepakn94 made their first contribution in #16
- @b-chu made their first contribution in #19
Full Changelog: v0.1...v0.3.2