Releases: databricks/megablocks
v0.7.0
What's Changed
- Bump `_version.py` to 0.7.0.dev0 by @eitanturok in #148
- Remove deprecated torch.cuda.amp custom fwd and bwd by @snarayan21 in #150
- Implement Router Z-loss by @josejg in #151 (see the sketch after this list)
- Initialize default device lazily by @janEbert in #152
- Update router lint by @mihir-db in #158
- Bump torch 2.5.1 and upgrade to 0.8.0.dev0 by @j316chuck in #162
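The router z-loss added in #151 penalizes large router logits so the routing softmax stays well conditioned. A minimal sketch of the idea, not the exact MegaBlocks implementation; the function name and tensor shapes here are illustrative:

```python
import torch

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Z-loss from the ST-MoE paper: mean of (logsumexp over experts)^2.

    router_logits: [num_tokens, num_experts] raw (pre-softmax) router scores.
    Penalizing the log-partition function discourages very large logits,
    which keeps the routing softmax numerically stable.
    """
    z = torch.logsumexp(router_logits, dim=-1)  # [num_tokens]
    return (z ** 2).mean()
```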
New Contributors
- @josejg made their first contribution in #151
- @janEbert made their first contribution in #152
- @mihir-db made their first contribution in #158
Full Changelog: v0.6.1...v0.7.0
v0.6.1
What's New
Patch release to remove dependencies specified via GitHub and instead use released versions from PyPI (specifically, stanford-stk and grouped-gemm). This allows MegaBlocks itself to be released via PyPI.
What's Changed
- Remove direct dependencies, allowing for megablocks pypi release by @snarayan21 in #149
Full Changelog: v0.6.0...v0.6.1
v0.6.0
What's New
1. Torch 2.4 Compatibility (#145)
MegaBlocks now supports Torch 2.4!
2. New CI/CD
MegaBlocks has new GitHub Actions for better CI/CD! Now on every PR, MegaBlocks will automatically perform code linting and formatting (#131) and run tests on a GPU (#127).
3. Remove Weight Parallelism (#137)
Weight parallelism was unused, so it has been removed.
4. Shared Experts (#109)
Implement shared experts, based on the DeepSeekMoE paper.
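Conceptually, a shared expert is a dense MLP that every token passes through, with its output added to the routed experts' output. A rough sketch of the idea under that assumption; the module names are illustrative and the real MegaBlocks layers have different interfaces:

```python
import torch
import torch.nn as nn

class MoEWithSharedExpert(nn.Module):
    """Toy MoE block: routed experts plus an always-active shared expert."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int, routed_moe: nn.Module):
        super().__init__()
        # `routed_moe` is any module mapping x -> output of the same shape
        # (stand-in for a MegaBlocks MoE/dMoE layer).
        self.routed_moe = routed_moe
        self.shared_expert = nn.Sequential(
            nn.Linear(hidden_size, ffn_hidden_size),
            nn.GELU(),
            nn.Linear(ffn_hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every token goes through the shared expert; only the top-k routed
        # experts are evaluated per token inside the routed MoE.
        return self.routed_moe(x) + self.shared_expert(x)
```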
Bug Fixes
- Better handle incompatible ffn sizes (#108)
- Fix AMP for memory optimized options (#111)
- Don't save moe lb-loss tensors (#119)
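For context on the lb-loss fix (#119): the load-balancing loss is only computed when its weight is nonzero, so routing statistics no longer need to be stashed when moe_loss_weight is 0. A sketch of the pattern only; the names below are placeholders, not the actual MegaBlocks internals:

```python
# Placeholder names; not the MegaBlocks internals.
_LOSS_CACHE = []  # per-layer routing stats collected during the forward pass

def maybe_save_load_balancing_stats(moe_loss_weight, tokens_per_expert, router_scores):
    """Only stash routing statistics when the lb-loss will actually be used."""
    if moe_loss_weight == 0:
        # Nothing consumes these tensors, so don't keep them alive; saving
        # them would pin activation memory for the rest of the step.
        return
    _LOSS_CACHE.append((tokens_per_expert, router_scores))
```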
What's Changed
- Remove turbo by @dblalock in #96
- Update README.md by @dakinggg in #98
- Fix for `ffn_hidden_size` of 128, and better error message for incompatible ffn sizes by @snarayan21 in #108
- Add Shared Expert by @vchiley in #109
- Fix AMP for memory optimized options by @mvpatel2000 in #111
- bump and pin versions by @vchiley in #112
- dont save moe lb-loss tensors if args.moe_loss_weight=0 by @michael-go in #119
- bump by @vchiley in #116
- Minor changes to batched_load_balancing_loss function by @ShashankMosaicML in #121
- Migrate tests to pytest + add GA by @eitanturok in #127
- Change Runner in GA by @eitanturok in #129
- Clean up setup.py by @eitanturok in #128
- only run GA if repo owner is Databricks by @eitanturok in #135
- GA to Lint + Format MegaBlocks by @eitanturok in #131
- bump ci-testing to v0.1.2 by @eitanturok in #138
- remove weight parallelism by @eitanturok in #137
- refactor testing by @eitanturok in #140
- Type Checking by @eitanturok in #141
- Bump torch to <2.4.1 by @eitanturok in #145
New Contributors
- @dakinggg made their first contribution in #98
- @michael-go made their first contribution in #119
- @ShashankMosaicML made their first contribution in #121
Full Changelog: v0.5.1...v0.6.0
v0.5.1
What's Changed
- Update dependencies and package organization. by @tgale96 in #52
- Remove errant "*" in README by @tgale96 in #54
- Update Megatron-LM scripts and integration for latest Docker container. by @tgale96 in #55
- Update setup.py to support multiple device capabilities by @simon-mo in #56
- enable arg enabled normalization of routing weights by @vchiley in #58
- More customizable norm for expert weights by @snarayan21 in #60
- Update README.md by @eltociear in #63
- enable custom activation functions by @vchiley in #65
- Skip updating load balancing loss on eval by @sedrick-keh-tri in #69
- Change router weight norm from in-place by @sashaDoubov in #70
- add mem optimized grouped glu by @vchiley in #66
- Add cast to tensor for DTensor inputs for groupedmlp by @eracah in #71
- Dtensor to all paths by @mvpatel2000 in #73
- Refactor dtensor by @mvpatel2000 in #74
- Mem opt glu bkwd by @mvpatel2000 in #72
- Add dmlp registry args by @j316chuck in #75
- Fix default to be sparse by @mvpatel2000 in #76
- Fix `moe_normalize_expert_weights` when `top_k=1` by @152334H in #87 (see the sketch after this list)
- Updt triton pin by @vchiley in #89
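For background on the #87 fix: `moe_normalize_expert_weights` rescales each token's top-k routing weights by their p-norm before the expert outputs are combined, and top_k=1 is the degenerate case of that normalization. A minimal sketch of the normalization, assuming illustrative names rather than the exact MegaBlocks code:

```python
import torch

def normalize_expert_weights(expert_weights: torch.Tensor, p: float = 1.0) -> torch.Tensor:
    """Rescale each token's top-k routing weights by their p-norm.

    expert_weights: [num_tokens, top_k] softmax scores of the selected experts.
    With p=1 the selected weights are renormalized to sum to one.
    """
    norm = expert_weights.norm(p=p, dim=-1, keepdim=True)
    return expert_weights / norm
```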
New Contributors
- @simon-mo made their first contribution in #56
- @snarayan21 made their first contribution in #60
- @eltociear made their first contribution in #63
- @sedrick-keh-tri made their first contribution in #69
- @eracah made their first contribution in #71
- @j316chuck made their first contribution in #75
- @152334H made their first contribution in #87
Full Changelog: v0.5.0...v0.5.1
v0.5.0
What's New
Several improvements to avoid CPU <> GPU device synchronizations, GLU support, and support for some new models 👀
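GLU support (#38) replaces the expert MLP's single up-projection with a gated variant. A minimal sketch of a gated linear unit MLP, with illustrative module names rather than the MegaBlocks implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUMLP(nn.Module):
    """Gated MLP: the activation of one projection gates a second projection."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)  # gate projection
        self.v1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)  # up projection
        self.w2 = nn.Linear(ffn_hidden_size, hidden_size, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # e.g. SwiGLU when the gating activation is SiLU.
        return self.w2(F.silu(self.w1(x)) * self.v1(x))
```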
What's Changed
- Update version by @mvpatel2000 in #36
- Avoid duplicate `.cpu()` call by @mvpatel2000 in #37
- Have megablocks rely on torch default precision by @mvpatel2000 in #39
- Add GLU support by @sashaDoubov in #38
- Enable generic dimensionality for input by @vchiley in #41
- Removing an extra size call by @bcui19 in #43
- Fix bug in topology kernel for ffn_hidden_size>4096. by @tgale96 in #47
New Contributors
- @sashaDoubov made their first contribution in #38
- @bcui19 made their first contribution in #43
Full Changelog: v0.4.0...v0.5.0
v0.4.0
What's Changed
- Unpack saved context once by @mvpatel2000 in #33
- Refactoring class hierarchy for FSDP wrapping by @tgale96 in #34
Full Changelog: v0.3.3...v0.4.0
v0.3.3
What's Changed
Full Changelog: v0.3.2...v0.3.3
v0.3.2
What's Changed
- Support for bfloat16
- Optimizations for top_k > 1
- Support for fully-sharded data parallelism
- Support tensor model parallelism when expert_parallel_world_size > num_experts
- Optimizations for activation memory
- Support activation quantization (thanks @dblalock!)
- Optimizations for SM90 (Hopper)
- Lots of bug fixes, cleanup and small optimizations
New Contributors
- @vchiley made their first contribution in #9
- @deepakn94 made their first contribution in #16
- @b-chu made their first contribution in #19
Full Changelog: v0.1...v0.3.2
Version 0.1
Initial release documenting repository state prior to MLSys'23 camera-ready publication.