Releases · laekov/fastmoe
v1.1.0
Performance
- The smart schedule of FasterMoE now uses correct stream management and runs faster.
Testing
- All unit tests have been checked and now run correctly.
Adaptation
- Megatron-LM 3.2 supported.
Documentation
- README is updated and several errors in it are fixed.
- A detailed document on process groups is added.
v1.0.1
Compatibility
- PyTorch 2.0 supported.
- Megatron-LM 2.5 supported.
Documentation
- A detailed [installation guide](installation-guide.md), thanks to @santurini.
Performance
- Generalize FasterMoE's schedule to `n_expert > 1`, along with more bug fixes (see the sketch after this list).
- Synchronization reduction, thanks to @Fragile-azalea.
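A minimal sketch of a layer that exercises the generalized schedule, following the README's `FMoETransformerMLP` example; the keyword names and single-process setup here are assumptions.

```python
import torch
from fmoe.transformer import FMoETransformerMLP

# Two local experts per worker, so the generalized FasterMoE schedule
# (n_expert > 1) is exercised. Keyword names are assumed from the README.
moe_layer = FMoETransformerMLP(
    num_expert=2,
    d_model=1024,
    d_hidden=4096,
    world_size=1,   # single process for this sketch
    top_k=2,
).cuda()            # FastMoE's kernels require a CUDA device

x = torch.randn(8, 1024, device="cuda")  # (tokens, d_model)
y = moe_layer(x)
print(y.shape)  # torch.Size([8, 1024])
```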
v1.0.0
FasterMoE
- The new performance-boosting features from the PPoPP'22 paper FasterMoE, detailed in the document (a usage sketch follows this list):
- Expert Shadowing.
- Smart Scheduling.
- Topology-aware gate.
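These features are toggled through environment variables rather than new APIs. The variable names below are assumptions recalled from the FasterMoE document, which remains the authoritative reference.

```python
import os

# Assumed switches from the FasterMoE document; treat the exact
# names as assumptions and consult doc/fastermoe for the real list.
os.environ["FMOE_FASTER_SCHEDULE_ENABLE"] = "1"  # smart scheduling
os.environ["FMOE_FASTER_SHADOW_ENABLE"] = "1"    # expert shadowing

# Import fastmoe after setting the switches so the FasterMoE code
# paths are picked up.
from fmoe.transformer import FMoETransformerMLP
```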
Bug fixes
- Transformer-XL examples.
- Compatibility with various PyTorch versions.
- Megatron-LM documents.
- GShardGate.
v0.3.0
FMoE core
- The former `mp_group` is renamed to `slice_group`, indicating that all workers in the group receive the same input batch and each processes a slice of it (see the sketch after this list). `mp_group` will be deprecated in the next release.
- ROCm is supported.
- `FMoELinear` is moved to a stand-alone file.
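A sketch of the renamed argument, assuming `fmoe.layers.FMoE` accepts `slice_group` where `mp_group` used to go; the process-group setup here is illustrative.

```python
import torch.distributed as dist
from fmoe.layers import FMoE

dist.init_process_group(backend="nccl")

# All workers in `slice_group` receive the same input batch and
# each processes a slice of it.
slice_group = dist.new_group(ranks=[0, 1])  # illustrative ranks

moe = FMoE(
    num_expert=4,
    d_model=1024,
    world_size=dist.get_world_size(),
    slice_group=slice_group,  # formerly `mp_group`, to be deprecated
)
```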
Grouped data parallel
- Support any group by its relative tag name.
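A sketch of the wrapper, assuming `fmoe.distributed.DistributedGroupedDataParallel` and the per-parameter `dp_comm` tag that FastMoE uses to pick the reduction group; the tag value and submodule here are hypothetical.

```python
from fmoe.distributed import DistributedGroupedDataParallel

model = build_model()  # hypothetical: a module containing FMoE layers

# Assumption: each parameter's `dp_comm` attribute names the process
# group in which its gradients are all-reduced. FastMoE tags expert
# parameters itself; custom tags can be attached the same way.
for p in model.head.parameters():  # hypothetical non-expert submodule
    p.dp_comm = "world"            # reduce across all workers

model = DistributedGroupedDataParallel(model)
```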
Load balancing
- A brand-new balancing strategy, SWIPE, contributed by the authors of a (currently unpublished) paper.
- A property `has_loss` is added to each gate to indicate whether its balance loss should be collected.
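A sketch of how a training loop might use the flag, assuming gates expose `get_loss()` as in fmoe's `BaseGate`.

```python
def collect_balance_loss(model):
    """Sum balance losses from every gate that reports one."""
    total = 0.0  # stays a float if no gate contributes a loss
    for module in model.modules():
        # `has_loss` marks gates whose balance loss should be collected;
        # `get_loss()` is assumed from fmoe's BaseGate.
        if getattr(module, "has_loss", False):
            total = total + module.get_loss()
    return total

# loss = task_loss + alpha * collect_balance_loss(model)  # hypothetical
```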
Megatron-LM support
- Experts are partitioned by tensor model parallelism in `mp_group`, instead of expert parallelism.
- Support arbitrary customized gates in `MegatronMLP` (see the sketch after this list).
- Move the patches to a stand-alone file.
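A sketch of plugging in a custom gate, assuming the `gate` keyword is forwarded to `MegatronMLP` via `fmoe.megatron.fmoefy`; the keyword names here are assumptions.

```python
from fmoe.gates import NaiveGate
from fmoe.megatron import fmoefy

class MyGate(NaiveGate):
    """Hypothetical gate: NaiveGate's routing with top-1 selection."""
    def __init__(self, d_model, num_expert, world_size, top_k=1):
        super().__init__(d_model, num_expert, world_size, top_k=top_k)

# `model` is a Megatron-LM model; the keyword names are assumptions.
model = fmoefy(model, num_experts=4, gate=MyGate)
```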
Tests
- Move util functions into `test_ddp.py`.
v0.2.1
Load balancing
- Fix gradient for balance loss.
Misc
- Fix typos.
- Update the benchmark interface.
- Remove redundant code to improve performance.
- Enable `USE_NCCL` by default.
- Compatibility with PyTorch `<1.8.0` and `>=1.8.0`.
Megatron adaption
- Patch for numerical correctness of gradient clipping.
- Support for pipeline parallelism.
v0.2.0
Load balancing
- A brand-new gate module with capacity-related utilities.
- GShard's and Switch Transformer's balance strategies are implemented as integrated gates (see the sketch after this list).
- Balance loss is enabled.
- Balance monitor is provided.
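A sketch of selecting one of the integrated gates, assuming the `gate` keyword of `FMoETransformerMLP`; `GShardGate` (and `SwitchGate`) live under `fmoe.gates`.

```python
from fmoe.gates import GShardGate
from fmoe.transformer import FMoETransformerMLP

# Route tokens with GShard's capacity-based balance strategy; the
# balance loss it produces can be read back from the gate module.
moe_layer = FMoETransformerMLP(
    num_expert=4,
    d_model=1024,
    d_hidden=4096,
    gate=GShardGate,  # or SwitchGate
)
```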
Checkpointing
- MoE models can be loaded and saved by fmoe's checkpointing module.
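The release note does not name the functions; assuming the module mirrors Megatron-LM's checkpoint interface, usage might look like the sketch below, with every name treated as an assumption.

```python
# Hypothetical names, assumed to mirror Megatron-LM's interface;
# consult fmoe's checkpointing module for the actual functions.
from fmoe.megatron import save_checkpoint, load_checkpoint

save_checkpoint(iteration, model, optimizer, lr_scheduler)
iteration = load_checkpoint(model, optimizer, lr_scheduler)
```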
Performance
- FP16 training performance is improved.
Misc
- The CUDA code directory is restructured.
- More tests are added.