- FasterMoE's smart schedule is updated with correct stream management and now runs faster.
- All unit tests are checked and now run correctly.
- Megatron-LM 3.2 supported.
- README is updated with some bugs fixed.
- A detailed document for process groups.
- PyTorch 2.0 supported.
- Megatron-LM 2.5 supported.
- A detailed installation guide, thanks to @santurini.
- Generalize FasterMoE's schedule to `n_expert > 1`, and more bug fixes.
- Synchronization reduction, thanks to @Fragile-azalea.
- The new performance-boosting features in the PPoPP'22 paper FasterMoE, detailed in the document (a sketch of how to enable them follows this list):
    - Expert Shadowing.
    - Smart Scheduling.
    - Topology-aware gate.
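These features are typically toggled through environment variables before the MoE layers are built. Below is a minimal sketch; the variable names `FMOE_FASTER_SCHEDULE_ENABLE` and `FMOE_FASTER_SHADOW_ENABLE` are assumptions drawn from the FasterMoE document and may differ in your installed version.

```python
import os

# Hedged sketch: switch on FasterMoE's smart scheduling and expert
# shadowing via environment variables before constructing the MoE
# layers. The variable names are assumptions; check the FasterMoE
# document shipped with your FastMoE version.
os.environ["FMOE_FASTER_SCHEDULE_ENABLE"] = "1"  # smart scheduling
os.environ["FMOE_FASTER_SHADOW_ENABLE"] = "1"    # expert shadowing

from fmoe import FMoETransformerMLP  # build MoE layers as usual afterwards
```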
- Transformer-XL examples.
- Compatibility with various PyTorch versions.
- Megatron-LM documents.
- GShardGate.
- Previous `mp_group` is renamed to `slice_group`, indicating that all workers in the group receive the same input batch and process a slice of the input. `mp_group` will be deprecated in our next release (a construction sketch follows below).
- ROCm supported.
- `FMoELinear` is moved to a stand-alone file.
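A minimal sketch of the rename, assuming the `FMoE` constructor accepts a `slice_group` process group; the keyword names below are assumptions and should be checked against the installed signature.

```python
import torch.distributed as dist
from fmoe import FMoE

# Hedged sketch: assumes torch.distributed has already been initialized.
# Every worker in the slice group receives the same input batch and
# processes a slice of it. Keyword names are assumptions.
slice_group = dist.new_group(ranks=list(range(dist.get_world_size())))

layer = FMoE(
    num_expert=4,                      # experts hosted on each worker
    d_model=1024,                      # hidden size
    world_size=dist.get_world_size(),  # number of expert-parallel workers
    slice_group=slice_group,           # replaces the deprecated `mp_group`
)
```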
- Support any group name by its relative tag name.
- A brand new balancing strategy - SWIPE. Contributed by authors of a (currently unpublished) paper.
- A property `has_loss` is added to each gate, in order to identify whether balance loss should be collected.
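A minimal sketch of how the flag can be used when accumulating the loss; the `get_loss()` accessor on gates is an assumption about the gate API and should be adapted to your FastMoE version.

```python
# Hedged sketch: sum balance losses from every gate that declares
# `has_loss`. The `get_loss()` accessor is an assumption.
def collect_balance_loss(model):
    total = 0.0
    for module in model.modules():
        if getattr(module, "has_loss", False):
            total = total + module.get_loss()
    return total

# Illustrative use inside a training step:
# loss = task_loss + 0.01 * collect_balance_loss(model)
```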
- Experts are partitioned by tensor model parallelism in `mp_group`, instead of expert parallelism.
- Support arbitrary customized gate in `MegatronMLP`.
- Move the patches to a stand-alone file.
- Move util functions into `test_ddp.py`.
- Fix gradient for balance loss.
- Typos.
- Update benchmark interface.
- Remove some redundant code for performance improvement.
- Enable `USE_NCCL` by default.
- Compatibility for PyTorch `<1.8.0` and `>=1.8.0`.
- Patch for numerical correctness of gradient clipping.
- Support for pipeline parallelism.
- A brand new gate module with capacity-related utilities.
- GShard's and Switch Transformer's balance strategies are implemented as integrated gates.
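A minimal sketch of selecting one of the integrated gates; the `fmoe.gates` import path and the `FMoETransformerMLP` keyword names are assumptions to be verified against the installed version.

```python
from fmoe import FMoETransformerMLP
from fmoe.gates import GShardGate, SwitchGate  # assumed import path

# Hedged sketch: route an MoE feed-forward block with GShard's balance
# strategy. Swap in SwitchGate for Switch Transformer's strategy.
ffn = FMoETransformerMLP(
    num_expert=8,
    d_model=768,
    d_hidden=3072,
    gate=GShardGate,
)
```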
- Balance loss is enabled.
- Balance monitor is provided.
- MoE models can be loaded and saved by fmoe's checkpointing module.
- FP16 training performance is improved.
- CUDA code directory is reconstructed.
- More tests are added.
- Remove dependency on the CUDA examples repository.
- Fix a bug related to PyTorch v1.8.0. FastMoE can now operate on multiple GPUs on multiple nodes with PyTorch v1.8.0.
- Fix tons of typos.
- Format the code.
- Broadcast data-parallel parameters before training.
- Initialize `FMoELinear` parameters with different seeds across model-parallel workers, even when the same random seed is used in Megatron.
- Use the proper communicator for model-parallel and data-parallel operations.
- Improve scripts.
- Logo and slack workspace link.
- Document in Chinese.
- Figures to explain how FastMoE works.
- An easy-to-use, model-injection-style interface for Megatron-LM.
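A minimal sketch of the injection-style interface, assuming a `fmoefy` helper under `fmoe.megatron`; the exact keyword argument for the number of experts has varied across releases, so the name below is an assumption.

```python
from fmoe.megatron import fmoefy  # assumed import path

# Hedged sketch: replace Megatron-LM's MLP blocks with MoE layers in
# place. `model` is a Megatron-LM model built beforehand, and the
# keyword name for the expert count is an assumption.
model = fmoefy(model, fmoe_num_experts=4)
```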
- Support data parallelism, model parallelism, and a hybrid of the two.
- Provide a new customized DDP module to synchronize within different communication groups.
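A minimal sketch of wrapping a model with the customized DDP, assuming the class is exposed as `fmoe.distributed.DistributedGroupedDataParallel`.

```python
from fmoe.distributed import DistributedGroupedDataParallel as fmoeDDP

# Hedged sketch: wrap an MoE model so expert parameters are synchronized
# within their own communication group while the remaining parameters
# follow the ordinary data-parallel group. The class path is an
# assumption; `model` is a model containing FMoE layers built beforehand.
model = fmoeDDP(model)
```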
- Support customized `nn.Module` as an expert.
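A minimal sketch of a custom expert, assuming `FMoE` accepts an `expert` callable that constructs one expert module from the hidden size; the keyword names are assumptions.

```python
import torch.nn as nn
from fmoe import FMoE

class MyExpert(nn.Module):
    """Toy expert: a small MLP; purely illustrative."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, *args):
        # Extra routing arguments (e.g. per-expert token counts), if the
        # framework passes any, are ignored in this sketch.
        return self.net(x)

# Passing a constructor callable via `expert=` is an assumption about
# the `FMoE` signature; check it against your installed version.
layer = FMoE(num_expert=4, d_model=512, expert=MyExpert)
```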
- Use PyTest.
- Setup PyLint.
- Installation and usage guide.
- Explanation of functions and code structure in code.
- A benchmark comparing FastMoE with the old PyTorch implementation.