What does this PR do?
Z-loss is an auxiliary loss term that improves training stability by penalizing large logits entering the softmax, where floating-point precision is reduced.
The idea was introduced in the ST-MoE paper (https://arxiv.org/abs/2202.08906) and was recently used in the OLMoE work (https://arxiv.org/abs/2409.02060).
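
For readers unfamiliar with the term, here is a minimal sketch of the router z-loss as described in the ST-MoE paper: the squared log-sum-exp of each token's router logits, averaged over tokens and scaled by a small coefficient. It is written in plain Python for illustration only and is not the implementation in this PR; the function name `router_z_loss` and the default `coeff=1e-3` are assumptions made for the example.

```python
import math


def router_z_loss(logits, coeff=1e-3):
    """Illustrative z-loss: coeff * mean over tokens of logsumexp(logits)^2.

    `logits` is a list of rows, one row of router logits per token.
    The name and the default coefficient are hypothetical, not from this PR.
    """
    total = 0.0
    for row in logits:
        # Subtract the row maximum so the log-sum-exp is numerically stable.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return coeff * total / len(logits)


# Small logits incur a small penalty; large logits incur a much larger one.
small = [[0.1, -0.2, 0.05, 0.0]]
large = [[10.0, -20.0, 5.0, 0.0]]
```

Because the penalty grows with the square of the log-sum-exp, it pushes the router toward logits of modest magnitude, where softmax inputs lose less to rounding.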
What issue(s) does this change relate to?
Alternative implementation of #133
Before submitting
test_moe_forward_backward_with_zloss and test_moe_forward_backward_with_zloss
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)