
Add zloss #133

Closed
wants to merge 7 commits into from

Conversation

Muennighoff

No description provided.

@Muennighoff Muennighoff mentioned this pull request Aug 3, 2024
@josejg josejg self-requested a review August 10, 2024 17:35
Two review comments on megablocks/layers/moe.py (outdated, resolved)
@dirkgr

dirkgr commented Sep 6, 2024

Anything we can do to get this merged? We want to polish the OLMoE release, and it would be nice if we could just take a dependency on your version instead of @Muennighoff's.

@mvpatel2000
Contributor

@Muennighoff can you resolve merge conflicts and make sure tests are passing?

CC: @josejg for review

@josejg
Collaborator

josejg commented Sep 8, 2024

  1. I believe the current implementation will consume extra memory even when the z-loss weight is zero. Even though the tensors are small, I always worry about keeping things around that interfere with the memory allocator. When we tested this, we kept the z-loss tensors separate from the LBL loss, using a separate global variable and helper methods so that the logits are not saved at all when z-loss is disabled (see the sketch after this list).

  2. I'm also not a fan of changing the Router and Experts API (return values and arguments), since that is not backwards compatible. In our implementation the z-loss was handled entirely in router.py, which avoided this.

  3. The same applies to changing the API of batched_load_balancing_loss, which now returns a tuple instead of a single value and is therefore not backwards compatible. In our implementation the separate function avoids this.

  4. Lastly, we did not observe any meaningful improvements when enabling router z-loss, so I would keep the default at 0 instead of 1e-3, both to skip the computation in the default case and for backwards compatibility. I see that Figure 11 in the OLMoE paper reports stability improvements, but I suspect the training settings are different.
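As a rough illustration of point 1, here is a minimal sketch of the separate-buffer approach; this is not the code in this repo or in the linked branch, and the helper names and signatures are assumptions:

```python
# Sketch only: keep z-loss state in its own global buffer with helper methods,
# and skip saving the router logits entirely when the z-loss weight is zero.
import torch

_ROUTER_ZLOSS_LOGITS = []  # hypothetical global buffer of saved router logits


def save_router_zloss_logits(logits, zloss_weight):
    """Called from the router; a no-op when z-loss is disabled."""
    if zloss_weight > 0:
        _ROUTER_ZLOSS_LOGITS.append(logits)


def batched_router_zloss(zloss_weight):
    """Router z-loss in the standard ST-MoE form: mean of logsumexp(logits)^2 over tokens."""
    assert _ROUTER_ZLOSS_LOGITS, 'no router logits were saved'
    losses = [torch.logsumexp(x, dim=-1).square().mean() for x in _ROUTER_ZLOSS_LOGITS]
    return zloss_weight * torch.stack(losses).mean()


def clear_router_zloss():
    """Drop the saved logits so nothing lingers in the memory allocator."""
    _ROUTER_ZLOSS_LOGITS.clear()
```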

Here's our implementation for reference: main...josejg/zloss. The main difference is that the modeling code is responsible for calling clear_router_zloss and batched_router_zloss in the corresponding places.
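Continuing the sketch above, a hypothetical example of how the modeling code would handle this; batched_load_balancing_loss and clear_load_balancing_loss are assumed to come from megablocks.layers.moe, and the config field is made up:

```python
# Sketch only: the training step clears stale router state before the forward
# pass and adds the z-loss term alongside the load-balancing loss afterwards.
from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss


def training_step(model, batch, args):
    clear_load_balancing_loss()
    clear_router_zloss()  # drop logits saved during the previous step

    loss = model(batch)  # assumed to return the LM loss
    loss = loss + batched_load_balancing_loss(args)
    if args.moe_zloss_weight > 0:  # hypothetical config field
        loss = loss + batched_router_zloss(args.moe_zloss_weight)
    return loss
```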

@mvpatel2000 What do you think?

@Muennighoff could you refactor the PR to address these comments? We can also merge our implementation, crediting you in the PR, if that's easier.

@Muennighoff
Author

oh just merge yours 👍

@josejg josejg mentioned this pull request Sep 9, 2024
@josejg
Collaborator

josejg commented Sep 9, 2024

Sounds good, tracking it here: #151

@mihir-db
Collaborator

mihir-db commented Sep 9, 2024

Closing in favor of #151.
@Muennighoff @dirkgr please feel free to comment on the other PR if you need any related changes, or open a GitHub issue :)

@mihir-db mihir-db closed this Sep 9, 2024