Fix: Update wrong typing on the function get_local_ranks #194

Open · wants to merge 1 commit into main

Conversation

morgangiraud

What does this PR do?

  • Fixes the return type of the function get_local_ranks
  • Updates the associated test so that it no longer gets out of sync with the underlying code

Why?

np.where(self.world_rank_matrix == world_rank) returns a single "data point" of the following matrix:

ranks = np.arange(0, world_size).reshape(
            (
                self.expert_parallel_size,
                self.pipeline_parallel_size,
                self.data_parallel_size,
                self.tensor_parallel_size,
            )
        )
self.world_rank_matrix: np.ndarray = ranks

Since this is a 4-dimensional array, np.where returns a 4-tuple of np.ndarray objects, one index array per axis, each holding a single element here (every world rank appears exactly once in the matrix). So the return type is a tuple of arrays, not a tuple of plain ints.
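A runnable sketch of this behavior (the parallelism sizes below are made up for illustration; the real values come from the process-group configuration):

```python
import numpy as np

# Assumed toy sizes, giving world_size = 2 * 3 * 2 * 2 = 24.
expert_parallel_size, pipeline_parallel_size = 2, 3
data_parallel_size, tensor_parallel_size = 2, 2

world_size = (expert_parallel_size * pipeline_parallel_size
              * data_parallel_size * tensor_parallel_size)

world_rank_matrix = np.arange(0, world_size).reshape(
    (expert_parallel_size, pipeline_parallel_size,
     data_parallel_size, tensor_parallel_size)
)

# np.where on a 4-D array returns a 4-tuple of index arrays, one per axis.
# Each array has a single element because every rank appears exactly once.
local_ranks = np.where(world_rank_matrix == 5)
print(type(local_ranks), len(local_ranks))  # <class 'tuple'> 4
print([a.shape for a in local_ranks])       # [(1,), (1,), (1,), (1,)]
print([int(a[0]) for a in local_ranks])     # [0, 1, 0, 1]  (ep, pp, dp, tp)
```

This is why an annotation like Tuple[int, int, int, int] is wrong for the result of np.where: the correct shape is a tuple of four one-element np.ndarray values.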

@morgangiraud
Author

Btw, quick question for @thomwolf (Apologies if I'm mistaken or if this seems like nitpicking. I noticed you pushed this code on the first commit and I appreciate good naming in general 🙂):

Shouldn't get_local_ranks be renamed to get_group_ranks?

I'm asking because it retrieves the rank across the different pp/tp/dp/ep process groups, not the rank inside each node (which is what the LOCAL_RANK environment variable refers to, used to assign each process to a device within a node).
