Fix: Update wrong typing on the function get_local_ranks #194

Open · wants to merge 1 commit into main

Conversation

morgangiraud

What does this PR do?

  • Fixes the return type of the function get_local_ranks
  • Updates the associated test so that it no longer gets out of sync with the underlying code

Why?

np.where(self.world_rank_matrix == world_rank) returns a single "data point" of the following matrix:

ranks = np.arange(0, world_size).reshape(
            (
                self.expert_parallel_size,
                self.pipeline_parallel_size,
                self.data_parallel_size,
                self.tensor_parallel_size,
            )
        )
self.world_rank_matrix: np.ndarray = ranks

Since this is a 4-dimensional array, np.where returns a 4-tuple of np.ndarray objects, one index array per axis, each holding a single element here (every world rank appears exactly once in the matrix). So the return type is a tuple of arrays, not a tuple of plain ints.
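A runnable sketch of this behavior (the parallelism sizes below are made up for illustration; the real values come from the process-group configuration):

```python
import numpy as np

# Assumed toy sizes, giving world_size = 2 * 3 * 2 * 2 = 24.
expert_parallel_size, pipeline_parallel_size = 2, 3
data_parallel_size, tensor_parallel_size = 2, 2

world_size = (expert_parallel_size * pipeline_parallel_size
              * data_parallel_size * tensor_parallel_size)

world_rank_matrix = np.arange(0, world_size).reshape(
    (expert_parallel_size, pipeline_parallel_size,
     data_parallel_size, tensor_parallel_size)
)

# np.where on a 4-D array returns a 4-tuple of index arrays, one per axis.
# Each array has a single element because every rank appears exactly once.
local_ranks = np.where(world_rank_matrix == 5)
print(type(local_ranks), len(local_ranks))  # <class 'tuple'> 4
print([a.shape for a in local_ranks])       # [(1,), (1,), (1,), (1,)]
print([int(a[0]) for a in local_ranks])     # [0, 1, 0, 1]  (ep, pp, dp, tp)
```

This is why an annotation like Tuple[int, int, int, int] is wrong for the result of np.where: the correct shape is a tuple of four one-element np.ndarray values.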

@morgangiraud
Author

Btw, quick question for @thomwolf (Apologies if I'm mistaken or if this seems like nitpicking. I noticed you pushed this code on the first commit and I appreciate good naming in general 🙂):

Shouldn't get_local_ranks be renamed to get_group_ranks?

I'm asking because it retrieves the rank across the different pp/tp/dp/ep process groups, not the rank inside each node (which is what the LOCAL_RANK environment variable refers to, used to assign each process to a device within a node).
