Distributed Reduction #18206

wschin · 2023-10-31T22:59:37Z

This PR implements distributed reduction for Llama 2. This version does not handle cases that require re-sharding because we haven't seen any use cases for them.

Intuitive examples:

- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1]
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and device_mesh=[0,1]
- [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and device_mesh=[0,1]
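
The support rule behind these examples can be sketched in a few lines of Python. This is only an illustration, not the op's implementation: `parse_spec`, `reduce_is_supported`, and `reduced_shape` are hypothetical helpers. The sketch assumes that in a spec such as RRS[0] each `R` marks a replicated axis and `S[i]` marks an axis sharded along device-mesh axis `i`, that axes are non-negative, and that reduced axes are kept with size 1, as in the shapes above.

```python
import re


def parse_spec(spec):
    """Return one entry per tensor axis: None for R, or the mesh-axis index for S[i]."""
    tokens = re.findall(r"R|S\[\d+\]", spec)
    if "".join(tokens) != spec:
        raise ValueError(f"unrecognized sharding spec: {spec!r}")
    # "S[0]"[2:-1] is "0", the mesh axis the tensor axis is sharded along.
    return [None if tok == "R" else int(tok[2:-1]) for tok in tokens]


def reduce_is_supported(spec, axes):
    """A reduction is handled only if none of the reduced axes is sharded."""
    sharded_along = parse_spec(spec)
    return all(sharded_along[axis] is None for axis in axes)


def reduced_shape(shape, axes):
    """Logical output shape when the reduced axes are kept with size 1."""
    return [1 if i in axes else dim for i, dim in enumerate(shape)]


# The three examples from the description (assumed reading of the spec notation).
assert reduce_is_supported("RRS[0]", [0]) and reduced_shape([2, 4, 6], [0]) == [1, 4, 6]
assert reduce_is_supported("RRS[0]", [1]) and reduced_shape([2, 4, 6], [1]) == [2, 1, 6]
assert not reduce_is_supported("RRS[0]", [2])
```

Under this reading, only the third case touches the sharded axis, which is why it is the one marked as not supported.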

Algorithm:
When the reduced axes are not sharded, each device can run the reduction directly on its local shard. The output sharding spec is then identical to the input sharding spec. We currently throw when the input and output sharding specs differ.
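
A small pure-Python simulation may help show why the local reduction is enough. It is a sketch under stated assumptions rather than the CUDA kernel: tensors are nested lists, the two "devices" are list entries, the reduction is a sum over axis 0, and every function name here is hypothetical. The `ValueError` stands in for the throw described above.

```python
def make_tensor(shape):
    """Build a [d0, d1, d2] nested-list tensor filled with 0, 1, 2, ..."""
    d0, d1, d2 = shape
    return [[[(i * d1 + j) * d2 + k for k in range(d2)]
             for j in range(d1)] for i in range(d0)]


def shard_last_axis(tensor, num_devices):
    """Split axis 2 into equal contiguous shards, one per device (spec RRS[0])."""
    width = len(tensor[0][0]) // num_devices
    return [[[row[r * width:(r + 1) * width] for row in mat] for mat in tensor]
            for r in range(num_devices)]


def reduce_sum_axis0(tensor):
    """Sum over axis 0, keeping it with size 1: [d0, d1, d2] -> [1, d1, d2]."""
    d1, d2 = len(tensor[0]), len(tensor[0][0])
    return [[[sum(mat[j][k] for mat in tensor) for k in range(d2)]
             for j in range(d1)]]


def concat_last_axis(shards):
    """Reassemble the logical tensor from per-device shards along axis 2."""
    d0, d1 = len(shards[0]), len(shards[0][0])
    return [[[v for shard in shards for v in shard[i][j]]
             for j in range(d1)] for i in range(d0)]


def distributed_reduce_sum_axis0(shards, spec_in, spec_out):
    """Each device reduces its own shard; no communication is needed."""
    if spec_in != spec_out:
        # Stand-in for the throw in the PR: re-sharding is not handled.
        raise ValueError("input and output sharding specs differ")
    return [reduce_sum_axis0(shard) for shard in shards]


full = make_tensor([2, 4, 6])
shards = shard_last_axis(full, 2)  # two local [2, 4, 3] shards
local = distributed_reduce_sum_axis0(shards, "RRS[0]", "RRS[0]")
# Gathering the local results matches reducing the unsharded tensor.
assert concat_last_axis(local) == reduce_sum_axis0(full)
```

Because axis 0 is replicated on every device, each shard already holds all the values it needs, so the output keeps the RRS[0] layout of the input.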

Review guideline:

- Check 97b8d2f for the new op's schema and how the new op is registered.
- Read the tests in 2450f93 to get familiar with the behavior of these ops.
- Check the implementation details in 753d9af.



wschin added 3 commits October 31, 2023 16:02

Skeleton of distributed reduction

97b8d2f

Fix schema Drop int64_t case

Some tests

2450f93

More tests

Implementation of DistributedReduce

753d9af

wschin force-pushed the wechi/d-reduce branch from d520aa3 to 753d9af October 31, 2023 23:03

wschin marked this pull request as ready for review October 31, 2023 23:16

wschin requested a review from souptc October 31, 2023 23:17

wschin added 2 commits October 31, 2023 16:18

Fix build

e92b8d0

Fix template

8553c1e

wschin force-pushed the wechi/d-reduce branch from 4c53973 to 8553c1e November 1, 2023 02:23

github-advanced-security bot found potential problems Nov 1, 2023

onnxruntime/contrib_ops/cuda/collective/distributed_reduce.cc Fixed

souptc previously approved these changes Nov 1, 2023

lint

481c360

wschin dismissed souptc’s stale review via 481c360 November 1, 2023 07:10

souptc approved these changes Nov 1, 2023

wschin merged commit 9e8ad39 into main Nov 1, 2023
89 of 91 checks passed

wschin deleted the wechi/d-reduce branch November 1, 2023 15:49

axelman03 mentioned this pull request Nov 1, 2023

Issue when converting Whisper using --collect_cross_qk on CPU #18216

Closed
