Capture short kernel sequences to graph #4318
Conversation
@inkcherry, can you please give more description of this PR?
@tjruwase added : )
Sorry for not replying in time due to regional holidays~

```python
import torch

b = None

def func(a):
    global b
    b = a
    for _ in range(10):
        b = b + 1

s = torch.cuda.Stream()
a = torch.full((1000,), 1, device="cuda")
static_mem = a.data_ptr()
with torch.cuda.stream(s):
    g = torch.cuda.CUDAGraph()
    torch.cuda.empty_cache()
    with torch.cuda.graph(g):
        func(a)
torch.cuda.current_stream().wait_stream(s)

# 1. This may crash because static_mem has been freed, or, if another
#    variable reallocates static_mem, it will lead to incorrect behavior.
# a = None
# torch.cuda.empty_cache()
# ...
# g.replay()
# print(b.sum().item())

# 2. This will not crash but could produce incorrect results, because the
#    captured address now belongs to a different tensor.
# a = torch.full((1000,), 2, device="cuda")
# g.replay()
# print(b.sum().item())

# 3. This is correct: we keep the captured address fixed and only update
#    the memory contents in place.
# a.copy_(torch.full((1000,), 2, device="cuda"))
# g.replay()
# print(b.sum().item())
```

So if the address changes, the replay will lead to unexpected behavior. To verify that the addresses stay fixed, one can record them on every call, e.g. `i-th_call_update_hp_grads_mem_list += f"{hp_grad.data_ptr()},{lp.grad.data_ptr()}"`, and check that each call sees the same addresses.
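To illustrate that check, here is a hypothetical sketch; the `hp_grad`/`lp` names stand in for the optimizer tensors mentioned above and are assumptions, not code from this PR:

```python
import torch

_recorded_ptrs = None

def check_fixed_addresses(hp_grads, lp_params):
    """Record the data pointers seen on the first call and assert that
    every later call sees exactly the same addresses."""
    global _recorded_ptrs
    ptrs = [(hp_grad.data_ptr(), lp.grad.data_ptr())
            for hp_grad, lp in zip(hp_grads, lp_params)]
    if _recorded_ptrs is None:
        _recorded_ptrs = ptrs
    else:
        # If any address differs, replaying a captured graph would read or
        # write memory that no longer belongs to these tensors.
        assert ptrs == _recorded_ptrs, "tensor addresses changed between calls"
```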
accelerator interface for graph API

accelerator/cuda_accelerator.py (Outdated)

```python
def create_graph(self):
    return torch.cuda.CUDAGraph()

def capture_to_graph(self, graph):
```
please change the interface to add parameters such as (graph, pool=None, stream=None) to align with https://pytorch.org/docs/master/generated/torch.cuda.graph.html#torch.cuda.graph
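For illustration, a minimal sketch of what the extended interface might look like, assuming `capture_to_graph` simply forwards to `torch.cuda.graph`; the `replay_graph` helper here is an assumption, not necessarily part of this PR:

```python
import torch

class CUDA_Accelerator:
    def create_graph(self):
        return torch.cuda.CUDAGraph()

    def capture_to_graph(self, graph, pool=None, stream=None):
        # torch.cuda.graph is a context manager; work issued inside the
        # `with` block is recorded into `graph` instead of being executed.
        return torch.cuda.graph(graph, pool=pool, stream=stream)

    def replay_graph(self, graph):
        # Launch the recorded kernels again with the captured addresses.
        graph.replay()
```

A caller would then write `with get_accelerator().capture_to_graph(g): ...` and later `get_accelerator().replay_graph(g)`, matching the semantics of `torch.cuda.graph`.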
This API at https://pytorch.org/docs/master/generated/torch.cuda.CUDAGraph.html#torch.cuda.CUDAGraph.pool is also important: it is used to share the memory pool between graphs. We can add it in this PR or in a future PR when it is really required.
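For reference, a minimal sketch (not part of this PR) of how `CUDAGraph.pool()` lets two captures share one memory pool, following the pattern in the PyTorch docs; warm-up before capture is omitted for brevity:

```python
import torch

x = torch.zeros(1000, device="cuda")

g1 = torch.cuda.CUDAGraph()
g2 = torch.cuda.CUDAGraph()

with torch.cuda.graph(g1):
    y1 = x + 1

# Reuse g1's private memory pool for the second capture, so both graphs
# allocate from the same pool instead of reserving separate memory.
with torch.cuda.graph(g2, pool=g1.pool()):
    y2 = y1 * 2

g1.replay()
g2.replay()
print(y2.sum().item())
```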
Thank you all for your reviews and suggestions.
It seems that the failure is not caused by this modification, and the test passes locally. Could you please retrigger the check? Thank you very much! @tjruwase
@inkcherry, can you please check the formatting issue?
Thanks for the reminder. I have now fixed the formatting. @tjruwase
It seems that this CI workflow is a bit unlucky. Two of the commits passed the CI check, while the others seem to have encountered failures that were not caused by this PR.
@inkcherry, thanks, it is no trouble at all. We appreciate your great contributions!
@tjruwase The CI has all passed. Just a reminder in case you missed it.
**Motivation:**

1. There is a series of cases where short kernel sequences are launched and executed serially (no dynamic shapes), with the launch overhead being much higher than the execution overhead. We can use a graph to solve this problem. Compared to `multi-tensor-apply`, using a graph is more concise and only requires PyTorch as a dependency.
2. Some device software stacks also support lazy-mode PyTorch, enabling full utilization of the compiler to perform graph optimization. However, in lazy mode the operation accumulation time (host time) can become significantly higher than the device time in such scenarios, and the devices are usually not well utilized. By using the same API as the CUDA graph (after adding it to the accelerator, cc @delock), this issue can also be resolved.

**Change:**

We modified three functions.

For `update_hp_grads`, we executed the CPU and GPU operations separately because the graph is unable to record the execution of CPU operations. Additionally, the data inputs required by the graph must not have their addresses modified, or the address modification must be captured by the capture operation (in this case, set `replay_first_step` to `True`). Therefore, we changed `grad=None` to `grad.zero_()`. Similarly, we placed some inputs that require fixed addresses in the `graph_cache`.

For `clip_tensors_by_global_norm`, `clip_coef` is a scalar with a non-fixed value, so it needs to be moved to the GPU when using a graph.

For `total_norm = sum([t.data.float().norm(norm_type).item() ** norm_type for t in input_tensors])`, the synchronous operation `item()` is also not supported by graphs, so we put the `sum` and the `** norm_type` directly on the GPU.

Other similar scenarios can also use this `graph_process()`, or a slightly modified version of `graph_process()` (a simplified sketch follows at the end of this description). You can check out [4abab21](microsoft@4abab21) and set it to `True` here to do some benchmarking: microsoft@4abab21#diff-f8f0b3feb55b0374615405e542c1c3e0f017982b177c46c562bf688532ac935cR42

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
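As a rough illustration of the pattern described above, here is a simplified sketch of a capture-and-replay helper; the actual `graph_process()` in this PR may differ in names and details (e.g. how `replay_first_step` is handled):

```python
import torch

graph_cache = {}

def graph_process(key, func, *args):
    """Capture func(*args) into a CUDA graph on the first call and replay
    it on every later call. All tensors in args must keep the device
    addresses they had at capture time."""
    if key not in graph_cache:
        # Warm up on a side stream before capturing.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            func(*args)
        torch.cuda.current_stream().wait_stream(s)

        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            func(*args)
        graph_cache[key] = g
    # Replay the recorded kernels (capture alone does not execute them).
    graph_cache[key].replay()
```

For example, the GPU-side part of the gradient update could be invoked as `graph_process('update_hp_grads', gpu_side_update, hp_grads, lp_params)` (hypothetical names), keeping the CPU-side bookkeeping outside the captured function.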