Capture short kernel sequences to graph #4318
Conversation
@inkcherry, can you please give more description of this PR?
@tjruwase added : )
Sorry for not replying in time due to regional holidays~

```python
import torch

b = None

def func(a):
    global b
    b = a
    for _ in range(10):
        b = b + 1

s = torch.cuda.Stream()
a = torch.full((1000,), 1, device="cuda")
static_mem = a.data_ptr()
with torch.cuda.stream(s):
    g = torch.cuda.CUDAGraph()
    torch.cuda.empty_cache()
    with torch.cuda.graph(g):
        func(a)
torch.cuda.current_stream().wait_stream(s)

# 1. This may crash because static_mem has been freed, or, if another
#    variable reallocates static_mem, it will lead to incorrect behavior.
# a = None
# torch.cuda.empty_cache()
# ...
# g.replay()
# print(b.sum().item())

# 2. This will not crash but could produce incorrect results, because the
#    captured address now belongs to a different tensor.
# a = torch.full((1000,), 2, device="cuda")
# g.replay()
# print(b.sum().item())

# 3. This is correct: we keep the captured address fixed and only update
#    the memory contents in place.
# a.copy_(torch.full((1000,), 2, device="cuda"))
# g.replay()
# print(b.sum().item())
```

So if the address changes, the replay will lead to unexpected behavior. To verify that the addresses stay fixed, one can record them on every call, e.g. `i-th_call_update_hp_grads_mem_list += f"{hp_grad.data_ptr()},{lp.grad.data_ptr()}"`, and check that each call sees the same addresses.
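To illustrate that check, here is a hypothetical sketch; the `hp_grad`/`lp` names stand in for the optimizer tensors mentioned above and are assumptions, not code from this PR:

```python
import torch

_recorded_ptrs = None

def check_fixed_addresses(hp_grads, lp_params):
    """Record the data pointers seen on the first call and assert that
    every later call sees exactly the same addresses."""
    global _recorded_ptrs
    ptrs = [(hp_grad.data_ptr(), lp.grad.data_ptr())
            for hp_grad, lp in zip(hp_grads, lp_params)]
    if _recorded_ptrs is None:
        _recorded_ptrs = ptrs
    else:
        # If any address differs, replaying a captured graph would read or
        # write memory that no longer belongs to these tensors.
        assert ptrs == _recorded_ptrs, "tensor addresses changed between calls"
```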
accelerator interface for graph API

accelerator/cuda_accelerator.py (Outdated)

```python
def create_graph(self):
    return torch.cuda.CUDAGraph()

def capture_to_graph(self, graph):
```
please change the interface to add parameters such as (graph, pool=None, stream=None) to align with https://pytorch.org/docs/master/generated/torch.cuda.graph.html#torch.cuda.graph
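For illustration, a minimal sketch of what the extended interface might look like, assuming `capture_to_graph` simply forwards to `torch.cuda.graph`; the `replay_graph` helper here is an assumption, not necessarily part of this PR:

```python
import torch

class CUDA_Accelerator:
    def create_graph(self):
        return torch.cuda.CUDAGraph()

    def capture_to_graph(self, graph, pool=None, stream=None):
        # torch.cuda.graph is a context manager; work issued inside the
        # `with` block is recorded into `graph` instead of being executed.
        return torch.cuda.graph(graph, pool=pool, stream=stream)

    def replay_graph(self, graph):
        # Launch the recorded kernels again with the captured addresses.
        graph.replay()
```

A caller would then write `with get_accelerator().capture_to_graph(g): ...` and later `get_accelerator().replay_graph(g)`, matching the semantics of `torch.cuda.graph`.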
This API at https://pytorch.org/docs/master/generated/torch.cuda.CUDAGraph.html#torch.cuda.CUDAGraph.pool is also important: it is used to share the memory pool between graphs. We can add it in this PR or in a future PR when it is really required.
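For reference, a minimal sketch (not part of this PR) of how `CUDAGraph.pool()` lets two captures share one memory pool, following the pattern in the PyTorch docs; warm-up before capture is omitted for brevity:

```python
import torch

x = torch.zeros(1000, device="cuda")

g1 = torch.cuda.CUDAGraph()
g2 = torch.cuda.CUDAGraph()

with torch.cuda.graph(g1):
    y1 = x + 1

# Reuse g1's private memory pool for the second capture, so both graphs
# allocate from the same pool instead of reserving separate memory.
with torch.cuda.graph(g2, pool=g1.pool()):
    y2 = y1 * 2

g1.replay()
g2.replay()
print(y2.sum().item())
```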
Thank you all for your reviews and suggestions.
It seems that the failure is not caused by this modification, and the test passes locally. Could you please retrigger the check? Thank you very much! @tjruwase
@inkcherry, can you please check the formatting issue?
Thanks for the reminder. I have now fixed the formatting. @tjruwase
It seems that this CI workflow is a bit unlucky. Two of the commits passed the CI check, while the others seem to have encountered failures that were not caused by this PR.
@inkcherry, thanks, it is no trouble at all. We appreciate your great contributions!
@tjruwase The CI has all passed. Just a reminder in case you missed it.
**Motivation:**

1. There is a series of cases where short kernel sequences are launched and executed serially (no dynamic shapes), with the launch overhead being much higher than the execution overhead. We can use a graph to solve this problem. Compared to `multi-tensor-apply`, using a graph is more concise and only requires PyTorch as a dependency.
2. Some device software stacks also support lazy-mode PyTorch, enabling full utilization of the compiler to perform graph optimization. However, in lazy mode the operation accumulation time (host time) can become significantly higher than the device time in such scenarios, and the devices are usually not well utilized. By using the same API as the CUDA graph (after adding it to the accelerator, cc @delock), this issue can also be resolved.

**Change:**

We modified three functions.

For `update_hp_grads`, we executed the CPU and GPU operations separately because the graph is unable to record the execution of CPU operations. Additionally, the data inputs required by the graph must not have their addresses modified, or the address modification must be captured by the capture operation (in this case, set `replay_first_step` to `True`). Therefore, we changed `grad=None` to `grad.zero_()`. Similarly, we placed some inputs that require fixed addresses in the `graph_cache`.

For `clip_tensors_by_global_norm`, `clip_coef` is a scalar with a non-fixed value, so it needs to be moved to the GPU when using a graph.

For `total_norm = sum([t.data.float().norm(norm_type).item() ** norm_type for t in input_tensors])`, the synchronous operation `item()` is also not supported by graphs, so we put the `sum` and the `** norm_type` directly on the GPU.

Other similar scenarios can also use this `graph_process()`, or a slightly modified version of `graph_process()` (a simplified sketch follows at the end of this description). You can check out [4abab21](microsoft@4abab21) and set it to `True` here to do some benchmarking: microsoft@4abab21#diff-f8f0b3feb55b0374615405e542c1c3e0f017982b177c46c562bf688532ac935cR42

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
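As a rough illustration of the pattern described above, here is a simplified sketch of a capture-and-replay helper; the actual `graph_process()` in this PR may differ in names and details (e.g. how `replay_first_step` is handled):

```python
import torch

graph_cache = {}

def graph_process(key, func, *args):
    """Capture func(*args) into a CUDA graph on the first call and replay
    it on every later call. All tensors in args must keep the device
    addresses they had at capture time."""
    if key not in graph_cache:
        # Warm up on a side stream before capturing.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            func(*args)
        torch.cuda.current_stream().wait_stream(s)

        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            func(*args)
        graph_cache[key] = g
    # Replay the recorded kernels (capture alone does not execute them).
    graph_cache[key].replay()
```

For example, the GPU-side part of the gradient update could be invoked as `graph_process('update_hp_grads', gpu_side_update, hp_grads, lp_params)` (hypothetical names), keeping the CPU-side bookkeeping outside the captured function.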