update docs to include multi-cudagraph support (#19818)
### Description

docs for #19636
wangyems authored Mar 7, 2024
1 parent 3e1e864 commit 705e0e7
Showing 1 changed file with 25 additions and 10 deletions.
35 changes: 25 additions & 10 deletions docs/execution-providers/CUDA-ExecutionProvider.md
@@ -338,8 +338,11 @@ shown below) if [N, C, 1, D] is preferred.

While using the CUDA EP, ORT supports the usage
of [CUDA Graphs](https://developer.nvidia.com/blog/cuda-10-features-revealed/) to remove CPU overhead associated with
launching CUDA kernels sequentially. To enable the usage of CUDA Graphs, use the provider options as shown in the samples
below. ORT supports capturing multiple CUDA graphs: pass a user-specified `gpu_graph_id` through the run options to
select which graph a run captures or replays. `gpu_graph_id` is optional when the session uses a single CUDA graph; if
not set, it defaults to 0. If `gpu_graph_id` is set to -1, CUDA graph capture/replay is disabled for that run.
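The `gpu_graph_id` rules above can be summarized in a tiny pure-Python sketch (illustrative only — `resolve_gpu_graph_id` is a hypothetical helper, not part of the ORT API):

```python
# Illustrative helper mirroring the documented gpu_graph_id semantics:
# unset -> graph 0 is used; -1 -> capture/replay disabled for that run.
def resolve_gpu_graph_id(run_config):
    """Return (graph_id, capture_enabled) for a run's config entries."""
    graph_id = int(run_config.get("gpu_graph_id", "0"))  # default is 0
    return graph_id, graph_id != -1

assert resolve_gpu_graph_id({}) == (0, True)                         # unset -> graph 0
assert resolve_gpu_graph_id({"gpu_graph_id": "2"}) == (2, True)      # selects graph 2
assert resolve_gpu_graph_id({"gpu_graph_id": "-1"}) == (-1, False)   # disabled this run
```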

Currently, there are some constraints on using the CUDA Graphs feature:

* Models with control-flow ops (i.e. `If`, `Loop` and `Scan` ops) are not supported.
@@ -348,8 +351,11 @@ Currently, there are some constraints with regards to using the CUDA Graphs feat

* The inputs and outputs of the model must be tensors.

* Shapes and addresses of inputs/outputs cannot change across inference calls for the same graph annotation id. Input
tensors for replay must be copied to the addresses of the input tensors used during graph capture.

* In multi-graph capture mode, captured graphs remain in memory for the session's lifetime; deleting a captured graph
is not currently supported.

* By design, [CUDA Graphs](https://developer.nvidia.com/blog/cuda-10-features-revealed/) reads from and writes to
the same CUDA virtual memory addresses during the graph replaying step as it does during the graph capturing step.
@@ -385,22 +391,27 @@ captured and cached in the first `Run()`.
session = onnxrt.InferenceSession("matmul_2.onnx", providers=providers)
io_binding = session.io_binding()

# Pass gpu_graph_id to RunOptions through RunConfigs
ro = onnxrt.RunOptions()
# gpu_graph_id is optional if the session uses only one cuda graph
ro.add_run_config_entry("gpu_graph_id", "1")

# Bind the input and output
io_binding.bind_ortvalue_input('X', x_ortvalue)
io_binding.bind_ortvalue_output('Y', y_ortvalue)

# One regular run for the necessary memory allocation and cuda graph capturing
session.run_with_iobinding(io_binding, ro)
expected_y = np.array([[5.0], [11.0], [17.0]], dtype=np.float32)
np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

# After capturing, CUDA graph replay happens from this Run onwards
session.run_with_iobinding(io_binding, ro)
np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

# Update input and then replay CUDA graph with the updated input
x_ortvalue.update_inplace(np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]], dtype=np.float32))
session.run_with_iobinding(io_binding, ro)
```
* C/C++
```c++
@@ -429,6 +440,10 @@ captured and cached in the first `Run()`.
Ort::SessionOptions session_options;
api.SessionOptionsAppendExecutionProvider_CUDA_V2(static_cast<OrtSessionOptions*>(session_options), rel_cuda_options.get());

// Pass gpu_graph_id to RunOptions through RunConfigs
Ort::RunOptions run_option;
// gpu_graph_id is optional if the session uses only one cuda graph
run_option.AddConfigEntry("gpu_graph_id", "1");

// Create IO bound inputs and outputs.
Ort::Session session(*ort_env, ORT_TSTR("matmul_2.onnx"), session_options);
@@ -459,15 +474,15 @@ captured and cached in the first `Run()`.
binding.BindOutput("Y", bound_y);

// One regular run for necessary memory allocation and graph capturing
session.Run(run_option, binding);

// After capturing, CUDA graph replay happens from this Run onwards
session.Run(run_option, binding);

// Update input and then replay CUDA graph with the updated input
x_values = {10.0f, 20.0f, 30.0f, 40.0f, 50.0f, 60.0f};
cudaMemcpy(input_data.get(), x_values.data(), sizeof(float) * x_values.size(), cudaMemcpyHostToDevice);
session.Run(run_option, binding);
```

* C# (future)
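As a mental model of the multi-graph behavior shown in the samples above, the following pure-Python sketch (illustrative only — `GraphCache` is a hypothetical stand-in, not ORT code) mimics how the first `Run()` seen for each `gpu_graph_id` captures a graph, later runs with the same id replay it, and `-1` bypasses capture/replay entirely:

```python
# Illustrative simulation of ORT's multi-graph capture/replay lifecycle.
class GraphCache:
    def __init__(self):
        # Captured graphs persist for the session's lifetime (no deletion).
        self.captured = {}

    def run(self, gpu_graph_id=0):
        if gpu_graph_id == -1:
            return "eager"                      # capture/replay disabled
        if gpu_graph_id not in self.captured:
            self.captured[gpu_graph_id] = "graph-%d" % gpu_graph_id
            return "captured"                   # regular run + graph capture
        return "replayed"                       # graph replay

cache = GraphCache()
assert cache.run(1) == "captured"    # first run with id 1 captures
assert cache.run(1) == "replayed"    # subsequent runs replay
assert cache.run(2) == "captured"    # a second id captures its own graph
assert cache.run(-1) == "eager"      # -1 skips capture/replay
```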
