update docs to include multi-cudagraph support #19818

Merged
merged 2 commits into from
Mar 7, 2024
35 changes: 25 additions & 10 deletions docs/execution-providers/CUDA-ExecutionProvider.md
@@ -338,8 +338,11 @@ shown below) if [N, C, 1, D] is preferred.

While using the CUDA EP, ORT supports the usage
of [CUDA Graphs](https://developer.nvidia.com/blog/cuda-10-features-revealed/) to remove CPU overhead associated with
launching CUDA kernels sequentially. To enable the usage of CUDA Graphs, use the provider options as shown in the samples
below. ORT supports multi-graph capture by passing a user-specified `gpu_graph_id` to the run options. `gpu_graph_id` is
optional when the session uses a single CUDA graph; if not set, it defaults to 0. If `gpu_graph_id` is set to -1, CUDA
graph capture/replay is disabled for that run.
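
The `gpu_graph_id` rules above can be sketched as a small pure-Python model. This is an illustrative model of the documented semantics only, not ORT internals; the `GraphDispatcher` class and its return strings are hypothetical:

```python
# Illustrative model (not ORT internals) of the gpu_graph_id rules: default
# id 0, one captured graph per id, and -1 bypassing capture/replay entirely.

class GraphDispatcher:
    DISABLED = -1

    def __init__(self):
        self._captured = {}  # gpu_graph_id -> captured-graph placeholder

    def run(self, gpu_graph_id=0):
        """Return which execution path a run with this id would take."""
        if gpu_graph_id == self.DISABLED:
            # gpu_graph_id == -1: capture/replay disabled for this run.
            return "regular run (capture/replay disabled)"
        if gpu_graph_id not in self._captured:
            # First run with a new id: regular run that also captures a graph.
            self._captured[gpu_graph_id] = object()
            return f"capture graph {gpu_graph_id}"
        # Subsequent runs with a known id replay the captured graph.
        return f"replay graph {gpu_graph_id}"
```

For example, the first call with a given id captures, later calls with that id replay, and -1 always falls back to a regular run.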

Currently, there are some constraints with regards to using the CUDA Graphs feature:

* Models with control-flow ops (i.e. `If`, `Loop` and `Scan` ops) are not supported.
@@ -348,8 +351,11 @@

* The input/output types of models need to be tensors.

* Shapes and addresses of inputs/outputs cannot change across inference calls for the same graph annotation id. Input
tensors for replay must be copied to the addresses of the input tensors used during graph capture.

* In multi-graph capture mode, captured graphs persist for the session's lifetime; deleting a captured graph is not
currently supported.

* By design, [CUDA Graphs](https://developer.nvidia.com/blog/cuda-10-features-revealed/) reads from/writes to the same
CUDA virtual memory addresses during graph replay as it did during graph capture.
@@ -385,22 +391,27 @@ captured and cached in the first `Run()`.
session = onnxrt.InferenceSession("matmul_2.onnx", providers=providers)
io_binding = session.io_binding()

# Pass gpu_graph_id to RunOptions through a run config entry
ro = onnxrt.RunOptions()
# gpu_graph_id is optional if the session uses only one cuda graph
ro.add_run_config_entry("gpu_graph_id", "1")

# Bind the input and output
io_binding.bind_ortvalue_input('X', x_ortvalue)
io_binding.bind_ortvalue_output('Y', y_ortvalue)

# One regular run for the necessary memory allocation and cuda graph capturing
session.run_with_iobinding(io_binding)
session.run_with_iobinding(io_binding, ro)
expected_y = np.array([[5.0], [11.0], [17.0]], dtype=np.float32)
np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

# After capturing, CUDA graph replay happens from this Run onwards
session.run_with_iobinding(io_binding)
session.run_with_iobinding(io_binding, ro)
np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

# Update input and then replay CUDA graph with the updated input
x_ortvalue.update_inplace(np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]], dtype=np.float32))
session.run_with_iobinding(io_binding)
session.run_with_iobinding(io_binding, ro)
```
* C/C++
```c++
@@ -429,6 +440,10 @@ captured and cached in the first `Run()`.
Ort::SessionOptions session_options;
api.SessionOptionsAppendExecutionProvider_CUDA_V2(static_cast<OrtSessionOptions*>(session_options), rel_cuda_options.get());

// Pass gpu_graph_id to RunOptions through a run config entry
Ort::RunOptions run_option;
// gpu_graph_id is optional if the session uses only one cuda graph
run_option.AddConfigEntry("gpu_graph_id", "1");

// Create IO bound inputs and outputs.
Ort::Session session(*ort_env, ORT_TSTR("matmul_2.onnx"), session_options);
@@ -459,15 +474,15 @@ captured and cached in the first `Run()`.
binding.BindOutput("Y", bound_y);

// One regular run for necessary memory allocation and graph capturing
session.Run(Ort::RunOptions(), binding);
session.Run(run_option, binding);

// After capturing, CUDA graph replay happens from this Run onwards
session.Run(Ort::RunOptions(), binding);
session.Run(run_option, binding);

// Update input and then replay CUDA graph with the updated input
x_values = {10.0f, 20.0f, 30.0f, 40.0f, 50.0f, 60.0f};
cudaMemcpy(input_data.get(), x_values.data(), sizeof(float) * x_values.size(), cudaMemcpyHostToDevice);
session.Run(Ort::RunOptions(), binding);
session.Run(run_option, binding);
```

* C# (future)
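
Because shapes must stay fixed for a given annotation id (see the constraints above), a common usage pattern is to bucket inputs by shape and dedicate one `gpu_graph_id` per bucket, falling back to "-1" (capture/replay disabled) once a graph budget is exhausted. The helper below is a hypothetical sketch of that pattern, not part of the ORT API; `graph_id_for_shape`, `registry`, and `max_graphs` are made-up names:

```python
# Hypothetical helper (not part of the ORT API): assign one gpu_graph_id per
# input-shape bucket, since shapes must not change for a given annotation id.

def graph_id_for_shape(shape, registry, max_graphs=8):
    """Return the gpu_graph_id string for a shape, or "-1" to disable
    capture/replay once the graph budget is exhausted."""
    if shape not in registry:
        if len(registry) >= max_graphs:
            return "-1"  # too many distinct shapes: fall back to regular runs
        registry[shape] = str(len(registry))  # assign ids "0", "1", ...
    return registry[shape]
```

The returned string would then be passed via `ro.add_run_config_entry("gpu_graph_id", ...)` (Python) or `run_option.AddConfigEntry("gpu_graph_id", ...)` (C++).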