
cuda graph enhancement #19636

Merged
merged 31 commits into from
Mar 7, 2024
Conversation

wangyems (Contributor) commented Feb 24, 2024

Description

  1. Add a config key in run_options to control CUDA graph capture/replay at runtime.
  2. Enhance the CUDA graph class to support saving and retrieving multiple graphs in one ORT session.
  3. Provide a model modification/inference example on Phi-2.
  4. Benchmarks show an average 13% latency reduction in token generation.

Limitation: the TRT EP and ROCm EP have not adopted this feature yet; we can revisit this in the future.
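To illustrate item 2, here is a minimal sketch in plain Python (illustrative names only, not the ONNX Runtime API): one session holds several captured graphs keyed by a caller-chosen annotation id supplied via run options, so prompt processing and token generation can each replay their own graph.

```python
# Sketch of multi-graph saving/retrieval in one session. The "graph" is a
# stand-in callable; real CUDA graph capture/replay happens inside the EP.
class MultiGraphSession:
    def __init__(self):
        self._graphs = {}  # annotation id -> captured graph

    def run(self, annotation_id, compute):
        if annotation_id not in self._graphs:
            result = compute()                    # regular run during capture
            self._graphs[annotation_id] = compute # "capture" under this id
            return result
        return self._graphs[annotation_id]()      # replay the captured graph

sess = MultiGraphSession()
prompt_out = sess.run("prompt", lambda: "logits-prompt")
decode_out = sess.run("decode", lambda: "logits-decode")
```

Each annotation id keeps its own entry, so switching between phases never evicts or overwrites another phase's graph.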

Motivation and Context

```python
def generate(self, prompt, max_length):
    encodings_dict = self.tokenizer.batch_encode_plus(prompt, padding=True)
    ...

def generate_impl(self, encodings_dict, max_length, cuda_graph_annotation, benchmark=False):
    ...
```

Check notice (Code scanning / CodeQL): Explicit returns mixed with implicit (fall-through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
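For context, the CodeQL note concerns functions where some paths return a value and others fall off the end. A minimal illustration (not code from the PR):

```python
def pick_positive_mixed(x):
    # Flagged pattern: explicit return on one path,
    # implicit fall-through (returns None) on the other.
    if x > 0:
        return x

def pick_positive_explicit(x):
    # Preferred: same behavior, but the None return is explicit.
    if x > 0:
        return x
    return None
```

Both functions behave identically; the explicit form just makes the None-returning path visible to readers and analyzers.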
wangyems marked this pull request as ready for review February 26, 2024 17:02
hanbitmyths requested a review from souptc February 27, 2024 06:55
jywu-msft requested a review from chilo-ms February 27, 2024 17:03
tianleiwu (Contributor) commented Feb 27, 2024

Currently we do not protect tensors copied to GPU memory. That means that when capturing another CUDA graph, those tensors might be overwritten by another run. Is it possible to protect the tensors output by MemcpyFromHost when CUDA graph annotation is enabled?

Edit: we currently do not allow CUDA graph capture for a model with a MemcpyFromHost node, so this is fine for now. We can treat it as a feature request to support models with MemcpyFromHost nodes; it need not be done in this pull request.

tianleiwu (Contributor) left a comment:

LGTM.

Please add another PR to update the documentation for the new run option.

hariharans29 previously approved these changes Mar 7, 2024
wangyems merged commit 72ce4de into main Mar 7, 2024
93 of 95 checks passed
wangyems deleted the wangye/cuda_graph_run_options branch March 7, 2024 18:15
wangyems added a commit that referenced this pull request Mar 7, 2024
### Description

docs for #19636
```cpp
// previous check:
return regular_run_count_before_graph_capture_ >= min_num_runs_before_cuda_graph_capture_;

// updated check (excerpt):
bool CUDAExecutionProvider::PerThreadContext::IsGraphCaptureAllowed(
    CudaGraphAnnotation_t cuda_graph_annotation_id) const {
  return regular_run_count_before_graph_capture_ >= min_num_runs_before_cuda_graph_capture_ &&
```
Contributor:

Need a regular run counter for each cuda_graph_annotation_id.
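The point above can be sketched in a few lines (hypothetical names, not the ORT implementation): a single shared warm-up counter would let a newly introduced annotation id skip its warm-up runs, so the counter must be tracked per annotation id.

```python
MIN_RUNS_BEFORE_CAPTURE = 2

run_counts = {}  # cuda_graph_annotation_id -> regular runs seen so far

def record_regular_run(annotation_id):
    run_counts[annotation_id] = run_counts.get(annotation_id, 0) + 1

def is_graph_capture_allowed(annotation_id):
    # Capture only after this specific annotation id has warmed up.
    return run_counts.get(annotation_id, 0) >= MIN_RUNS_BEFORE_CAPTURE

record_regular_run("prompt")
record_regular_run("prompt")
# "prompt" has warmed up; "decode" has not, even though the session is warm.
```

With a shared counter, `is_graph_capture_allowed("decode")` would wrongly return True here.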

Member:

Good point.

Contributor (author):

#19856 for the bug fix.
