
cuda graph enhancement #19636

Merged
merged 31 commits into from
Mar 7, 2024
Conversation

wangyems (Contributor) commented Feb 24, 2024

Description

  1. Add a config key in run_options to control CUDA graph capture/replay at runtime.
  2. Enhance the CUDA graph class to support saving and retrieving multiple graphs in one ORT session.
  3. Provide a model modification/inference example on Phi-2.
  4. Benchmarks show an average 13% latency reduction in token generation.

Limitation: the TRT EP and ROCm EP have not adopted this feature yet; we can revisit this in the future.
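To illustrate item 2, here is a minimal sketch in plain Python (illustrative names only, not the ONNX Runtime API): one session holds several captured graphs keyed by a caller-chosen annotation id supplied via run options, so prompt processing and token generation can each replay their own graph.

```python
# Sketch of multi-graph saving/retrieval in one session. The "graph" is a
# stand-in callable; real CUDA graph capture/replay happens inside the EP.
class MultiGraphSession:
    def __init__(self):
        self._graphs = {}  # annotation id -> captured graph

    def run(self, annotation_id, compute):
        if annotation_id not in self._graphs:
            result = compute()                    # regular run during capture
            self._graphs[annotation_id] = compute # "capture" under this id
            return result
        return self._graphs[annotation_id]()      # replay the captured graph

sess = MultiGraphSession()
prompt_out = sess.run("prompt", lambda: "logits-prompt")
decode_out = sess.run("decode", lambda: "logits-decode")
```

Each annotation id keeps its own entry, so switching between phases never evicts or overwrites another phase's graph.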

Motivation and Context

```python
def generate(self, prompt, max_length):
    encodings_dict = self.tokenizer.batch_encode_plus(prompt, padding=True)
    ...

def generate_impl(self, encodings_dict, max_length, cuda_graph_annotation, benchmark=False):
    ...
```

Check notice (Code scanning / CodeQL): Explicit returns mixed with implicit (fall-through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
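For context, the CodeQL note concerns functions where some paths return a value and others fall off the end. A minimal illustration (not code from the PR):

```python
def pick_positive_mixed(x):
    # Flagged pattern: explicit return on one path,
    # implicit fall-through (returns None) on the other.
    if x > 0:
        return x

def pick_positive_explicit(x):
    # Preferred: same behavior, but the None return is explicit.
    if x > 0:
        return x
    return None
```

Both functions behave identically; the explicit form just makes the None-returning path visible to readers and analyzers.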
wangyems marked this pull request as ready for review February 26, 2024 17:02
hanbitmyths requested a review from souptc February 27, 2024 06:55
jywu-msft requested a review from chilo-ms February 27, 2024 17:03
tianleiwu (Contributor) commented Feb 27, 2024

Currently we do not protect tensors copied to GPU memory. That means that when capturing another CUDA graph, those tensors might be overwritten by another run. Is it possible to protect the tensors output by MemcpyFromHost when CUDA graph annotation is enabled?

Edit: we currently do not allow CUDA graph capture for a model with a MemcpyFromHost node, so this is fine for now. We can treat it as a feature request to support models with MemcpyFromHost nodes; it need not be done in this pull request.

tianleiwu (Contributor) left a comment:

LGTM.

Please add another PR to update the documentation for the new run option.

hariharans29 previously approved these changes Mar 7, 2024
wangyems merged commit 72ce4de into main Mar 7, 2024
93 of 95 checks passed
wangyems deleted the wangye/cuda_graph_run_options branch March 7, 2024 18:15
wangyems added a commit that referenced this pull request Mar 7, 2024
### Description

docs for #19636
```cpp
// previous check:
return regular_run_count_before_graph_capture_ >= min_num_runs_before_cuda_graph_capture_;

// updated check (excerpt):
bool CUDAExecutionProvider::PerThreadContext::IsGraphCaptureAllowed(
    CudaGraphAnnotation_t cuda_graph_annotation_id) const {
  return regular_run_count_before_graph_capture_ >= min_num_runs_before_cuda_graph_capture_ &&
```
Contributor:

Need a regular run counter for each cuda_graph_annotation_id.
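The point above can be sketched in a few lines (hypothetical names, not the ORT implementation): a single shared warm-up counter would let a newly introduced annotation id skip its warm-up runs, so the counter must be tracked per annotation id.

```python
MIN_RUNS_BEFORE_CAPTURE = 2

run_counts = {}  # cuda_graph_annotation_id -> regular runs seen so far

def record_regular_run(annotation_id):
    run_counts[annotation_id] = run_counts.get(annotation_id, 0) + 1

def is_graph_capture_allowed(annotation_id):
    # Capture only after this specific annotation id has warmed up.
    return run_counts.get(annotation_id, 0) >= MIN_RUNS_BEFORE_CAPTURE

record_regular_run("prompt")
record_regular_run("prompt")
# "prompt" has warmed up; "decode" has not, even though the session is warm.
```

With a shared counter, `is_graph_capture_allowed("decode")` would wrongly return True here.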

Member:

Good point.

Contributor (author):

#19856 for the bug fix.
