Null Tensor as output when using TensorRT without session io bindings #18509

Open
cozeybozey opened this issue Nov 20, 2023 · 15 comments
Labels
ep:CUDA · ep:TensorRT · platform:windows

Comments

@cozeybozey

Describe the issue

When I use the TensorRT execution provider and call session.run like this:
```cpp
std::vector<Ort::Value> lOutputTensors = mOrtSession->Run(Ort::RunOptions{ nullptr }, mInputNodeNamePointers.data(), lInputTensors.data(),
    mInputNodeNamePointers.size(), mOutputNodeNamePointers.data(), mOutputNodeNamePointers.size());
```
Then I get errors when calling lOutputTensors[0].GetTensorMutableData() saying that the output is null. This does not happen when I use the CUDA execution provider, and when I use session io bindings with TensorRT I have no issues either. But there are some reasons why I do not want to use session io bindings, so I would like it to work without them. I did find someone online who seemingly got it to work without session io bindings, but I saw that he was using an experimental version of onnxruntime:
#14614

So a follow-up question: if the experimental version is necessary, how do I build it?

To reproduce

Using TensorRT as your execution provider and then calling session.run() like this:
```cpp
std::vector<Ort::Value> lOutputTensors = mOrtSession->Run(Ort::RunOptions{ nullptr }, mInputNodeNamePointers.data(), lInputTensors.data(),
    mInputNodeNamePointers.size(), mOutputNodeNamePointers.data(), mOutputNodeNamePointers.size());
```
Followed by lOutputTensors[0].GetTensorMutableData()

Urgency

No response

Platform

Windows

OS Version

10.0.19045

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.16.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

8.6.1.6

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider platform:windows issues related to the Windows platform labels Nov 20, 2023
@jywu-msft
Member

@chilo-ms , can you assist?

@chilo-ms
Contributor

chilo-ms commented Nov 29, 2023

You don't have to use the experimental version of onnxruntime; using the APIs from onnxruntime_cxx_api.h should work.
I can't repro on my side, so could you share your code that creates the OrtTensorRTProviderOptionsV2 and calls SessionOptionsAppendExecutionProvider_TensorRT_V2?
More lines of code might help us investigate.

https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#samples
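For reference, a minimal sketch of what that setup usually looks like with the public API from onnxruntime_cxx_api.h (the option keys and values below are illustrative placeholders, not the issue author's actual configuration):

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch: create OrtTensorRTProviderOptionsV2, set options through the
// key/value interface, and append the TensorRT EP to the session options.
Ort::Session CreateTrtSession(Ort::Env& env, const ORTCHAR_T* model_path) {
    const OrtApi& api = Ort::GetApi();

    OrtTensorRTProviderOptionsV2* trt_options = nullptr;
    Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));

    // Placeholder keys/values; adjust for your setup.
    const char* keys[]   = {"device_id", "trt_engine_cache_enable"};
    const char* values[] = {"0",         "1"};
    Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 2));

    Ort::SessionOptions session_options;
    Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(
        session_options, trt_options));
    api.ReleaseTensorRTProviderOptions(trt_options);

    return Ort::Session(env, model_path, session_options);
}
```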

@cozeybozey
Author

This is where I am creating and appending the options:
[screenshot: code creating OrtTensorRTProviderOptionsV2 and appending the TensorRT EP]

@cozeybozey
Author

Hi, I was wondering if there is any follow up on this issue?

@jywu-msft
Member

Sorry for the delay. I didn't spot anything wrong with your options. Would it be possible for you to provide a small test case (model + code) which can reproduce the null behavior? I will ask @yf711 to investigate it.

@cozeybozey
Author

Unfortunately it is not super easy for me to provide a small test case, but I will look into it. In the meantime I also found that TensorRT does actually work for the first batch, but all subsequent batches do not. I only figured this out now because the first batch is always nonsense data from a warmup function, so I never checked the output there.

@cozeybozey
Author

Sorry for the delay on the small test case; there are some issues with yolov8 models and licensing. But I did some more testing myself and found that I do not run into the issue if I turn trt_cuda_graph_enable off. Does that make sense, or does that mean there is a bug in trt_cuda_graph_enable? I also found that this fixed my issue with running multiple parallel threads that call model.run on the same session.
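(For context, this flag is set through the same V2 key/value options interface; a minimal sketch, assuming trt_options is the OrtTensorRTProviderOptionsV2* created during setup:)

```cpp
// Sketch: disable CUDA graph capture on the TensorRT EP ("0" = off, "1" = on).
const char* keys[]   = {"trt_cuda_graph_enable"};
const char* values[] = {"0"};
Ort::ThrowOnError(Ort::GetApi().UpdateTensorRTProviderOptions(trt_options, keys, values, 1));
```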

@jywu-msft
Member

> Sorry for the delay on the small test case; there are some issues with yolov8 models and licensing. But I did some more testing myself and found that I do not run into the issue if I turn trt_cuda_graph_enable off. Does that make sense, or does that mean there is a bug in trt_cuda_graph_enable? I also found that this fixed my issue with running multiple parallel threads that call model.run on the same session.

Ah, that's a key piece of information. There are some known bugs in 1.16.1.
Would it be possible for you to test with the latest main (build from source), or wait until 1.17 is released later this month?

@cozeybozey
Author

Good to know that there might be a problem in 1.16.1, that is indeed the version I am using. I am not in a huge rush, so I will wait for the 1.17 release.

@cozeybozey
Author

Hello, it has been a while, but I am working with TensorRT again and I wanted to say that I am experiencing the same issue even with onnxruntime 1.17. It is also still the case that it does work when I turn off trt_cuda_graph_enable. So I was wondering whether you can maybe look into this issue again. And how problematic is it to turn trt_cuda_graph_enable off? Will that significantly impact performance?

@chilo-ms
Contributor

chilo-ms commented Jul 29, 2024

Hi,
The null tensor output you saw when using the TRT EP with trt_cuda_graph_enable enabled is due to not using io-bindings.

Please see the constraints of using cuda graph in the doc:

> By design, CUDA Graphs is designed to read from/write to the same CUDA virtual memory addresses during the graph replaying step as it does during the graph capturing step. Due to this requirement, usage of this feature requires using IOBinding so as to bind memory which will be used as input(s)/output(s) for the CUDA Graph machinery to read from/write to.

Also, if you want to use multithreading, please make sure only one thread initializes the ORT session instance, and then have multiple parallel threads that call model.run on the same session.

(Note: ORT should warn the users if they use cuda graph without io-bindings.)
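For reference, the io-binding pattern the doc is describing looks roughly like this; a minimal sketch, assuming session is an Ort::Session using the TRT EP and input_tensor is an Ort::Value already allocated in CUDA memory ("input" and "output" are placeholder node names):

```cpp
// Sketch: bind input/output through Ort::IoBinding so the CUDA addresses
// stay stable across runs, as CUDA graph capture/replay requires.
Ort::MemoryInfo cuda_mem_info("Cuda", OrtDeviceAllocator, /*device_id=*/0, OrtMemTypeDefault);

Ort::IoBinding binding(session);
binding.BindInput("input", input_tensor);     // must point at device memory
binding.BindOutput("output", cuda_mem_info);  // let ORT allocate the output on the device

session.Run(Ort::RunOptions{nullptr}, binding);
std::vector<Ort::Value> outputs = binding.GetOutputValues();
```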

@chilo-ms
Contributor

> And how problematic is it to turn trt_cuda_graph_enable off? Will that significantly impact performance?

It depends. The major advantage of CUDA graph is that it decreases kernel launch time, especially when the model has several CUDA kernels.

I suggest running the perf test against your model with cuda graph enabled and disabled to compare the results.
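A minimal timing sketch for such a comparison, assuming a session with pre-bound io built once with trt_cuda_graph_enable set to "1" and once to "0":

```cpp
#include <chrono>
#include <onnxruntime_cxx_api.h>

// Sketch: average steady-state Run() latency; the first call is excluded
// because it pays for engine build / graph capture.
double AverageRunMs(Ort::Session& session, Ort::IoBinding& binding, int iters) {
    session.Run(Ort::RunOptions{nullptr}, binding);  // warm-up
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        session.Run(Ort::RunOptions{nullptr}, binding);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}
```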

@cozeybozey
Author

Alright, thanks for the quick and clear response! I will investigate whether it affects performance.

@cozeybozey
Author

> Also, if you want to use multithreading, please make sure only one thread initializes the ORT session instance, and then have multiple parallel threads that call model.run on the same session.

Am I misunderstanding or do the docs say that you cannot use multithreading at all?
[screenshot of the multithreading section of the docs]

I think multithreading in general will be difficult if you have to use the exact same memory location for the input data, since that will lead to multiple threads overwriting each other's data.

@chilo-ms
Contributor

chilo-ms commented Jul 30, 2024

> Also, if you want to use multithreading, please make sure only one thread initializes the ORT session instance, and then have multiple parallel threads that call model.run on the same session.

After looking closer at ORT's cuda graph replay code in InferenceSession.Run() and the cuda graph doc, I might be wrong about my previous multithreading statement.

Even though InferenceSession.Run() is thread-safe, the doc says "cuda graph objects are not internally synchronized and must not be accessed concurrently from multiple threads", meaning that if multiple threads call Run() on the same inference session, they access the same cuda graph object concurrently, which is not recommended.

Therefore, we suggest not using multithreading with cuda graph enabled on ORT TRT.
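For completeness, the pattern that remains safe is one shared session with cuda graph disabled and per-thread tensors; a minimal sketch (MakeInput and the node names are hypothetical placeholders):

```cpp
#include <thread>
#include <vector>
#include <onnxruntime_cxx_api.h>

Ort::Value MakeInput();  // hypothetical helper: builds one thread's input tensor

// Sketch: Run() itself is thread-safe, so with trt_cuda_graph_enable off,
// multiple threads may share one session as long as each owns its tensors.
void RunConcurrently(Ort::Session& session, int num_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&session] {
            Ort::Value input = MakeInput();
            const char* input_names[]  = {"input"};   // placeholder node names
            const char* output_names[] = {"output"};
            auto outputs = session.Run(Ort::RunOptions{nullptr},
                                       input_names, &input, 1,
                                       output_names, 1);
        });
    }
    for (auto& w : workers) w.join();
}
```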
