ONNXRuntime 1.18 crashing with TensorRT EP when dealing with big inputs #21001

Closed
sansrem opened this issue Jun 11, 2024 · 13 comments
Labels
ep:TensorRT issues related to TensorRT execution provider

@sansrem

sansrem commented Jun 11, 2024

Describe the issue

Testing ONNXRuntime 1.18 with the TensorRT EP, using either TensorRT 10.0.1 or 8.5.3.
The TensorRT 10.0.1 tests use onnxruntime-linux-x64-gpu-1.18.0.tgz directly; the TensorRT 8.5.3 tests use ONNXRuntime 1.18 recompiled against TensorRT 8.5.3.

With TensorRT 10.0.1 our model crashes when given 2 input images of 4K UHDTV size (3840x2167),
with this error in the shell:
Error [Non-zero status code returned while running TRTKernel_graph_torch_jit_5378504288688145163_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_5378504288688145163_0_0' Status Message: TensorRT EP failed to create engine from network.]
and this call stack:

#5 0x00007fc7f0c30cf0 in () at /lib64/libpthread.so.0
#6 0x00007fbe6b9d8102 in onnxruntime::TensorrtExecutionProvider::CreateNodeComputeInfoFromGraph(onnxruntime::GraphViewer const&, onnxruntime::Node const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, unsigned long> > >&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, unsigned long> > >&, std::vector<onnxruntime::NodeComputeInfo, std::allocatoronnxruntime::NodeComputeInfo >&)::{lambda(void*, OrtApi const*, OrtKernelContext*)#3}::operator()(void*, OrtApi const*, OrtKernelContext*) const [clone .isra.2141] ()
at PATH/libonnxruntime_providers_tensorrt.so
#7 0x00007fbe6b9dae50 in std::_Function_handler<onnxruntime::common::Status (void*, OrtApi const*, OrtKernelContext*), onnxruntime::TensorrtExecutionProvider::CreateNodeComputeInfoFromGraph(onnxruntime::GraphViewer const&, onnxruntime::Node const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, unsigned long> > >&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, unsigned long> > >&, std::vector<onnxruntime::NodeComputeInfo, std::allocatoronnxruntime::NodeComputeInfo >&)::{lambda(void*, OrtApi const*, OrtKernelContext*)#3}>::_M_invoke(std::_Any_data const&, void*&&, OrtApi const*&&, OrtKernelContext*&&) ()
at PATH/libonnxruntime_providers_tensorrt.so
#8 0x00007fc7cd2923c1 in onnxruntime::FunctionKernel::Compute(onnxruntime::OpKernelContext*) const ()
at PATH/libonnxruntime.so.1.18.0
#9 0x00007fc7cd33272f in onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) () at PATH/libonnxruntime.so.1.18.0
#10 0x00007fc7cd32a5ef in onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) () at PATH/libonnxruntime.so.1.18.0
#11 0x00007fc7cd335723 in onnxruntime::RunSince(unsigned long, onnxruntime::StreamExecutionContext&, onnxruntime::SessionScope&, bool const&, unsigned long) () at PATH/libonnxruntime.so.1.18.0
#12 0x00007fc7cd3308d1 in onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::hash, std::equal_to, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool) ()
at PATH/libonnxruntime.so.1.18.0
#13 0x00007fc7cd303ccf in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::hash, std::equal_to, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection*, bool, onnxruntime::Stream*) ()
at PATH/libonnxruntime.so.1.18.0
#14 0x00007fc7cd30659c in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollectionHolder&, bool, onnxruntime::Stream*) ()
at PATH/libonnxruntime.so.1.18.0
#15 0x00007fc7cd30696a in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator >&, ExecutionMode, OrtRunOptions const&, onnxruntime::DeviceStreamCollectionHolder&, onnxruntime::logging::Logger const&) () at PATH/libonnxruntime.so.1.18.0
#16 0x00007fc7ccb5500a in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator >, std::vector<OrtDevice, std::allocator > const) [clone .localalias.2030] () at PATH/libonnxruntime.so.1.18.0
#17 0x00007fc7ccb558e0 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>) () at PATH/libonnxruntime.so.1.18.0
#18 0x00007fc7ccae253c in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) ()
at PATH/libonnxruntime.so.1.18.0

Running the same model with ONNXRuntime 1.18 and TensorRT 8.5.3 works with these inputs (3840x2167) and still works with 6K (6531x3100), but crashes with 8K (7680x4320).

When running with TensorRT 10.0.1 on a machine with lower compute capability (for example, one where nvidia-smi --query-gpu=compute_cap --format=csv returns 6.1), ONNXRuntime crashes with the same error and call stack with 2 HD images (1920x1080).

So here are the observations:
1- ONNXRuntime should never crash; in every case it should return an error instead.
2- In our case moving to TensorRT 10 is not an option, as it crashes on older machines and cannot handle image sizes that TensorRT 8.5.3 handles.

To reproduce

Running a model that takes large images as input with the TensorRT EP makes the software crash.

Urgency

No response

Platform

Linux

OS Version

Rocky Linux 8.7/9.3

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18

ONNX Runtime API

C++

Architecture

X86

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.8 TensorRT 10.0.1 or 8.5.3

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider labels Jun 11, 2024
@chilo-ms
Contributor

chilo-ms commented Jun 12, 2024

The error message "TensorRT EP failed to create engine from network" indicates that something went wrong when the TRT EP called
TRT's API buildSerializedNetwork(). Since it happens when dealing with large images, I suspect it's due to OOM.

Could you increase trt_max_workspace_size to see whether that helps? The default is 1 GB.

Also, quick question, can you repro the issue using trtexec?
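
For readers following along, here is a minimal sketch of passing trt_max_workspace_size as a TensorRT EP provider option through the Python API. The option key and the "TensorrtExecutionProvider" name come from the ONNX Runtime docs; the helper names, the model path argument, and the 4 GiB value are placeholders chosen for illustration. The value is given in bytes.

```python
GIB = 1024 ** 3  # bytes per GiB


def trt_providers(workspace_gib):
    """Build a providers list for onnxruntime.InferenceSession.

    Hypothetical helper: only the provider name and the
    trt_max_workspace_size key are real ONNX Runtime names.
    """
    trt_options = {"trt_max_workspace_size": int(workspace_gib * GIB)}
    return [("TensorrtExecutionProvider", trt_options)]


def create_session(model_path, workspace_gib):
    # Imported lazily so trt_providers() can be used without onnxruntime installed.
    import onnxruntime as ort

    return ort.InferenceSession(model_path, providers=trt_providers(workspace_gib))


providers = trt_providers(4)
print(providers)
```

Calling create_session("model.onnx", 4) would then build the session with a 4 GiB workspace limit, assuming a GPU build of onnxruntime with the TensorRT EP is installed.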

@sophies927 sophies927 removed the ep:CUDA issues related to the CUDA execution provider label Jun 13, 2024
@sansrem
Author

sansrem commented Jun 13, 2024 via email

@chilo-ms
Contributor

chilo-ms commented Jun 13, 2024

I tried with trt_max_workspace_size <https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#trt_max_workspace_size> set to 2G, 4G, 8G with the same result, also getting this additional warning if it is set greater than 1G

Hmm that's strange. Could you share the code that set trt_max_workspace_size?
Please see the example code here.

As for trtexec, some models are not fully TRT eligible, and that seems to be the case for your model, so trtexec won't be able to run it. How about trtexec with TRT 10?
Could you share the proxy model so that we can repro on our side? Or could you point to a public model that reproduces the issue?

@sansrem
Author

sansrem commented Jun 14, 2024 via email

@jywu-msft
Member

we'll sync with @skottmckay to get the model

@geraldstanje

what is trt_max_workspace_size ?

@sansrem
Author

sansrem commented Jun 20, 2024 via email

@chilo-ms
Contributor

chilo-ms commented Jun 24, 2024

what is trt_max_workspace_size ?

The value of trt_max_workspace_size sets the size limit of TensorRT's workspace memory pool.
See the TRT docs:
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a125336eeaa69c11d9aca0535449f0391
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_builder_config.html#a0a88a9b43bbe47c839ba65de9b40779f

2024-06-14 11:34:30.480769226 [E:onnxruntime:CF, tensorrt_execution_provider.h:82 log] [2024-06-14 15:34:30 ERROR] 4: [optimizer.cpp::computeCosts::3726] Error Code 4: Internal Error (Could not find any implementation for node {ForeignNode[onnx::Cast_507[Constant]...Concat_372]} due to insufficient workspace.

The error message says "insufficient workspace", so it seems 8G is not enough.
Could you set the value to the GPU's total memory size?

BTW, you can also monitor GPU memory usage while running inference to see the memory consumption.
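
One way to do that monitoring, sketched under the assumption that nvidia-smi is on the PATH: query memory.used and memory.total in CSV form and parse the result. The function names and the sample numbers are made up for illustration; the nvidia-smi query flags are the standard ones.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (MiB) into (used, total) int tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi once; needs an NVIDIA driver on the machine."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


# Sample text in the shape nvidia-smi prints for this query (made-up numbers).
sample = "1523, 16384\n"
print(parse_memory(sample))
```

Calling read_gpu_memory() periodically from a second process while the engine is being built would show how close usage gets to the card's total.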

@chilo-ms
Contributor

chilo-ms commented Jun 24, 2024

I did get the model from Scott, but I encountered a different issue, which seems related to a Concat's axis attribute.

trtexec 10.0.1 and 8.6 -> This version of TensorRT does not support dynamic axes
TRT EP -> Error Code 4: Miscellaneous (IConcatenationLayer Concat_75: Concat_75: axis 3 dimensions must be equal for concatenation on axis 1.)

Will check with Scott, or could you share the model again to make sure i'm using the same model as you?

@geraldstanje

geraldstanje commented Jun 24, 2024

@chilo-ms trt_max_workspace_size depends on the gpu memory? The Nvidia T4 has 16 GB GDDR6 memory - so i can set 16 GB for trt_max_workspace_size?

@chilo-ms
Contributor

chilo-ms commented Jun 24, 2024

@chilo-ms trt_max_workspace_size depends on the gpu memory? The Nvidia T4 has 16 GB GDDR6 memory - so i can set 16 GB for trt_max_workspace_size?

yes, give it a try.

@chilo-ms
Contributor

chilo-ms commented Jun 28, 2024

Update here.

I saw a similar OOM message with a 2G workspace size when running with an input of two 4K images (1x6x3840x2176).
Then I increased the workspace size to 16G ('trt_max_workspace_size': 17179869184) and the TRT EP successfully ran the model with the two-4K input.
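
As a quick sanity check of the numbers above: the option is given in bytes, and 16 GiB works out to 17179869184. The sketch below also sizes the 1x6x3840x2176 input tensor, assuming float32 elements (the element type is not stated in this thread).

```python
GIB = 1024 ** 3  # bytes per GiB

# Workspace limit used in the successful run, expressed in bytes.
workspace_bytes = 16 * GIB

# Input shape quoted above: two 4K RGB frames stacked on the channel axis.
shape = (1, 6, 3840, 2176)
elements = 1
for dim in shape:
    elements *= dim

BYTES_PER_ELEMENT = 4  # assumption: float32 input
input_bytes = elements * BYTES_PER_ELEMENT

print(workspace_bytes)  # 17179869184
print(elements)         # 50135040
print(input_bytes)      # 200540160
```

Under that assumption the input itself is roughly 191 MiB, far below the workspace limit; the workspace is scratch memory that TensorRT's layer implementations request while building and running the engine, so it is not bounded by the input size alone.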

@jywu-msft
Member

closing this since @chilo-ms provided last update on increasing workspace size.
