Parallel inference of multiple models in different threads #18806
Comments
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
@jywu-msft, is additional info needed?
I have a very similar problem when trying to run a model from multiple threads. I can run n models in n distinct threads without a problem, but as soon as I try to run one model in more than one thread (roughly the pattern sketched below), I rapidly run into CUDA error 700. The problem is also less frequent with 1.17 or 1.16.3 than with 1.15.1.
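For clarity, here is a minimal sketch of the pattern I mean, with hypothetical model path and tensor names: one shared `Ort::Session`, several threads each calling `Run()` concurrently. This is a sketch, not my production code.

```cpp
// One shared session, n threads calling Run() concurrently.
// Model path and tensor names ("data", "output") are hypothetical.
#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared-session");
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_opts{};
  opts.AppendExecutionProvider_CUDA(cuda_opts);
  Ort::Session session(env, "model.onnx", opts);  // one session, n threads

  auto worker = [&session]() {
    std::array<int64_t, 4> shape{1, 3, 224, 224};
    std::vector<float> input(1 * 3 * 224 * 224, 0.5f);
    auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value in = Ort::Value::CreateTensor<float>(
        mem, input.data(), input.size(), shape.data(), shape.size());
    const char* in_names[] = {"data"};
    const char* out_names[] = {"output"};
    for (int i = 0; i < 100; ++i) {
      // The crash shows up only after some iterations, not on the first Run().
      auto out = session.Run(Ort::RunOptions{nullptr}, in_names, &in, 1,
                             out_names, 1);
    }
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < 4; ++t) pool.emplace_back(worker);
  for (auto& th : pool) th.join();
}
```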
I am having the same problem. Has anyone got a solution?
Still reproducible on the latest version at the moment -- 1.17.3.
Describe the issue
Use case:

- `Model 2` uses some of the outputs of `Model 1`,
- `Model 3` uses some of the outputs of `Model 1` and `Model 2`.

Since the `Run()` method always copies inputs to the GPU and outputs back to the host, I decided to use the `IoBinding` feature. The code I've got copies data to/from the GPU only when I need it to, and it works just fine in one thread. But when I try to execute multiple pipelines `Model 1` → `Model 2` → `Model 3` simultaneously from different threads, I start to get CUDA errors. The errors can vary, but `CUDA failure 900: operation not permitted when stream is capturing` and `CUDA failure 700: an illegal memory access was encountered` are very common.

This issue could be reproduced on different PCs with different OSes, CPUs, and GPUs, so this is not a problem of the environment.

I think the problem is that I manage GPU memory in the wrong way, but I've read all the available docs a couple of times, read a lot of tests/examples, and still cannot figure out what I'm doing wrong. A minimal sketch of the binding pattern I mean follows.
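This is a condensed single-threaded sketch, not my actual code; model paths, tensor names, and shapes are hypothetical. The point is that `Model 1`'s output stays in device memory and is bound directly as `Model 2`'s input, so `Run()` performs no host↔device copies in between:

```cpp
// Sketch: chain two models through a GPU-resident buffer via IoBinding.
// Paths ("model1.onnx"), names ("data", "features", "scores"), and shapes
// are hypothetical placeholders.
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <array>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "pipeline-sketch");
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_opts{};
  opts.AppendExecutionProvider_CUDA(cuda_opts);
  Ort::Session model1(env, "model1.onnx", opts);
  Ort::Session model2(env, "model2.onnx", opts);

  // Tensors described by this MemoryInfo live in CUDA device memory.
  Ort::MemoryInfo cuda_mem("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);

  // Device input buffer for Model 1 (hypothetical 1x3x224x224 floats).
  std::array<int64_t, 4> in_shape{1, 3, 224, 224};
  float* d_in = nullptr;
  cudaMalloc(&d_in, 3 * 224 * 224 * sizeof(float));
  // ... fill d_in, e.g. cudaMemcpy from a preprocessed host image ...
  Ort::Value in = Ort::Value::CreateTensor<float>(
      cuda_mem, d_in, 3 * 224 * 224, in_shape.data(), in_shape.size());

  // Intermediate device buffer shared by both models (hypothetical 1x1000).
  std::array<int64_t, 2> mid_shape{1, 1000};
  float* d_mid = nullptr;
  cudaMalloc(&d_mid, 1000 * sizeof(float));
  Ort::Value mid = Ort::Value::CreateTensor<float>(
      cuda_mem, d_mid, 1000, mid_shape.data(), mid_shape.size());

  // Model 1: input and output both bound to device memory.
  Ort::IoBinding b1(model1);
  b1.BindInput("data", in);
  b1.BindOutput("features", mid);
  model1.Run(Ort::RunOptions{nullptr}, b1);
  b1.SynchronizeOutputs();

  // Model 2: the same device buffer, now bound as input -- no host round-trip.
  Ort::IoBinding b2(model2);
  b2.BindInput("features", mid);
  b2.BindOutput("scores", cuda_mem);  // let ORT allocate the output on device
  model2.Run(Ort::RunOptions{nullptr}, b2);
  b2.SynchronizeOutputs();

  std::vector<Ort::Value> outs = b2.GetOutputValues();  // device tensors
  cudaFree(d_in);
  cudaFree(d_mid);
}
```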
To reproduce
I've tried to reproduce this issue in an isolated program while reusing my original code as much as I could.
The reproducer is based on one of the official examples, and the `IoBinding` code is adapted from ONNX Runtime tests. Attaching the reproducer code as a file since it has 400+ LOC: batch-model-explorer.cpp.
For reproduction I've used one of the public models: resnet50-v1-7.onnx.
Using ONNX Runtime v1.14.1, I was able to reproduce my issue with 4/6/8 threads. Snippet from the reproducer output:
Using ONNX Runtime v1.16.3 (latest at the moment), I couldn't get errors with the reproducer, but I got the same errors from production code, so I assume the problem is not solved, just harder to reproduce.
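For context, the threaded part of the reproducer boils down to roughly the shape below. This is a condensed sketch, not the attached file: tensor names and shapes are hypothetical, and only the structure matters — one shared session, with each thread creating its own `IoBinding` over device buffers it allocated itself.

```cpp
// Condensed sketch of the reproducer's structure (not the actual file):
// a shared Ort::Session, per-thread IoBinding over per-thread device buffers.
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <array>
#include <functional>
#include <thread>
#include <vector>

void pipeline(Ort::Session& session) {
  Ort::MemoryInfo cuda_mem("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);
  std::array<int64_t, 4> in_shape{1, 3, 224, 224};
  std::array<int64_t, 2> out_shape{1, 1000};
  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, 3 * 224 * 224 * sizeof(float));
  cudaMalloc(&d_out, 1000 * sizeof(float));

  // The IoBinding and its buffers are per-thread; only the session is shared.
  Ort::IoBinding binding(session);
  Ort::Value in = Ort::Value::CreateTensor<float>(
      cuda_mem, d_in, 3 * 224 * 224, in_shape.data(), in_shape.size());
  Ort::Value out = Ort::Value::CreateTensor<float>(
      cuda_mem, d_out, 1000, out_shape.data(), out_shape.size());
  binding.BindInput("data", in);
  binding.BindOutput("output", out);

  for (int i = 0; i < 50; ++i) {
    session.Run(Ort::RunOptions{nullptr}, binding);
    binding.SynchronizeOutputs();
  }
  cudaFree(d_in);
  cudaFree(d_out);
}

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro-sketch");
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_opts{};
  opts.AppendExecutionProvider_CUDA(cuda_opts);
  Ort::Session session(env, "resnet50-v1-7.onnx", opts);

  std::vector<std::thread> threads;
  for (int t = 0; t < 4; ++t) threads.emplace_back(pipeline, std::ref(session));
  for (auto& th : threads) th.join();
}
```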
Urgency
If this is indeed a bug in ONNX Runtime, then it at least blocks inference of one session on GPU in multithreaded apps.
If this is a problem in my code, then there's a huge hole in the docs describing how to use the CUDA EP + `IoBinding` + threads.
Platform
Linux
OS Version
CentOS 7, Arch Linux
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
v1.14.1, v1.16.3
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8