Broken multithreading inference session Onnxruntime-directml >= 1.18 #20713
Comments
Tagging @PatriceVignola @smk2007 @fdwr for visibility.
Same here on Windows: versions 1.16.0 to 1.17.3 work fine over multiple threads, however 1.18.0 gives:
We’ve noted the issue with GPU resource contention caused by multiple threads. This usage pattern is not recommended because it makes multiple threads request all of the GPU resources, which can cause contention. Also, the allocator in the Python API (both CUDA and DML) is explicitly not thread safe, because it is initialized as a global singleton that lives outside of the session. We’re investigating the recent failure and will address it. Meanwhile, please avoid this pattern to prevent GPU contention.
Hi @liuyunms Sorry to bother you. I'm currently using one InferenceSession per thread, but you say it shouldn't be used this way (4 threads -> 4 inference sessions with the same GPU). Do you mean to use the same InferenceSession in multiple threads? Is that possible? (4 threads -> 1 inference session with the same GPU)
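For reference, sharing one session between threads would look roughly like the sketch below. This is a minimal illustration, not the project's code: `DummySession` is a hypothetical stand-in for `onnxruntime.InferenceSession` (which, with the real library, you would create once with `providers=["DmlExecutionProvider"]` before starting the threads), and the ONNX Runtime C API documents `Run` as safe to call concurrently on a single session.

```python
import threading

# Hypothetical stand-in for onnxruntime.InferenceSession; a real session
# would be created ONCE, before the threads start, e.g.:
#   session = onnxruntime.InferenceSession("model.onnx",
#                                          providers=["DmlExecutionProvider"])
class DummySession:
    def run(self, output_names, input_feed):
        # A real session returns a list of output arrays.
        return [sum(input_feed["x"])]

session = DummySession()   # one session, shared by all threads
results = [None] * 4

def worker(i):
    # All threads call run() on the SAME session object.
    results[i] = session.run(None, {"x": [i, i + 1]})[0]

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results is now [1, 3, 5, 7]
```

Whether this pattern actually avoids the hang on DirectML 1.18+ is not confirmed in this thread; it only illustrates the "n threads -> 1 session" shape being asked about.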
@PatriceVignola @smk2007 @fdwr Hi, sorry to bother you, is there any news on this problem? I am currently testing 1.18.1 and the problem is still present :( Thank you
@zhangxiang1993 It still crashes using multiple threads in my application. I just tried the nightly build of 1.19 from here (I used the Python 3.11 build for Windows). I've reverted back to 1.17.3, which still works.
Hi, I can confirm that the problem is also present on the 1.19 nightly (Python 3.11).
Not sure if this helps, but I have this method to work around it:

```python
import threading
from contextlib import nullcontext
from typing import ContextManager, Union

THREAD_SEMAPHORE: threading.Semaphore = threading.Semaphore()
NULL_CONTEXT: ContextManager[None] = nullcontext()

def conditional_thread_semaphore() -> Union[threading.Semaphore, ContextManager[None]]:
    # has_execution_provider() is defined elsewhere in my codebase
    if has_execution_provider('directml') or has_execution_provider('rocm'):
        return THREAD_SEMAPHORE
    return NULL_CONTEXT
```

Usage:

```python
with conditional_thread_semaphore():
    onnxruntime.run()
```

Sorry, but implement
This works; however, it defeats the purpose of running in multiple threads. In older versions that do not crash I can have the GPU running at 100%, but this workaround causes a very large performance hit.
Yeah, the performance hit is something I am aware of.
Hi @henryruhs Thank you. I tried the solution using the Semaphore and it works, but the performance is in line with using only one thread. Hopefully they will fix the problem in the next release.
This problem is significant, so most of us will remain on version 1.17.3. Please fix it. |
@linyu0219 It's unfortunate that the Python API is broken like this. The official docs for the DirectML provider say, "Multiple threads are permitted to call Run simultaneously if they operate on different inference session objects", yet this is apparently not true if you use Python. 😟 Can we please get a fix for the Python API?
When unloading an inference session in a multithreading scenario, it crashes the whole application. I assume various threads try to access None while still expecting an inference session. This is a DirectML-only issue; we had to downgrade to 1.17.3 as well.
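One way to avoid the teardown race described above is to guard both inference and unload with the same lock, so no thread can observe the session as `None` mid-call. This is a hedged sketch, not the project's fix: `infer_safely` and `unload_session` are hypothetical names, and `object()` stands in for a real `onnxruntime.InferenceSession`.

```python
import threading

_session_lock = threading.Lock()
_session = object()  # placeholder for an onnxruntime.InferenceSession

def infer_safely(run_fn):
    # Hold the lock for the whole call so a concurrent unload cannot
    # replace the session with None while we are using it.
    with _session_lock:
        if _session is None:
            raise RuntimeError("session already unloaded")
        return run_fn(_session)

def unload_session():
    global _session
    with _session_lock:
        _session = None  # threads now get a clean error, not a crash
```

Note that holding the lock across the whole call serializes inference, the same trade-off as the semaphore workaround earlier in this thread; it only turns the crash into a catchable error.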
I'm forced to use my own C++ Pybind11 wrapper since the official Python wrapper is broken for multithreading. 😟 Edit:
Tested the RC directml 1.20.0.dev20241022005; the problem has not been solved. In fact, it has gotten worse: with 1.20 the GPU driver crashes.
@saulthu It sounds like you understood the underlying issue... Is this just a binding issue? Can you send a pull request?
@henryruhs Sorry, I don't have a patch to provide. I have been a user of the Python bindings for a while. The only real info I have to go on for the true cause is the comment by @liuyunms.
I have written my own pybind11 wrapper using the C++ API, which appears to run without issue, and also seems to run with better parallelism -- I guess I'm releasing the GIL for longer? Here's the rough bit of code that does the job for me, until the official Python wrapper is fixed: https://gist.github.com/saulthu/c60a8f1f10352e98a986e57205cedd49
Hi @PatriceVignola @smk2007 @fdwr @liuyunms Is there any news about this issue?
Describe the issue
With the new version 1.18, when trying to use different InferenceSession objects on the same DirectML device from multiple threads, all threads stall without raising any exception or error.
To reproduce
Thread 1
Thread n (where n can be any number)
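The per-thread repro steps above can be sketched as follows. This is a hedged reconstruction, since the original snippets were lost in extraction: `"model.onnx"` and the input name `"x"` are placeholders, and the imports are guarded so the sketch is self-contained even without the packages installed.

```python
import threading

try:
    import numpy as np
    import onnxruntime as ort
except ImportError:  # let the sketch run without the packages installed
    ort = None

def worker():
    # Each thread creates its OWN session on the same DirectML device;
    # on 1.18+ these calls reportedly stall, on <= 1.17.3 they complete.
    if ort is None:
        return
    session = ort.InferenceSession("model.onnx",  # placeholder model path
                                   providers=["DmlExecutionProvider"])
    session.run(None, {"x": np.zeros((1, 3), dtype=np.float32)})

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # reportedly never returns on the affected versions
```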
Urgency
No response
Platform
Windows
OS Version
10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
1.18.0