DML EP One session but called in different threads. [Performance] #17686
This scenario is not supported. Most of the tensors are unique to each call, but onnxruntime may decide to cache some data to improve performance. It assumes it is called from one thread and that no other call happens at the same time. You need to create one session per thread. If you need to save memory, this page may help: https://onnxruntime.ai/docs/performance/tune-performance/memory.html#shared-arena-based-allocator
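A rough sketch of the shared-arena setup that page describes, assuming the standard C++ API (the model path is a placeholder; "session.use_env_allocators" is the documented config key):

#include <onnxruntime_cxx_api.h>

// Register one env-level CPU arena allocator that sessions can opt in to.
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared_arena");
Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
env.CreateAndRegisterAllocator(mem_info, /*arena_cfg*/ nullptr);  // nullptr -> default arena config

Ort::SessionOptions opts;
opts.AddConfigEntry("session.use_env_allocators", "1");  // opt in to the env allocator

// One session per thread, but both now share the same CPU arena.
Ort::Session session_a(env, L"model.onnx", opts);
Ort::Session session_b(env, L"model.onnx", opts);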
@hy846130226 There's no need to create multiple sessions for a model to run it concurrently. ORT is designed so that Run() can be called concurrently from multiple threads on the same session object. Here's sample code that demonstrates this: https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069
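A condensed sketch in the spirit of that gist (model path, input/output names, and shape are placeholders): one Ort::Session shared by several threads, each calling Run() with its own tensors:

#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "parallel_run");
  Ort::Session session(env, L"model.onnx", Ort::SessionOptions{});  // placeholder model

  auto worker = [&session]() {
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::array<int64_t, 4> shape{1, 3, 224, 224};  // assumed NCHW input shape
    std::vector<float> input(1 * 3 * 224 * 224, 0.f);
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem, input.data(), input.size(), shape.data(), shape.size());
    const char* in_names[] = {"input"};    // placeholder names
    const char* out_names[] = {"output"};
    // The same session serves all threads; Run() is re-entrant for the CPU/CUDA EPs.
    auto outputs = session.Run(Ort::RunOptions{nullptr}, in_names, &tensor, 1, out_names, 1);
  };

  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
  for (auto& t : threads) t.join();
}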
Thanks @xadupre, I did try creating one session per thread. But with two sessions, the total time was only reduced by about 30% compared to one session. In my opinion the sessions are independent of each other, so if the GPU still has headroom, two sessions should be about twice as fast as one, and three sessions should scale similarly over two. I do not know what the problem is.
Thanks @pranavsharma. Could this be due to different versions? Mine is Microsoft.ML.OnnxRuntime.DirectML 1.13.1.
I tried this demo with Microsoft.ML.OnnxRuntime 1.16.0 and it works. So does the DirectML ONNX Runtime not support parallel calls to session Run? And if I want to use the GPU with ONNX Runtime to run in parallel, how can I do that?
CPU utilization is not high; the program does nothing but create the tensors and call Run continuously in two threads.
When I use more sessions, GPU utilization is high, but the total time is no shorter. It seems the sessions are not independent of each other.
Can someone help me? |
@hy846130226: If the GPU is saturated for a given block of work, with all cores busy, then calling it from multiple CPU threads will not help. If nodes are blocked on a sequential chain of tensor dependencies, then multiple sessions will not help either. What is more likely to help (if possible with your model architecture) is to throw larger batch sizes at it; see the sketch below. For example, with Stable Diffusion, rather than executing it twice, just pass 2 batches through it. It might also help to look at the workload of your model in Windows PIX to find the hot spots: https://devblogs.microsoft.com/pix/download/
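A minimal sketch of that batching idea (shapes and tensor names are hypothetical, the session is assumed to already exist, and the model must accept a dynamic batch dimension):

#include <onnxruntime_cxx_api.h>
#include <array>
#include <vector>

// One Run() with batch size 2 instead of two Run() calls with batch size 1.
// `batched` must hold both inputs back to back (2 * 3 * 224 * 224 floats here).
std::vector<Ort::Value> RunBatchOf2(Ort::Session& session, std::vector<float>& batched) {
  std::array<int64_t, 4> shape{2, 3, 224, 224};  // assumed NCHW input, batch of 2
  Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value input = Ort::Value::CreateTensor<float>(
      mem, batched.data(), batched.size(), shape.data(), shape.size());
  const char* in_names[] = {"input"};    // placeholder I/O names
  const char* out_names[] = {"output"};
  // The returned outputs[0] holds results for both items; slice along the batch dimension.
  return session.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);
}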
Hi @fdwr, even when I run the two sessions, neither GPU nor CPU utilization is high (GPU: 30%~50%, CPU: 5%). CPU utilization is very low because the data is already in memory; the only work for the CPU is calling session.Run continually. So I don't think they are busy.
At first I suspected that ONNX Runtime applies some optimization that makes two sessions built at the same time influence each other. However, I created sessionA for ModelA and sessionB for ModelB, and both of them still became slow.
Is that a weak point of the DML ONNX Runtime (sessions influencing each other even when GPU utilization is not high)?
Is there any way to raise GPU utilization during inference? Leaving it this low seems wasteful.
I am experiencing the same issue. The sample code, extended to use 8 or more threads with yolov8n.onnx, works fine with the CPU or CUDA EP on Linux, but crashes in various places with DirectML on Windows. This occurs both on the main branch at commit 50b4964 and at tags v1.18.0 and v1.17.0. Is this the expected behavior? I tried to debug it and randomly got any of the following errors:
- In GpuEvent::IsSignaled, the value of the fence is 0xCDCDCD..
- In PooledUploadHeap::AssertInvariants, an assertion fails
- PooledUploadHeap::ReclaimAllocations raises "Exception thrown: read access violation. resourceWrapper.ptr_ was nullptr."
- "Assertion failed: m_capacity == 0, file C:\dev\build_ort\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ReadbackHeap.cpp, line 56"
- "expression back() called on empty vector"
The documentation explicitly states that DirectML does not support multi-threaded calls to Run. This caught us off guard, since there is a lot of material promoting session sharing across threads.
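Given that documented limitation, one stopgap (a sketch, not an official API; the wrapper name is illustrative) is to serialize Run() calls on a DML session behind a mutex, trading parallelism for stability:

#include <onnxruntime_cxx_api.h>
#include <mutex>

// One mutex per DML session; only one Run() may be in flight at a time.
std::mutex g_run_mutex;

void RunSerialized(Ort::Session& session, const Ort::RunOptions& run_options,
                   const char* const* input_names, const Ort::Value* inputs, size_t input_count,
                   const char* const* output_names, Ort::Value* outputs, size_t output_count) {
  std::lock_guard<std::mutex> lock(g_run_mutex);  // serialize access to the DML session
  session.Run(run_options, input_names, inputs, input_count,
              output_names, outputs, output_count);
}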
Describe the issue
I want to call sessions in parallel. At first I created ONE session and called its Run in different threads at the same time; however, it crashes. It seems the Run method does not support concurrent calls.
Here is my code:
// Wrap the pre-allocated input/output buffers (held in sessionCombine) as ORT tensors.
auto inputTensor = Ort::Value::CreateTensor<float>(cpuMemoryInfo, (float*)sessionCombine->Input, inputSize, inputShape.data(), inputShape.size());
auto outputTensor = Ort::Value::CreateTensor<float>(cpuMemoryInfo, (float*)sessionCombine->OutputTemp, outputSize, outputShape.data(), outputShape.size());
// Run synchronously with one input and one pre-bound output.
session->Run(runOptions, inputNames, &inputTensor, 1, outputNames, &outputTensor, 1);
Then I created TWO sessions and distributed them across different threads, but the performance was not as good as I expected.
So my questions are: why does calling Run concurrently on one session crash, and why don't two sessions running in parallel scale as expected?
I use the C++ DML ONNX Runtime.
Thanks!
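For reference, a typical DirectML session setup in the C++ API looks roughly like the following (a sketch; the model path is a placeholder, and the two session-option calls are requirements noted in the DirectML EP documentation):

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "dml_example");
Ort::SessionOptions opts;
opts.DisableMemPattern();               // required by the DirectML EP
opts.SetExecutionMode(ORT_SEQUENTIAL);  // required by the DirectML EP
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(opts, /*device_id*/ 0));
Ort::Session session(env, L"model.onnx", opts);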
To reproduce
Create one session, and call its run methods in different threads at the same time.
Urgency
I am looking forward to a quick response.
Platform
Windows
OS Version
Windows 10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
Microsoft.ML.OnnxRuntime.DirectML
ONNX Runtime API
C++
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
1.13.1
Model File
Don't have
Is this a quantized model?
Yes