
DML EP One session but called in different threads. [Performance] #17686

Open
hy846130226 opened this issue Sep 25, 2023 · 16 comments
Labels
ep:DML (issues related to the DirectML execution provider) · platform:windows (issues related to the Windows platform) · quantization (issues related to quantization)

Comments


Describe the issue

I want to run sessions in parallel. At first I created ONE session and called its Run method from different threads at the same time; however, it crashes. It seems the Run method does not support concurrent calls.

Here is my code:

```cpp
auto inputTensor = Ort::Value::CreateTensor(cpuMemoryInfo, (float*)sessionCombine->Input, inputSize, inputShape.data(), inputShape.size());
auto outputTensor = Ort::Value::CreateTensor(cpuMemoryInfo, (float*)sessionCombine->OutputTemp, outputSize, outputShape.data(), outputShape.size());

session->Run(runOptions, inputNames, &inputTensor, 1, outputNames, &outputTensor, 1);
```

Then I created TWO sessions and distributed them across different threads, but the performance was not as good as I expected.

So my questions are:

  1. How can I call Session::Run in parallel from different threads?
  2. Is building multiple sessions for one model good or bad practice?

I use the C++ DirectML ONNX Runtime.

Thanks!

To reproduce

Create one session and call its Run method from different threads at the same time; a minimal sketch follows.
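A minimal repro sketch (not the reporter's original code; the model path, input shape, and tensor names "input"/"output" are hypothetical placeholders):

```cpp
// Sketch: one shared Ort::Session whose Run() is called concurrently
// from two threads, which is the scenario that crashes with the DML EP.
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");

  Ort::SessionOptions opts;
  opts.DisableMemPattern();               // required when using the DML EP
  opts.SetExecutionMode(ORT_SEQUENTIAL);  // required when using the DML EP
  Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(opts, /*device_id*/ 0));

  Ort::Session session(env, L"model.onnx", opts);  // hypothetical model path

  auto worker = [&session]() {
    auto memInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::array<int64_t, 4> shape{1, 3, 224, 224};  // hypothetical shape
    std::vector<float> input(3 * 224 * 224, 0.0f);
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        memInfo, input.data(), input.size(), shape.data(), shape.size());
    const char* inNames[] = {"input"};   // hypothetical I/O names
    const char* outNames[] = {"output"};
    for (int i = 0; i < 100; ++i) {
      // Concurrent Run() on the same session from both threads.
      auto outputs = session.Run(Ort::RunOptions{nullptr},
                                 inNames, &tensor, 1, outNames, 1);
    }
  };

  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  return 0;
}
```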

Urgency

I am looking forward to a quick response.

Platform

Windows

OS Version

WIN10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

Microsoft.ML.OnnxRuntime.DirectML

ONNX Runtime API

C++

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

1.13.1

Model File

Don't have

Is this a quantized model?

Yes

xadupre (Member) commented Sep 25, 2023

This scenario is not supported. Most of the tensors are unique to each call, but onnxruntime may decide to cache some data to improve performance. It assumes it is called from one thread and that no other call happens at the same time. You need to create one session per thread. If you need to save memory, this page may help you: https://onnxruntime.ai/docs/performance/tune-performance/memory.html#shared-arena-based-allocator.
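For illustration, a minimal sketch of the one-session-per-thread arrangement described above (the model path and thread count are placeholders, not taken from this thread):

```cpp
// Sketch: one Ort::Session per thread for the DML EP. The Ort::Env is
// shared; each session and its Run loop are owned by exactly one thread.
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "per-thread-sessions");
  const int kThreads = 2;  // hypothetical thread count

  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) {
    threads.emplace_back([&env]() {
      Ort::SessionOptions opts;
      opts.DisableMemPattern();               // DML EP requirement
      opts.SetExecutionMode(ORT_SEQUENTIAL);  // DML EP requirement
      Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(opts, 0));

      // This thread exclusively owns its session, so its Run() calls
      // never race with another thread's.
      Ort::Session session(env, L"model.onnx", opts);  // hypothetical path
      // ... create input/output tensors and call session.Run(...) here ...
    });
  }
  for (auto& th : threads) th.join();
  return 0;
}
```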

pranavsharma (Contributor) commented:

@hy846130226 There's no need to create multiple sessions for a model to run it concurrently. ORT is designed such that Run() can be called concurrently from multiple threads using the same session object. Here's sample code to achieve the same: https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069.

fdwr (Contributor) commented Sep 25, 2023

@smk2007

hy846130226 (Author) commented:

> This scenario is not supported. Most of the tensors are unique to each call, but onnxruntime may decide to cache some data to improve performance. It assumes it is called from one thread and that no other call happens at the same time. You need to create one session per thread. If you need to save memory, this page may help you: https://onnxruntime.ai/docs/performance/tune-performance/memory.html#shared-arena-based-allocator.

Thanks @xadupre, I did try creating one session for each thread.

But when I created two sessions, the total time was only reduced by about 30% compared to one session. When I created three sessions, the performance was the same as with two. The GPU still has free memory and its utilization is only 45%.

In my opinion the sessions should be independent of each other: if the GPU still has capacity, two sessions should be about twice as fast as one, and three sessions about twice as fast as two.

I do not know what the problem is.

hy846130226 (Author) commented:

> @hy846130226 There's no need to create multiple sessions for a model to run it concurrently. ORT is designed such that Run() can be called concurrently from multiple threads using the same session object. Here's sample code to achieve the same: https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069.

Thanks @pranavsharma.
My first demo's logic is the same as yours; however, when I start the demo, it crashes.

[screenshot of the crash]

Is it due to different versions?

My version is Microsoft.ML.OnnxRuntime.DirectML 1.13.1.

hy846130226 (Author) commented:

Hi @pranavsharma,

I tried this demo with Microsoft.ML.OnnxRuntime 1.16.0, and it works.

So does the DirectML ONNX Runtime not support calling Session::Run in parallel?

Or, if I want to run in parallel on the GPU with ONNX Runtime, how can I do that?

hy846130226 (Author) commented:


To add to my earlier reply: the CPU utilization is not high; the program does nothing except create the tensors and call Run continuously in two threads.

hy846130226 (Author) commented:

When I use more sessions, the GPU utilization is high, but the total time is not shorter. It seems the sessions are not independent: more sessions slow down each individual session.

hy846130226 (Author) commented:

Can someone help me?

fdwr (Contributor) commented Oct 12, 2023

> When I use more sessions, the GPU utilization is high, but the total time is not shorter.

> When I created three sessions, the performance was the same as with two.

hy846130226: If the GPU is saturated for a given block of work with all cores busy, then calling it from multiple CPU threads will not help. If nodes are blocked by a sequential list of tensor dependencies, then multiple sessions will not help. What is more likely to help (if possible with your model architecture) is to throw large batch sizes at it. For example, with Stable Diffusion, rather than execute it twice, just pass 2 batches through it. It might also help to look at the workload of your model within Windows PIX to see the hot spots: https://devblogs.microsoft.com/pix/download/ .
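For illustration, a sketch of the batching idea suggested above, assuming a model that supports a dynamic batch dimension (the shape and I/O names are hypothetical):

```cpp
// Sketch: replace two concurrent single-item Run() calls with one Run()
// over a batch of 2, letting the GPU process both items in one pass.
#include <onnxruntime_cxx_api.h>

#include <array>
#include <vector>

std::vector<Ort::Value> RunBatched(Ort::Session& session) {
  auto memInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

  const int64_t batch = 2;                           // was: two separate runs
  std::array<int64_t, 4> shape{batch, 3, 224, 224};  // hypothetical shape
  std::vector<float> input(batch * 3 * 224 * 224, 0.0f);

  Ort::Value tensor = Ort::Value::CreateTensor<float>(
      memInfo, input.data(), input.size(), shape.data(), shape.size());

  const char* inNames[] = {"input"};   // hypothetical I/O names
  const char* outNames[] = {"output"};
  // One Run() processes both batch items together.
  return session.Run(Ort::RunOptions{nullptr}, inNames, &tensor, 1, outNames, 1);
}
```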

hy846130226 (Author) commented:

Hi @fdwr, even when I run two sessions, neither the GPU nor the CPU utilization is high (GPU: 30%~50%, CPU: 5%).

The CPU utilization is very low because the data was already loaded into memory beforehand; the only work for the CPU is calling session.Run continually.

So I do not think they are busy.

hy846130226 (Author) commented:

At first I suspected that onnxruntime applies some optimization that makes two sessions created at the same time influence each other.

However, I created sessionA for ModelA and sessionB for ModelB, and both of them still became slow.

hy846130226 (Author) commented:

Is that a weak point of the DML ONNX Runtime (the sessions influence each other even though GPU utilization is not high)?

hy846130226 (Author) commented:

Is there any way to raise the GPU utilization in this situation? I think it is being wasted.

peter899765 commented:

I am experiencing the same issue. The sample code, extended to use 8 or more threads and yolov8n.onnx, works fine with the CPU or CUDA EP on Linux, but crashes in various places with DirectML on Windows.

This occurs both on the main branch at commit 50b4964 and at tags v1.18.0 and v1.17.0.

Is this the expected behavior?

I tried to debug it and randomly got any of the following errors:

- GpuEvent::IsSignaled: value of fence is 0xCDCDCD..
- In PooledUploadHeap::AssertInvariants, the assertion assert(alloc.offsetInChunk + alloc.sizeInBytes <= nextAlloc.offsetInChunk) fails
- PooledUploadHeap::ReclaimAllocations calls _free_dbg(block, _UNKNOWN_BLOCK) in delete_scalar.cpp
- "Exception thrown: read access violation. resourceWrapper.ptr_ was nullptr." in assert(resourceWrapper->GetD3D12Resource()->GetDesc().Width == bucketSize); at BucketizedBufferAllocator::Alloc
- "Assertion failed: m_capacity == 0, file C:\dev\build_ort\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ReadbackHeap.cpp, line 56"
- "expression back() called on empty vector"

oysteinkrog commented:

The documentation explicitly states that DirectML does not support multi-threading of Run:
https://onnxruntime.ai/docs/execution-providers/DirectML-ExecutionProvider.html
"Additionally, as the DirectML execution provider does not support parallel execution, it does not support multi-threaded calls to Run on the same inference session. That is, if an inference session using the DirectML execution provider, only one thread may call Run at a time. Multiple threads are permitted to call Run simultaneously if they operate on different inference session objects."

This caught us off guard, since there is a lot of material promoting session sharing across threads.
@pranavsharma Are you aware of the limitations of the DirectML execution provider?
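For readers who still want to share one session across threads under this documented constraint, a minimal sketch of serializing Run with a per-session mutex (the wrapper type is hypothetical, not an ORT API):

```cpp
// Sketch: serialize Run() on a shared session so only one thread calls it
// at a time, per the DML EP documentation quoted above.
#include <onnxruntime_cxx_api.h>

#include <mutex>
#include <utility>
#include <vector>

class GuardedSession {
 public:
  explicit GuardedSession(Ort::Session&& session) : session_(std::move(session)) {}

  // Callers may invoke this concurrently, but the underlying session
  // only ever sees one Run() at a time.
  std::vector<Ort::Value> Run(const char* const* inNames, const Ort::Value* inputs,
                              size_t inCount, const char* const* outNames,
                              size_t outCount) {
    std::lock_guard<std::mutex> lock(runMutex_);
    return session_.Run(Ort::RunOptions{nullptr}, inNames, inputs, inCount,
                        outNames, outCount);
  }

 private:
  Ort::Session session_;
  std::mutex runMutex_;  // serializes Run() for the DML EP
};
```

Note that this trades away concurrency: runs are queued rather than overlapped, so it avoids the crashes but will not improve throughput over a single thread.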
