
DML EP One session but called in different threads. [Performance] #17686

Open
hy846130226 opened this issue Sep 25, 2023 · 16 comments
Labels
ep:DML (issues related to the DirectML execution provider) · platform:windows (issues related to the Windows platform) · quantization (issues related to quantization)

Comments


Describe the issue

I want to run sessions in parallel. At first I created ONE session and called its Run method from different threads at the same time; however, it crashes. It seems the Run method does not support concurrent calls.

Here is my code:

```cpp
auto inputTensor = Ort::Value::CreateTensor(cpuMemoryInfo, (float*)sessionCombine->Input, inputSize, inputShape.data(), inputShape.size());
auto outputTensor = Ort::Value::CreateTensor(cpuMemoryInfo, (float*)sessionCombine->OutputTemp, outputSize, outputShape.data(), outputShape.size());

session->Run(runOptions, inputNames, &inputTensor, 1, outputNames, &outputTensor, 1);
```

Then I created TWO sessions and distributed them across different threads, but the performance was not as good as I expected.

So my questions are:

  1. How can I call Session::Run in parallel from different threads?
  2. Is building multiple sessions for one model good or bad practice?

I use the C++ DirectML ONNX Runtime.

Thanks!

To reproduce

Create one session and call its Run method from different threads at the same time; a minimal sketch follows.
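A minimal repro sketch (not the reporter's original code; the model path, input shape, and tensor names "input"/"output" are hypothetical placeholders):

```cpp
// Sketch: one shared Ort::Session whose Run() is called concurrently
// from two threads, which is the scenario that crashes with the DML EP.
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");

  Ort::SessionOptions opts;
  opts.DisableMemPattern();               // required when using the DML EP
  opts.SetExecutionMode(ORT_SEQUENTIAL);  // required when using the DML EP
  Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(opts, /*device_id*/ 0));

  Ort::Session session(env, L"model.onnx", opts);  // hypothetical model path

  auto worker = [&session]() {
    auto memInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::array<int64_t, 4> shape{1, 3, 224, 224};  // hypothetical shape
    std::vector<float> input(3 * 224 * 224, 0.0f);
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        memInfo, input.data(), input.size(), shape.data(), shape.size());
    const char* inNames[] = {"input"};   // hypothetical I/O names
    const char* outNames[] = {"output"};
    for (int i = 0; i < 100; ++i) {
      // Concurrent Run() on the same session from both threads.
      auto outputs = session.Run(Ort::RunOptions{nullptr},
                                 inNames, &tensor, 1, outNames, 1);
    }
  };

  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  return 0;
}
```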

Urgency

I am looking forward to a quick response.

Platform

Windows

OS Version

WIN10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

Microsoft.ML.OnnxRuntime.DirectML

ONNX Runtime API

C++

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

1.13.1

Model File

Don't have

Is this a quantized model?

Yes

xadupre (Member) commented Sep 25, 2023

This scenario is not supported. Most of the tensors are unique to each call, but onnxruntime may decide to cache some data to improve performance. It assumes it is called from one thread and that no other call happens at the same time. You need to create one session per thread. If you need to save memory, this page may help you: https://onnxruntime.ai/docs/performance/tune-performance/memory.html#shared-arena-based-allocator.
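For illustration, a minimal sketch of the one-session-per-thread arrangement described above (the model path and thread count are placeholders, not taken from this thread):

```cpp
// Sketch: one Ort::Session per thread for the DML EP. The Ort::Env is
// shared; each session and its Run loop are owned by exactly one thread.
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "per-thread-sessions");
  const int kThreads = 2;  // hypothetical thread count

  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) {
    threads.emplace_back([&env]() {
      Ort::SessionOptions opts;
      opts.DisableMemPattern();               // DML EP requirement
      opts.SetExecutionMode(ORT_SEQUENTIAL);  // DML EP requirement
      Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(opts, 0));

      // This thread exclusively owns its session, so its Run() calls
      // never race with another thread's.
      Ort::Session session(env, L"model.onnx", opts);  // hypothetical path
      // ... create input/output tensors and call session.Run(...) here ...
    });
  }
  for (auto& th : threads) th.join();
  return 0;
}
```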

pranavsharma (Contributor) commented:

@hy846130226 There's no need to create multiple sessions for a model to run it concurrently. ORT is designed such that Run() can be called concurrently from multiple threads using the same session object. Here's sample code to achieve the same: https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069.

fdwr (Contributor) commented Sep 25, 2023

@smk2007

hy846130226 (Author) commented:

> This scenario is not supported. Most of the tensors are unique to each call, but onnxruntime may decide to cache some data to improve performance. It assumes it is called from one thread and that no other call happens at the same time. You need to create one session per thread. If you need to save memory, this page may help you: https://onnxruntime.ai/docs/performance/tune-performance/memory.html#shared-arena-based-allocator.

Thanks @xadupre, I did try creating one session for each thread.

But when I created two sessions, the total time was only reduced by about 30% compared to one session. When I created three sessions, the performance was the same as with two. The GPU still has free memory and its utilization is only 45%.

In my opinion the sessions should be independent of each other: if the GPU still has capacity, two sessions should be about twice as fast as one, and three sessions about twice as fast as two.

I do not know what the problem is.

hy846130226 (Author) commented:

> @hy846130226 There's no need to create multiple sessions for a model to run it concurrently. ORT is designed such that Run() can be called concurrently from multiple threads using the same session object. Here's sample code to achieve the same: https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069.

Thanks @pranavsharma.
My first demo's logic is the same as yours; however, when I start the demo, it crashes.

[screenshot of the crash]

Is it due to different versions?

My version is Microsoft.ML.OnnxRuntime.DirectML 1.13.1.

hy846130226 (Author) commented:

Hi @pranavsharma,

I tried this demo with Microsoft.ML.OnnxRuntime 1.16.0, and it works.

So does the DirectML ONNX Runtime not support calling Session::Run in parallel?

Or, if I want to run in parallel on the GPU with ONNX Runtime, how can I do that?

hy846130226 (Author) commented:


To add to my earlier reply: the CPU utilization is not high; the program does nothing except create the tensors and call Run continuously in two threads.

hy846130226 (Author) commented:

When I use more sessions, the GPU utilization is high, but the total time is not shorter. It seems the sessions are not independent: more sessions slow down each individual session.

hy846130226 (Author) commented:

Can someone help me?

fdwr (Contributor) commented Oct 12, 2023

> When I use more sessions, the GPU utilization is high, but the total time is not shorter.

> When I created three sessions, the performance was the same as with two.

hy846130226: If the GPU is saturated for a given block of work with all cores busy, then calling it from multiple CPU threads will not help. If nodes are blocked by a sequential list of tensor dependencies, then multiple sessions will not help. What is more likely to help (if possible with your model architecture) is to throw large batch sizes at it. For example, with Stable Diffusion, rather than execute it twice, just pass 2 batches through it. It might also help to look at the workload of your model within Windows PIX to see the hot spots: https://devblogs.microsoft.com/pix/download/ .
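For illustration, a sketch of the batching idea suggested above, assuming a model that supports a dynamic batch dimension (the shape and I/O names are hypothetical):

```cpp
// Sketch: replace two concurrent single-item Run() calls with one Run()
// over a batch of 2, letting the GPU process both items in one pass.
#include <onnxruntime_cxx_api.h>

#include <array>
#include <vector>

std::vector<Ort::Value> RunBatched(Ort::Session& session) {
  auto memInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

  const int64_t batch = 2;                           // was: two separate runs
  std::array<int64_t, 4> shape{batch, 3, 224, 224};  // hypothetical shape
  std::vector<float> input(batch * 3 * 224 * 224, 0.0f);

  Ort::Value tensor = Ort::Value::CreateTensor<float>(
      memInfo, input.data(), input.size(), shape.data(), shape.size());

  const char* inNames[] = {"input"};   // hypothetical I/O names
  const char* outNames[] = {"output"};
  // One Run() processes both batch items together.
  return session.Run(Ort::RunOptions{nullptr}, inNames, &tensor, 1, outNames, 1);
}
```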

hy846130226 (Author) commented:

Hi @fdwr, even when I run two sessions, neither the GPU nor the CPU utilization is high (GPU: 30%~50%, CPU: 5%).

The CPU utilization is very low because the data was already loaded into memory beforehand; the only work for the CPU is calling session.Run continually.

So I do not think they are busy.

hy846130226 (Author) commented:

At first I suspected that onnxruntime applies some optimization that makes two sessions created at the same time influence each other.

However, I created sessionA for ModelA and sessionB for ModelB, and both of them still became slow.

hy846130226 (Author) commented:

Is that a weak point of the DML ONNX Runtime (the sessions influence each other even though GPU utilization is not high)?

hy846130226 (Author) commented:

Is there any way to raise the GPU utilization in this situation? I think it is being wasted.

peter899765 commented:

I am experiencing the same issue. The sample code, extended to use 8 or more threads and yolov8n.onnx, works fine with the CPU or CUDA EP on Linux, but crashes in various places with DirectML on Windows.

This occurs both on the main branch at commit 50b4964 and at tags v1.18.0 and v1.17.0.

Is this the expected behavior?

I tried to debug it and randomly got any of the following errors:

- GpuEvent::IsSignaled: value of fence is 0xCDCDCD..
- In PooledUploadHeap::AssertInvariants, the assertion assert(alloc.offsetInChunk + alloc.sizeInBytes <= nextAlloc.offsetInChunk) fails
- PooledUploadHeap::ReclaimAllocations calls _free_dbg(block, _UNKNOWN_BLOCK) in delete_scalar.cpp
- "Exception thrown: read access violation. resourceWrapper.ptr_ was nullptr." in assert(resourceWrapper->GetD3D12Resource()->GetDesc().Width == bucketSize); at BucketizedBufferAllocator::Alloc
- "Assertion failed: m_capacity == 0, file C:\dev\build_ort\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ReadbackHeap.cpp, line 56"
- "expression back() called on empty vector"

oysteinkrog commented:

The documentation explicitly states that DirectML does not support multi-threading of Run:
https://onnxruntime.ai/docs/execution-providers/DirectML-ExecutionProvider.html
"Additionally, as the DirectML execution provider does not support parallel execution, it does not support multi-threaded calls to Run on the same inference session. That is, if an inference session using the DirectML execution provider, only one thread may call Run at a time. Multiple threads are permitted to call Run simultaneously if they operate on different inference session objects."

This caught us off guard, since there is a lot of material promoting session sharing across threads.
@pranavsharma Are you aware of the limitations of the DirectML execution provider?
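For readers who still want to share one session across threads under this documented constraint, a minimal sketch of serializing Run with a per-session mutex (the wrapper type is hypothetical, not an ORT API):

```cpp
// Sketch: serialize Run() on a shared session so only one thread calls it
// at a time, per the DML EP documentation quoted above.
#include <onnxruntime_cxx_api.h>

#include <mutex>
#include <utility>
#include <vector>

class GuardedSession {
 public:
  explicit GuardedSession(Ort::Session&& session) : session_(std::move(session)) {}

  // Callers may invoke this concurrently, but the underlying session
  // only ever sees one Run() at a time.
  std::vector<Ort::Value> Run(const char* const* inNames, const Ort::Value* inputs,
                              size_t inCount, const char* const* outNames,
                              size_t outCount) {
    std::lock_guard<std::mutex> lock(runMutex_);
    return session_.Run(Ort::RunOptions{nullptr}, inNames, inputs, inCount,
                        outNames, outCount);
  }

 private:
  Ort::Session session_;
  std::mutex runMutex_;  // serializes Run() for the DML EP
};
```

Note that this trades away concurrency: runs are queued rather than overlapped, so it avoids the crashes but will not improve throughput over a single thread.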
