
[DML] Hide command list reset latency with multiple threads #20067

Closed
wants to merge 4 commits

Conversation

PatriceVignola (Contributor)

Use multiple threads to hide command list reset latency. Resetting a command list is a blocking operation that can take quite a while to complete on the CPU. The cost isn't really noticeable in workflows that only execute a workload once in a while, but for LLMs, where we need to reset a command list potentially 170 times per second to upload the new tokens to the GPU, it becomes a huge bottleneck.

The logic used here is to always have a ring of 3 command lists. When a command list needs to be reset, we reset it on its own thread and set the next available command list as the one to use. Having 3 command lists takes care of the vast majority of cases, but we can increase the count if it turns out that's still not enough for ultra-low-latency scenarios (e.g. quantized LLMs may need even more than that).
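For readers skimming the diff, here is a minimal standalone sketch of the ring-of-command-lists idea described above. It is not the PR's actual code: `FakeCommandList`, `CommandListRing`, and `Advance` are hypothetical names, and the expensive D3D12 `Reset()` is simulated with a sleep so the example can run on its own.

```cpp
// Illustrative sketch only: a ring of 3 "command lists" whose blocking Reset()
// is pushed to a background thread while the caller rotates to the next list.
#include <array>
#include <chrono>
#include <cstdio>
#include <thread>

struct FakeCommandList {                  // stand-in for a D3D12 command list
    void Reset() {                        // simulate the blocking CPU-side reset
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
};

class CommandListRing {
public:
    FakeCommandList& Current() { return m_ring[m_current]; }

    // Kick off the reset of the current list on its own thread and rotate to
    // the next list so the caller never waits for Reset() on the hot path.
    void Advance() {
        size_t resetting = m_current;
        if (m_resetThreads[resetting].joinable()) m_resetThreads[resetting].join();
        m_resetThreads[resetting] = std::thread([this, resetting] { m_ring[resetting].Reset(); });
        m_current = (m_current + 1) % m_ring.size();
        // The next list must be fully reset before reuse; with a ring of 3 this
        // join only blocks if we lap the ring faster than resets can finish.
        if (m_resetThreads[m_current].joinable()) m_resetThreads[m_current].join();
    }

    ~CommandListRing() {
        for (auto& t : m_resetThreads) if (t.joinable()) t.join();
    }

private:
    std::array<FakeCommandList, 3> m_ring{};
    std::array<std::thread, 3> m_resetThreads{};
    size_t m_current = 0;
};

int main() {
    CommandListRing ring;
    for (int i = 0; i < 10; ++i) {
        // ... record and submit GPU work with ring.Current() ...
        ring.Advance();   // reset happens off the critical path
    }
    std::printf("done\n");
}
```

The ring size of 3 mirrors the description above: one list being recorded, one in flight on the GPU, and one being reset in the background.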

PatriceVignola requested a review from jeffbloo on March 25, 2024

// The newest dirty allocator is now located before the last element in the ring buffer, so start resetting it
m_resetThreads.back() = std::thread([cachedAllocator = m_allocatorRing[m_allocatorRing.size() - 2], cachedCommandList = m_commandListRing[m_commandListRing.size() - 2]]() {
cachedAllocator.completionEvent.WaitForSignal();
Contributor

Does this mean the thread may never exit?

Contributor Author

It will reset when the fence gets updated, which needs to happen eventually.

Contributor

This may work, but let's understand what happens in error conditions, including device removal.
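For context on the device-removal concern, here is a hedged sketch of a common D3D12 pattern (not the change's actual error handling): wait on the fence's completion event with a timeout and periodically query `ID3D12Device::GetDeviceRemovedReason`, so a waiting thread can bail out instead of blocking forever if the device is lost. The function name and timeout value are illustrative assumptions.

```cpp
// Illustrative D3D12 pattern: wait for a fence value with a timeout and
// detect device removal so the waiting thread can exit.
#include <windows.h>
#include <d3d12.h>

// Returns true if the fence reached `value`; false if the device was removed.
bool WaitForFenceOrDeviceRemoval(ID3D12Device* device,
                                 ID3D12Fence* fence,
                                 UINT64 value,
                                 HANDLE event)
{
    while (fence->GetCompletedValue() < value)
    {
        if (FAILED(device->GetDeviceRemovedReason()))
        {
            return false;  // device removed: stop waiting on the fence
        }
        (void)fence->SetEventOnCompletion(value, event);
        WaitForSingleObject(event, 1000);  // 1s timeout, then re-check state
    }
    return true;
}
```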

jeffbloo (Contributor) commented Apr 4, 2024

In the scenario you're optimizing, how much of the work now added to threads was previously parallelized with GPU execution? Is there any opportunity to increase that, whether it's instead of or in addition to this change?

PatriceVignola (Contributor, Author)

> In the scenario you're optimizing, how much of the work now added to threads was previously parallelized with GPU execution? Is there any opportunity to increase that, whether it's instead of or in addition to this change?

Some of it was parallelized, but this change is meant for scenarios where the GPU work isn't enough to completely hide the latency. This happens a lot in LLMs, where we download and upload very small amounts of data between the CPU and GPU many times per second. The GPU -> CPU and CPU -> GPU copies happen in different command lists.

PatriceVignola requested a review from jeffbloo on April 5, 2024
{
ORT_THROW_IF_FAILED(dmlDevice->CreateOperatorInitializer(0, nullptr, IID_PPV_ARGS(&m_initializer)));
ORT_THROW_IF_FAILED(dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&m_recorder)));

m_threadPool = std::make_unique<onnxruntime::concurrency::ThreadPool>(
Contributor

I think we want to re-use other thread pool instances and, in doing so, respect API options for thread count. It may be a good idea to get input from ORT devs on this.
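As a hedged illustration of the "respect API options" point (not part of this PR), the intra-op thread count is already a user-visible knob on the session options, so a provider-owned pool could be sized from, or replaced by, that configuration rather than created with a hardcoded count. The snippet below only shows the caller-side option; how the EP would consume it is left open.

```cpp
// Illustration: the thread count is already an API-level option that an
// EP-owned pool could honor instead of using a hardcoded value.
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "dml-threadpool-demo");
    Ort::SessionOptions options;
    options.SetIntraOpNumThreads(4);   // user-visible knob the EP could respect
    // ... append the DML execution provider and create the session as usual ...
    return 0;
}
```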

{
ORT_THROW_IF_FAILED(dmlDevice->CreateOperatorInitializer(0, nullptr, IID_PPV_ARGS(&m_initializer)));
ORT_THROW_IF_FAILED(dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&m_recorder)));

m_threadPool = std::make_unique<onnxruntime::concurrency::ThreadPool>(
Contributor

Is it possible to bypass the thread pool work if the only submissions occur from fused partitions? I'm assuming that in that case the issue is the fixed overhead of Reset and that CPU/GPU parallelism is still sufficient.

Contributor Author

Currently this is still needed for fused partitions (LLMs only use fused partitions after the first iteration and still benefit from this change). The problem isn't executing multiple different DML command lists, but executing multiple GPU -> CPU copies. Those copies are not big enough to hide the reset latency, so we end up waiting on the reset.

PatriceVignola (Contributor, Author)

Closing the PR. After improving the Python I/O binding and benchmarking, this doesn't seem necessary anymore for my use cases. I may revisit it in the future if needed.
