[DML] Hide command list reset latency with multiple threads #20067
Conversation
…user/pavignol/hide-command-list-latency
onnxruntime/core/providers/dml/DmlExecutionProvider/src/DmlCommandRecorder.cpp
// The newest dirty allocator is now located before the last element in the ring buffer, so start resetting it
m_resetThreads.back() = std::thread([cachedAllocator = m_allocatorRing[m_allocatorRing.size() - 2], cachedCommandList = m_commandListRing[m_commandListRing.size() - 2]]() {
    cachedAllocator.completionEvent.WaitForSignal();
Does this mean the thread may never exit?
It will reset when the fence gets updated, which needs to happen eventually.
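For context, here is a rough sketch of what such a reset thread does, written against the public D3D12 API rather than the EP's internal GpuEvent and ring types (the names and structure below are illustrative assumptions, not the PR's code): the thread blocks until the GPU signals the fence value for the submitted work, then resets the allocator and re-opens the command list, and only then exits.

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Blocks until the GPU has passed fenceValue, then recycles the allocator and
// command list. This is the blocking wait discussed above: the thread exits
// only once the fence reaches fenceValue.
void ResetWhenGpuDone(
    ComPtr<ID3D12Fence> fence,
    UINT64 fenceValue,
    ComPtr<ID3D12CommandAllocator> allocator,
    ComPtr<ID3D12GraphicsCommandList> commandList)
{
    if (fence->GetCompletedValue() < fenceValue)
    {
        HANDLE event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
        fence->SetEventOnCompletion(fenceValue, event);
        WaitForSingleObject(event, INFINITE);
        CloseHandle(event);
    }

    // The GPU no longer references this allocator's memory, so the reset is safe.
    allocator->Reset();
    commandList->Reset(allocator.Get(), nullptr);
}

// Usage, mirroring the ring-buffer code above:
//   m_resetThreads.back() = std::thread(ResetWhenGpuDone, fence, fenceValue, allocator, commandList);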
This may work, but let's understand what happens in error conditions, including device removal.
In the scenario you're optimizing, how much of the work now added to threads was previously parallelized with GPU execution? Is there any opportunity to increase that, whether it's instead of or in addition to this change?
Some of it was parallelized, but this change is meant for scenarios where the GPU work isn't enough to completely hide the latency. This happens a lot in LLMs, where we download and upload very small amounts of data between the CPU and GPU many times per second. The GPU -> CPU and CPU -> GPU copies happen in different command lists.
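On the device-removal question, one possible variant (purely illustrative, not something the PR implements; the timeout value and error handling are assumptions) would bound the wait and poll the removal status so the reset thread can bail out instead of blocking indefinitely:

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Waits for the fence with a timeout and checks for device removal between
// attempts, so a wedged or removed device doesn't leave the reset thread stuck.
HRESULT WaitOrDetectRemoval(
    ComPtr<ID3D12Device> device,
    ComPtr<ID3D12Fence> fence,
    UINT64 fenceValue,
    DWORD timeoutMs)
{
    HANDLE event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(fenceValue, event);

    while (WaitForSingleObject(event, timeoutMs) == WAIT_TIMEOUT)
    {
        // GetDeviceRemovedReason() returns S_OK while the device is alive and a
        // DXGI_ERROR_* failure code once it has been removed.
        HRESULT removedReason = device->GetDeviceRemovedReason();
        if (FAILED(removedReason))
        {
            CloseHandle(event);
            return removedReason;   // let the caller tear down instead of waiting forever
        }
    }

    CloseHandle(event);
    return S_OK;
}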
{
    ORT_THROW_IF_FAILED(dmlDevice->CreateOperatorInitializer(0, nullptr, IID_PPV_ARGS(&m_initializer)));
    ORT_THROW_IF_FAILED(dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&m_recorder)));

    m_threadPool = std::make_unique<onnxruntime::concurrency::ThreadPool>(
I think we want to re-use other thread pool instances and, in doing so, respect API options for thread count. It may be a good idea to get input from ORT devs on this.
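A rough sketch of that suggestion (how the pool pointer would reach the DML EP is an assumption here): rather than constructing a private onnxruntime::concurrency::ThreadPool, the reset work could be handed to an existing pool that was sized according to the session's thread-count options, e.g. via the static ThreadPool::Schedule helper.

#include <functional>
#include "core/platform/threadpool.h"

// Hypothetical helper: schedule the blocking reset work on a shared ORT thread
// pool instead of a pool owned by the command recorder, so the session's
// thread-count options are respected.
void ScheduleCommandListReset(
    onnxruntime::concurrency::ThreadPool* threadPool,   // shared pool (e.g. the session's intra-op pool); may be null
    std::function<void()> resetWork)                    // e.g. wait on the completion event, then Reset()
{
    // Schedule() queues the closure on the pool; with a null pool the work runs
    // on the calling thread, degrading to the old blocking behavior.
    onnxruntime::concurrency::ThreadPool::Schedule(threadPool, std::move(resetWork));
}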
Is it possible to bypass the thread pool work if the only submissions occur from fused partitions? I'm assuming that, in that case, the issue is the fixed overhead of Reset and that CPU/GPU parallelism is still sufficient.
Currently this is still needed for fused partitions (LLMs only use fused partitions after the first iteration and still benefit from it). The problem isn't executing multiple different DML command lists, but executing multiple GPU->CPU copies. Those copies are not big enough to hide latencies, so we end up waiting on the reset.
Closing PR. After improving the Python I/O binding and benchmarking, this doesn't seem necessary anymore for my use cases. May revisit in the future if needed.
Use multiple threads to hide command list reset latency. Resetting a command list is a blocking operation that can take quite a while to complete on the CPU. This isn't really apparent for workflows that only execute a workload once in a while, but for LLMs, where we may need to reset a command list potentially 170 times per second to upload new tokens to the GPU, it becomes a huge bottleneck.
The logic here is to always keep a ring of 3 command lists. When a command list needs to be reset, we reset it on its own thread and switch to the next available command list. A ring of 3 covers the vast majority of cases, but the count can be increased if it turns out not to be enough for ultra-low-latency scenarios (e.g. quantized LLMs may need even more than that).
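A simplified sketch of that ring, with illustrative names rather than the PR's actual types: each submission rotates to the next slot, and the slot that was just submitted is reset on its own thread so the wait on the GPU overlaps with recording into the next command list.

#include <array>
#include <functional>
#include <thread>

// Illustrative ring of command-list "slots"; CommandListState would bundle the
// command list, its allocator, and the completion event of its last submission.
template <typename CommandListState>
class CommandListRing
{
public:
    static constexpr size_t kRingSize = 3;   // 3 covers most cases; can be raised for ultra-low-latency scenarios

    CommandListState& Current() { return m_ring[m_current]; }

    // Call right after submitting the current command list to the queue.
    // resetWhenGpuDone should wait for the GPU to finish with the slot and then
    // reset its allocator/command list (see the earlier sketch in this thread).
    void RotateAfterSubmit(std::function<void(CommandListState&)> resetWhenGpuDone)
    {
        size_t submitted = m_current;
        m_current = (m_current + 1) % kRingSize;

        // Make sure the slot we are switching to has finished its background
        // reset before anything is recorded into it.
        if (m_resetThreads[m_current].joinable())
        {
            m_resetThreads[m_current].join();
        }

        // Reset the just-submitted slot in the background; with 3 slots this
        // usually finishes well before the slot is needed again.
        m_resetThreads[submitted] = std::thread(
            [this, submitted, reset = std::move(resetWhenGpuDone)]() { reset(m_ring[submitted]); });
    }

    ~CommandListRing()
    {
        for (auto& t : m_resetThreads)
        {
            if (t.joinable()) { t.join(); }
        }
    }

private:
    std::array<CommandListState, kRingSize> m_ring{};
    std::array<std::thread, kRingSize> m_resetThreads{};
    size_t m_current = 0;
};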