
[DML] Hide command list reset latency with multiple threads #20067

Closed
wants to merge 4 commits

Conversation

PatriceVignola (Contributor)

Use multiple threads to hide command list reset latency. Resetting a command list is a blocking operation that can take quite a while to complete on the CPU. The cost isn't really noticeable in workflows that only execute a workload once in a while, but for LLMs, where we need to reset a command list potentially 170 times per second to upload the new tokens to the GPU, it becomes a huge bottleneck.

The logic used here is to always have a ring of 3 command lists. When a command list needs to be reset, we reset it on its own thread and set the next available command list as the one to use. Having 3 command lists takes care of the vast majority of cases, but we can increase the count if it turns out that's still not enough for ultra-low-latency scenarios (e.g. quantized LLMs may need even more than that).
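For readers skimming the diff, here is a minimal standalone sketch of the ring-of-command-lists idea described above. It is not the PR's actual code: `FakeCommandList`, `CommandListRing`, and `Advance` are hypothetical names, and the expensive D3D12 `Reset()` is simulated with a sleep so the example can run on its own.

```cpp
// Illustrative sketch only: a ring of 3 "command lists" whose blocking Reset()
// is pushed to a background thread while the caller rotates to the next list.
#include <array>
#include <chrono>
#include <cstdio>
#include <thread>

struct FakeCommandList {                  // stand-in for a D3D12 command list
    void Reset() {                        // simulate the blocking CPU-side reset
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
};

class CommandListRing {
public:
    FakeCommandList& Current() { return m_ring[m_current]; }

    // Kick off the reset of the current list on its own thread and rotate to
    // the next list so the caller never waits for Reset() on the hot path.
    void Advance() {
        size_t resetting = m_current;
        if (m_resetThreads[resetting].joinable()) m_resetThreads[resetting].join();
        m_resetThreads[resetting] = std::thread([this, resetting] { m_ring[resetting].Reset(); });
        m_current = (m_current + 1) % m_ring.size();
        // The next list must be fully reset before reuse; with a ring of 3 this
        // join only blocks if we lap the ring faster than resets can finish.
        if (m_resetThreads[m_current].joinable()) m_resetThreads[m_current].join();
    }

    ~CommandListRing() {
        for (auto& t : m_resetThreads) if (t.joinable()) t.join();
    }

private:
    std::array<FakeCommandList, 3> m_ring{};
    std::array<std::thread, 3> m_resetThreads{};
    size_t m_current = 0;
};

int main() {
    CommandListRing ring;
    for (int i = 0; i < 10; ++i) {
        // ... record and submit GPU work with ring.Current() ...
        ring.Advance();   // reset happens off the critical path
    }
    std::printf("done\n");
}
```

The ring size of 3 mirrors the description above: one list being recorded, one in flight on the GPU, and one being reset in the background.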

PatriceVignola requested a review from jeffbloo on March 25, 2024

// The newest dirty allocator is now located before the last element in the ring buffer, so start resetting it
m_resetThreads.back() = std::thread([cachedAllocator = m_allocatorRing[m_allocatorRing.size() - 2], cachedCommandList = m_commandListRing[m_commandListRing.size() - 2]]() {
cachedAllocator.completionEvent.WaitForSignal();
Contributor

Does this mean the thread may never exit?

Contributor Author

It will reset when the fence gets updated, which needs to happen eventually.

Contributor

This may work, but let's understand what happens in error conditions, including device removal.
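For context on the device-removal concern, here is a hedged sketch of a common D3D12 pattern (not the change's actual error handling): wait on the fence's completion event with a timeout and periodically query `ID3D12Device::GetDeviceRemovedReason`, so a waiting thread can bail out instead of blocking forever if the device is lost. The function name and timeout value are illustrative assumptions.

```cpp
// Illustrative D3D12 pattern: wait for a fence value with a timeout and
// detect device removal so the waiting thread can exit.
#include <windows.h>
#include <d3d12.h>

// Returns true if the fence reached `value`; false if the device was removed.
bool WaitForFenceOrDeviceRemoval(ID3D12Device* device,
                                 ID3D12Fence* fence,
                                 UINT64 value,
                                 HANDLE event)
{
    while (fence->GetCompletedValue() < value)
    {
        if (FAILED(device->GetDeviceRemovedReason()))
        {
            return false;  // device removed: stop waiting on the fence
        }
        (void)fence->SetEventOnCompletion(value, event);
        WaitForSingleObject(event, 1000);  // 1s timeout, then re-check state
    }
    return true;
}
```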

jeffbloo (Contributor) commented Apr 4, 2024

In the scenario you're optimizing, how much of the work now added to threads was previously parallelized with GPU execution? Is there any opportunity to increase that, whether it's instead of or in addition to this change?

PatriceVignola (Contributor, Author)

> In the scenario you're optimizing, how much of the work now added to threads was previously parallelized with GPU execution? Is there any opportunity to increase that, whether it's instead of or in addition to this change?

Some of it was parallelized, but this change is meant for scenarios where the GPU work isn't enough to completely hide the latency. This happens a lot in LLMs, where we download and upload very small amounts of data between the CPU and GPU many times per second. The GPU -> CPU and CPU -> GPU copies happen in different command lists.

PatriceVignola requested a review from jeffbloo on April 5, 2024
{
ORT_THROW_IF_FAILED(dmlDevice->CreateOperatorInitializer(0, nullptr, IID_PPV_ARGS(&m_initializer)));
ORT_THROW_IF_FAILED(dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&m_recorder)));

m_threadPool = std::make_unique<onnxruntime::concurrency::ThreadPool>(
Contributor

I think we want to re-use other thread pool instances and, in doing so, respect API options for thread count. It may be a good idea to get input from ORT devs on this.
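As a hedged illustration of the "respect API options" point (not part of this PR), the intra-op thread count is already a user-visible knob on the session options, so a provider-owned pool could be sized from, or replaced by, that configuration rather than created with a hardcoded count. The snippet below only shows the caller-side option; how the EP would consume it is left open.

```cpp
// Illustration: the thread count is already an API-level option that an
// EP-owned pool could honor instead of using a hardcoded value.
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "dml-threadpool-demo");
    Ort::SessionOptions options;
    options.SetIntraOpNumThreads(4);   // user-visible knob the EP could respect
    // ... append the DML execution provider and create the session as usual ...
    return 0;
}
```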

{
ORT_THROW_IF_FAILED(dmlDevice->CreateOperatorInitializer(0, nullptr, IID_PPV_ARGS(&m_initializer)));
ORT_THROW_IF_FAILED(dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&m_recorder)));

m_threadPool = std::make_unique<onnxruntime::concurrency::ThreadPool>(
Contributor

Is it possible to bypass the thread pool work if the only submissions occur from fused partitions? I'm assuming that in that case the issue is the fixed overhead of Reset and that CPU/GPU parallelism is still sufficient.

Contributor Author

Currently this is still needed for fused partitions (LLMs only use fused partitions after the first iteration and still benefit from this change). The problem isn't executing multiple different DML command lists, but executing multiple GPU -> CPU copies. Those copies are not big enough to hide the reset latency, so we end up waiting on the reset.

PatriceVignola (Contributor, Author)

Closing the PR. After improving the Python I/O binding and benchmarking, this doesn't seem necessary anymore for my use cases. I may revisit it in the future if needed.
