Describe the issue
Hi all, I'm running the Microsoft DirectML NPU inference sample code, which uses ONNX Runtime with DML as the execution provider.
Memory usage keeps increasing slowly in the performance-test loop (see "To reproduce", Part 1).
If I change the input buffer from DML to CPU memory, memory usage increases dramatically (see "To reproduce", Part 2).
Did I miss anything? Thanks.
To reproduce
The issue reproduces every time.
Part 1:
constexpr int fenceValueStart = 2;
constexpr int numIterations = 100000;
for (int i = fenceValueStart; i < (numIterations + fenceValueStart); i++)
{
    session.Run(Ort::RunOptions{ nullptr }, &inputName, &inputTensor, 1, &outputName, &outputTensor, 1);

    // Synchronize with the CPU before queuing more inference runs.
    THROW_IF_FAILED(commandQueue->Signal(fence.Get(), i));
    THROW_HR_IF(E_FAIL, ResetEvent(fenceEvent.get()) == 0);
    THROW_IF_FAILED(fence->SetEventOnCompletion(i, fenceEvent.get()));
    THROW_HR_IF(E_FAIL, WaitForSingleObject(fenceEvent.get(), INFINITE) != WAIT_OBJECT_0);
}
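For context, the fragment above uses inputTensor, fence, and fenceEvent objects created earlier in the sample. A minimal sketch of that setup, paraphrasing the DirectML sample rather than quoting it (the resource-creation details here are assumptions), looks roughly like this:

// Bind the model input to a D3D12 resource through the DML EP's allocation API.
ComPtr<ID3D12Resource> inputResource; // assumed: created via CreateCommittedResource, sized for 1x3x224x224 fp16
void* dmlAllocation = nullptr;
Ort::ThrowOnError(ortDmlApi->CreateGPUAllocationFromD3DResource(inputResource.Get(), &dmlAllocation));

std::vector<int64_t> inputShape = { 1, 3, 224, 224 };
Ort::MemoryInfo dmlMemoryInfo("DML", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
Ort::Value inputTensor = Ort::Value::CreateTensor(
    dmlMemoryInfo, dmlAllocation, static_cast<size_t>(inputResource->GetDesc().Width),
    inputShape.data(), inputShape.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16);

// Fence and manual-reset event used by the CPU/GPU synchronization in the loop above.
ComPtr<ID3D12Fence> fence;
THROW_IF_FAILED(d3dDevice->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence)));
wil::unique_handle fenceEvent(CreateEvent(nullptr, TRUE, FALSE, nullptr));
THROW_HR_IF(E_FAIL, !fenceEvent);

When the tensor is no longer needed, the allocation should be returned with ortDmlApi->FreeGPUAllocation(dmlAllocation).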
Part 2:
int main()
{
    ComPtr<ID3D12Device1> d3dDevice;
    ComPtr<IDMLDevice> dmlDevice;
    ComPtr<ID3D12CommandQueue> commandQueue;
    InitializeDirectML(d3dDevice.GetAddressOf(), commandQueue.GetAddressOf(), dmlDevice.GetAddressOf());

    // Add the DML execution provider to ORT using the DML device and D3D12 command queue created above.
    if (!dmlDevice)
    {
        printf("No NPU device found\n");
        return 1;
    }

    ////////////////////////////////////////
    // Get the API and set up the environment.
    OrtApi const& ortApi = Ort::GetApi(); // Uses ORT_API_VERSION
    const OrtDmlApi* ortDmlApi;
    Ort::ThrowOnError(ortApi.GetExecutionProviderApi("DML", ORT_API_VERSION, reinterpret_cast<const void**>(&ortDmlApi)));
    Ort::Env environment(ORT_LOGGING_LEVEL_WARNING, "DirectML_Direct3D_TensorAllocation_Test"); // Note: ORT_LOGGING_LEVEL_VERBOSE is useful too.

    ////////////////////////////////////////
    // Set model-specific session options.
    Ort::SessionOptions sessionOptions;
    sessionOptions.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL); // For the DML EP
    sessionOptions.DisableMemPattern();                             // For the DML EP
    Ort::ThrowOnError(ortApi.AddFreeDimensionOverrideByName(sessionOptions, "batch_size", 1));
    Ort::ThrowOnError(ortDmlApi->SessionOptionsAppendExecutionProvider_DML1(sessionOptions, dmlDevice.Get(), commandQueue.Get()));
    Ort::Session session(environment, L"mobilenetv2-7-fp16.onnx", sessionOptions);

    // Create a CPU-memory input tensor (fp16 zeros).
    std::vector<Ort::Value> inputTensors;
    std::vector<int64_t> inputShape = { 1, 3, 224, 224 };
    std::vector<float> my_data(3 * 224 * 224, 0.0f);
    std::vector<Ort::Float16_t> inputTensorValues;
    for (auto n : my_data)
        inputTensorValues.push_back(Ort::Float16_t(n));
    Ort::MemoryInfo memoryInfo = Ort::MemoryInfo::CreateCpu(OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault);
    inputTensors.push_back(Ort::Value::CreateTensor<Ort::Float16_t>(memoryInfo, inputTensorValues.data(), inputTensorValues.size(), inputShape.data(), inputShape.size()));
    std::vector<char const*> inputNames = { "input" };
    std::vector<char const*> outputNames = { "output" };

    ////////////////////////////////////////
    // Execute the model with the given inputs and named outputs.
    for (int i = 0; i < 1000000; i++)
    {
        std::vector<Ort::Value> outputs = session.Run(Ort::RunOptions{}, inputNames.data(), inputTensors.data(), inputTensors.size(), outputNames.data(), outputNames.size());
        outputs.clear();
    }

    return 0;
}
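One experiment worth trying with the CPU-input variant (my own suggestion, not part of the original sample): bind the input and output once with Ort::IoBinding and reuse them across iterations. If memory still grows, the leak is clearly not in the per-Run output Ort::Value churn:

// Sketch: reuse one binding across the whole loop instead of creating
// fresh output values on every session.Run call.
Ort::IoBinding binding(session);
binding.BindInput("input", inputTensors[0]);
Ort::MemoryInfo outputMemoryInfo = Ort::MemoryInfo::CreateCpu(OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault);
binding.BindOutput("output", outputMemoryInfo); // ORT allocates the output and reuses it
for (int i = 0; i < 1000000; i++)
{
    session.Run(Ort::RunOptions{}, binding);
}
std::vector<Ort::Value> outputs = binding.GetOutputValues(); // inspect results after the loop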
Hi, for the Part 1 leak, a quick workaround I found was adding a call to ReleaseCompletedReferences() in CommandQueue::QueueReference in CommandQueue.cpp and rebuilding onnxruntime.dll.
Here is the code in detail.
void CommandQueue::QueueReference(IUnknown* object, bool waitForUnsubmittedWork)
{
    // If the CommandQueue is closing, then m_queuedReferences is being cleared -- it is not OK
    // to queue additional references at this time, since those references would be leaked. This
    // affects any objects in m_queuedReferences whose destructors indirectly call QueueReference;
    // for example, an allocation from BucketizedBufferAllocator attempts to queue a reference
    // to its underlying D3D resource when freed. Furthermore, these references are unnecessary
    // since Close() already blocks for scheduled GPU work before clearing m_queuedReferences.
    if (!m_closing)
    {
        QueuedReference queuedReference = {GetLastFenceValue(), object};

        // If something has been recorded into a command list but not submitted yet, it means that the *next* fence
        // value is the one to signal completion.
        if (waitForUnsubmittedWork)
        {
            ++queuedReference.fenceValue;
        }

        m_queuedReferences.push_back(queuedReference);
        ReleaseCompletedReferences(); // new line here
    }
}
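ReleaseCompletedReferences() already exists in CommandQueue.cpp; roughly (paraphrasing the onnxruntime source, so treat this as a sketch rather than an exact quote), it pops every queued reference whose fence value the GPU has already completed:

void CommandQueue::ReleaseCompletedReferences()
{
    // References at or below the completed fence value are no longer needed by
    // in-flight GPU work and can be dropped on the CPU timeline.
    uint64_t completedValue = GetFence()->GetCompletedValue();
    while (!m_queuedReferences.empty() && m_queuedReferences.front().fenceValue <= completedValue)
    {
        m_queuedReferences.pop_front();
    }
}

Calling it from QueueReference keeps m_queuedReferences bounded, instead of letting completed references accumulate until Close() or the next explicit release.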
Urgency
Urgent.
Platform
Windows
OS Version
11 23H2
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.0
ONNX Runtime API
C++
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
DirectML 1.15.0
Other info
CPU: Intel Ultra 7 155U
NPU driver: 32.0.100.2540
GPU driver(intel): 32.0.15.5585