DirectML error: The parameter is incorrect with KBNet S #21583

marovira · 2024-08-01T00:42:18Z

Describe the issue

When trying to run KBNet-S (see here) using ONNXRuntime with DirectML, an error occurs during the creation of the session that reads:

2024-07-31 17:26:51.9481346 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_deb1d9a98dc3fb814563870e4f4b9f20>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime_pybind11_state.pyd!00007FF9E88E1B92: (caller: 00007FF9E88485ED) Exception(3) tid(25d60) 80070057 The parameter is incorrect.

This issue does not appear when trying to run using the CPU execution provider.
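For readers unfamiliar with the trailing `80070057` in the log line: it is a Windows HRESULT, and this particular value is the generic `E_INVALIDARG` ("The parameter is incorrect"). On its own it does not say which parameter was rejected. A minimal sketch of how the value breaks down, where `decode_hresult` is a hypothetical helper written for illustration and not part of ONNX Runtime or DirectML:

```python
def decode_hresult(hex_text):
    """Split a 32-bit HRESULT (given as hex text) into its standard fields."""
    value = int(hex_text, 16)
    return {
        "failure": bool(value >> 31),        # top bit set means failure
        "facility": (value >> 16) & 0x1FFF,  # 7 is FACILITY_WIN32
        "code": value & 0xFFFF,              # Win32 error code when facility is 7
    }


fields = decode_hresult("80070057")
print(fields)  # {'failure': True, 'facility': 7, 'code': 87}
```

Code 87 is the Win32 error `ERROR_INVALID_PARAMETER`, which is why the message is so unspecific.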

To reproduce

Create a new virtual environment and run:

pip install torch onnx onnxruntime-directml

Then copy the following files from the KBNet repo:

Open kbnet_s_arch.py and modify the imports as follows:

# Replace this:
from basicsr.models.archs.kb_utils import KBAFunction
from basicsr.models.archs.kb_utils import LayerNorm2d, SimpleGate

# With this:
from kb_utils import KBAFunction, LayerNorm2d, SimpleGate

Next, download sidd.pth into the same directory from here

Finally, create a new file called test.py with the following code:

from kbnet_s_arch import KBNet_s
import torch
import onnx
import onnxruntime

net = KBNet_s(lightweight=True, ffn_scale=1.5).cpu()
state = torch.load("sidd.pth", weights_only=True)
net.load_state_dict(state["model"])
net.eval()
x = torch.randn((1, 3, 128, 128))
with torch.no_grad():
    out = net(x)

torch.onnx.export(net, x, "sidd.onnx", export_params=True, do_constant_folding=True,
                  input_names=["input"],
                  output_names=["output"])

onnx_model = onnx.load("sidd.onnx")
onnx.checker.check_model(onnx_model) # Note that this passes, so the exported ONNX file is correct.

ort_session = onnxruntime.InferenceSession("sidd.onnx", providers=["DmlExecutionProvider"]) # <- This line fails!

Run it with python3 test.py; session creation fails. If the providers are instead set to providers=["CPUExecutionProvider"] or providers=["CPUExecutionProvider", "DmlExecutionProvider"], the session is created correctly.
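A note on the provider list: provider names are exact, case-sensitive strings (the CPU provider is registered as `CPUExecutionProvider`), and the list is read in order of decreasing priority, so a list with the CPU provider first most likely never exercises DirectML. As far as I know, recent ONNX Runtime releases warn about an unrecognised name and skip it rather than failing. The sketch below is a hypothetical pre-check, not an ONNX Runtime API; `select_providers` and the hard-coded `available` list are assumptions for illustration. In practice `available` would come from `onnxruntime.get_available_providers()`.

```python
def select_providers(preferred, available):
    """Split the preferred providers into usable ones (in priority order) and unknown names."""
    usable = [name for name in preferred if name in available]
    unknown = [name for name in preferred if name not in available]
    return usable, unknown


# Assumed example of what an onnxruntime-directml install might report;
# this list is written by hand here, not queried from the runtime.
available = ["DmlExecutionProvider", "CPUExecutionProvider"]

usable, unknown = select_providers(
    ["DmlExecutionProvider", "CpuExecutionProvider"], available
)
print(usable)   # ['DmlExecutionProvider']
print(unknown)  # ['CpuExecutionProvider']
```

Checking for a non-empty `unknown` list before creating the session makes a misspelled provider name visible instead of silently changing which provider runs the model.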

Urgency

Medium urgency. This is blocking a research task I'm currently working on and the deadline is coming up fast. I can work around the issue for now by using the CPU, but I need the GPU for performance reasons.

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

DirectML 1.14.1.0


marovira · 2024-08-01T00:45:54Z

Additional Info

The issue can also be reproduced using C++ with ONNXRuntime and the DirectML provider. Through debugging, I've discovered that the issue comes from the graph fusion system. Specifically, when it attempts to process /encoders.0/encoders.0.0/MatMul_2, an exception is thrown while trying to create a DML_OPERATOR_GEMM. However, I am unable to determine why the parameters are incorrect.

fdwr · 2024-08-01T03:17:07Z

> the issue is coming from the graph fusion system

I wonder if specifying a lower GraphOptimizationLevel, such as ORT_ENABLE_BASIC, would mitigate the issue until it can be investigated?

marovira · 2024-08-01T03:39:28Z

> I wonder if specifying a lower GraphOptimizationLevel, such as ORT_ENABLE_BASIC, would mitigate the issue until it can be investigated?

I've just confirmed that setting the optimisation level to ORT_ENABLE_BASIC doesn't remove the error message:

2024-07-31 20:34:15.4335912 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_deb1d9a98dc3fb814563870e4f4b9f20>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime_pybind11_state.pyd!00007FF9979F1B92: (caller: 00007FF9979585ED) Exception(3) tid(1ca60) 80070057 The parameter is incorrect.

Interestingly, if I change the optimisation level to ORT_DISABLE_ALL, I get this output instead:

2024-07-31 20:36:05.2756227 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-31 20:36:05.2793056 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-07-31 20:36:07.4558489 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running MatMul node. Name:'/encoders.0/encoders.0.0/MatMul_2' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2468)\onnxruntime_pybind11_state.pyd!00007FF99799F21F: (caller: 00007FF9979A06DA) Exception(3) tid(259d4) 80070057 The parameter is incorrect.

That error message shows the node in which the exception is being thrown, which I mentioned in my previous message.
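For anyone repeating this experiment, the two runs above can be sketched as follows. This is a minimal sketch under assumptions: `create_dml_session` is a hypothetical helper name, and the model path is whatever file was exported earlier. It uses `SessionOptions.graph_optimization_level` and the `GraphOptimizationLevel` enum from the ONNX Runtime Python API.

```python
def create_dml_session(model_path, level_name="ORT_ENABLE_BASIC"):
    """Create a DirectML session at the named graph optimization level.

    level_name is an attribute name of onnxruntime.GraphOptimizationLevel,
    for example "ORT_DISABLE_ALL" or "ORT_ENABLE_BASIC".
    """
    # Imported lazily so the helper can be defined even where the package is missing.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.graph_optimization_level = getattr(ort.GraphOptimizationLevel, level_name)
    return ort.InferenceSession(
        model_path,
        sess_options=options,
        providers=["DmlExecutionProvider"],
    )
```

Calling `create_dml_session("sidd.onnx", "ORT_ENABLE_BASIC")` and then `create_dml_session("sidd.onnx", "ORT_DISABLE_ALL")` would correspond to the two log excerpts quoted in this comment.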

github-actions · 2024-08-31T15:00:45Z

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

marovira · 2024-09-01T18:17:52Z

@fdwr is there any more information I can provide that would help with diagnosing/fixing this?

fdwr · 2024-09-06T00:50:00Z

> @fdwr is there any more information I can provide that would help with diagnosing/fixing this?

Are there any DirectML debug layer messages, if you enable it?

#13330 (comment)
#15255 (comment)

marovira · 2024-09-06T13:40:43Z

Here's the output with the debug layer enabled:

D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4. [ UNKNOWN ERROR #1: STRING_FROM_APPLICATION]
C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE1936B0.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
C:\__w\1\s\Debug\Product\DmlDeviceDebug.cpp(80)\DirectML.Debug.dll!00007FFB58B9EB2E: (caller: 00007FFB5CD2719F) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
] 
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE193B70.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(965)\onnxruntimed.dll!00007FFB5E9E1ABE: (caller: 00007FFB5CCB89BE) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
] [MLOperatorKernel<class Dml::DmlOperatorMatMul>::CreateInstance]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE194F90.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(1081)\onnxruntimed.dll!00007FFB5E98793B: (caller: 00007FFB5CB6070C) ReturnHr(2) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
] [MLOperatorKernelFactory::CreateKernel]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CB6D93F) Exception(3) tid(5f48c) 80070057 The parameter is incorrect.
    [Windows::AI::MachineLearning::Adapter::AbiCustomRegistry::RegisterOperatorKernel::<lambda_4e824286f9d658116fa9a3df675eaad5>::operator ()(kernelFactoryCapture->CreateKernel(kernelInfoWrapper.Get(), kernel.GetAddressOf()))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE195020.

fdwr · 2024-09-06T21:32:18Z

Oooh, so there's a 5D matmul in this model then?

> D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4.
> E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.

Would you know if there's an upload on HuggingFace or elsewhere of the direct .onnx model?

I've not actually encountered a 5D matmul before in ONNX models, and DML_GEMM_OPERATOR_DESC currently accepts tensors of at most 4 dimensions. Supporting this requires either a DirectML.dll update or flattening the leading dimensions down to 4D (when they are not broadcast) before calling DirectML.
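The flattening idea can be illustrated at the shape level. The sketch below is not the DirectML execution provider's code; `flatten_matmul_shapes` is a hypothetical helper, and the example shapes are made up rather than taken from KBNet. It only covers the non-broadcast case: both operands must have identical batch dimensions, which are then merged until the rank fits.

```python
from math import prod


def flatten_matmul_shapes(a_shape, b_shape, max_rank=4):
    """Collapse leading batch dims so both MatMul operands have at most max_rank dims.

    Returns (flat_a_shape, flat_b_shape, original_output_shape).
    """
    if max_rank < 3:
        raise ValueError("max_rank must leave room for at least one batch dimension")
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("both operands need at least 2 dimensions")
    a_batch, b_batch = list(a_shape[:-2]), list(b_shape[:-2])
    if a_batch != b_batch:
        raise ValueError("batch dimensions differ; broadcasting is not handled here")
    if a_shape[-1] != b_shape[-2]:
        raise ValueError("inner dimensions do not match")

    out_shape = a_batch + [a_shape[-2], b_shape[-1]]
    extra = len(a_shape) - max_rank
    if extra > 0:
        # Merge the first extra + 1 batch dims into one so the rank drops to max_rank.
        a_batch = [prod(a_batch[: extra + 1])] + a_batch[extra + 1:]
    flat_a = a_batch + list(a_shape[-2:])
    flat_b = a_batch + list(b_shape[-2:])
    return flat_a, flat_b, out_shape


flat_a, flat_b, out_shape = flatten_matmul_shapes([1, 4, 8, 16, 32], [1, 4, 8, 32, 16])
print(flat_a, flat_b, out_shape)  # [4, 8, 16, 32] [4, 8, 32, 16] [1, 4, 8, 16, 16]
```

A caller would reshape A and B to the flattened shapes, run the 4D matmul, and reshape the result back to `out_shape`. Because only contiguous leading dimensions are merged, the element order in memory is unchanged.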

marovira · 2024-09-06T22:19:08Z

> Would you know if there's an upload on HuggingFace or elsewhere of the direct .onnx model?

No, the authors only provide the PTH files. I'll see if I can post it somewhere so you can download the ONNX file.

Edit:
I've uploaded the ONNX file to Google Drive

fdwr · 2024-09-06T22:30:36Z

> Edit: I've uploaded the ONNX file to Google Drive

~~Thanks - downloading now.~~
Downloaded.
I can repro it locally.

marovira · 2024-09-06T22:58:06Z

Let me know if there's anything else I can do to help.

fdwr · 2024-09-07T05:09:27Z

Reduced to minimal repro, a single operator .onnx file: minimal-repro.zip

Opened bug for DirectML.dll. I'll see what the response is, but in the meantime, we should probably attempt to flatten the leading dimensions when >4D here https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorGemm.cpp#L36. Cheers TotK fan.

john-dance · 2024-10-08T23:42:00Z

See also #21875
This same issue happens with the following models on a Windows on ARM machine.
https://aihub.qualcomm.com/mobile/models/ffnet_54s
https://aihub.qualcomm.com/models/esrgan
https://aihub.qualcomm.com/models/whisper_tiny_en
https://aihub.qualcomm.com/models/mediapipe_hand (MediaPipeHandLandmarkDetector)
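One way to check whether a given model contains the kind of MatMul discussed earlier in this thread (an input with more than 4 dimensions) is to scan the graph after shape inference. This is a sketch only: `tensor_ranks` and `find_high_rank_matmuls` are hypothetical helpers, they rely just on the standard ONNX protobuf fields (`graph.node`, `graph.value_info`, `graph.initializer`), and nothing here confirms that the models listed above fail for this particular reason.

```python
def tensor_ranks(graph):
    """Map tensor names to their rank using inputs, outputs, value_info and initializers."""
    ranks = {}
    for info in list(graph.input) + list(graph.output) + list(graph.value_info):
        ranks[info.name] = len(info.type.tensor_type.shape.dim)
    for init in graph.initializer:
        ranks[init.name] = len(init.dims)
    return ranks


def find_high_rank_matmuls(model, max_rank=4):
    """Return (node_name, input_name, rank) for MatMul inputs whose rank exceeds max_rank."""
    ranks = tensor_ranks(model.graph)
    hits = []
    for node in model.graph.node:
        if node.op_type != "MatMul":
            continue
        for name in node.input:
            # Tensors without shape information are treated as rank 0 and skipped.
            if ranks.get(name, 0) > max_rank:
                hits.append((node.name, name, ranks[name]))
    return hits
```

With the `onnx` package installed, the model would typically be prepared with `onnx.shape_inference.infer_shapes(onnx.load("model.onnx"))` so that intermediate tensors carry shapes, then passed to `find_high_rank_matmuls`. The file name here is a placeholder.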

john-dance · 2024-10-15T22:56:39Z

Note: This seems to have been fixed after upgrading to DirectML.dll 1.15.1. I have verified that the failures I reported above now all work.

marovira · 2024-10-28T23:45:02Z

@fdwr out of interest: is there a way to track the bug that you opened for DirectML?
Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

fdwr · 2024-10-31T01:56:59Z

> @fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

@marovira: It's internal, but I can confirm that a teammate is working on it and looking at the change now. So there shouldn't be a need to update the ORT DML EP once DML supports it directly.

marovira · 2024-10-31T16:51:15Z

> > @fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?
>
> @marovira: It's internal, but I can confirm that a teammate is working on it and looking at the change now. So there shouldn't be a need to update the ORT DML EP once DML supports it directly.

@fdwr That's great! Thanks for letting me know. Looking forward to when the fix is available.

github-actions bot added the ep:DML issues related to the DirectML execution provider label Aug 1, 2024

github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Aug 31, 2024

github-actions bot removed the stale issues that have not been addressed in a while; categorized by a bot label Sep 2, 2024
