DirectML error: The parameter is incorrect with KBNet S #21583

Open
marovira opened this issue Aug 1, 2024 · 17 comments
Labels
ep:DML issues related to the DirectML execution provider

Comments

@marovira
Contributor

marovira commented Aug 1, 2024

Describe the issue

When trying to run KBNet-S (see here) using ONNXRuntime with DirectML, an error occurs during the creation of the session that reads:

2024-07-31 17:26:51.9481346 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_deb1d9a98dc3fb814563870e4f4b9f20>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime_pybind11_state.pyd!00007FF9E88E1B92: (caller: 00007FF9E88485ED) Exception(3) tid(25d60) 80070057 The parameter is incorrect.

This issue does not appear when trying to run using the CPU execution provider.

To reproduce

Create a new virtual environment and run:

pip install torch onnx onnxruntime-directml

Then copy the following files from the KBNet repo:

Open kbnet_s_arch.py and modify the imports as follows:

# Replace this:
from basicsr.models.archs.kb_utils import KBAFunction
from basicsr.models.archs.kb_utils import LayerNorm2d, SimpleGate

# With this:
from kb_utils import KBAFunction, LayerNorm2d, SimpleGate

Next, download sidd.pth into the same directory from here

Finally, create a new file called test.py with the following code:

from kbnet_s_arch import KBNet_s  # module name matches the kbnet_s_arch.py file edited above
import torch
import onnx
import onnxruntime

net = KBNet_s(lightweight=True, ffn_scale=1.5).cpu()
state = torch.load("sidd.pth", weights_only=True)
net.load_state_dict(state["model"])
net.eval()
x = torch.randn((1, 3, 128, 128))
with torch.no_grad():
    out = net(x)

torch.onnx.export(net, x, "sidd.onnx", export_params=True, do_constant_folding=True,
                  input_names=["input"],
                  output_names=["output"])

onnx_model = onnx.load("sidd.onnx")
onnx.checker.check_model(onnx_model) # Note that this passes, so the exported ONNX file is correct.

ort_session = onnxruntime.InferenceSession("sidd.onnx", providers=["DmlExecutionProvider"]) # <- This line fails!

Run with python3 test.py and see that it fails when trying to create the session. If instead the providers are set to providers=["CPUExecutionProvider"] or providers=["CPUExecutionProvider", "DmlExecutionProvider"], the session is created correctly.
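The provider behaviour described above can be turned into a small fallback helper. This is only a sketch: `create_session` and its `session_factory` parameter are hypothetical names introduced here, and it assumes that a failed session initialisation surfaces as a Python exception (as the traceback in this report suggests).

```python
def create_session(model_path, provider_chain, session_factory=None):
    """Try each provider list in order and return (session, providers)
    for the first one that initialises without raising."""
    if session_factory is None:
        # Imported lazily so the helper can be exercised without ONNX Runtime.
        import onnxruntime
        session_factory = onnxruntime.InferenceSession

    errors = []
    for providers in provider_chain:
        try:
            return session_factory(model_path, providers=providers), providers
        except Exception as exc:  # assumption: init failures raise in Python
            errors.append((providers, exc))
    raise RuntimeError(f"no provider list could create a session: {errors}")


# Example (requires onnxruntime-directml and the exported model):
# session, used = create_session(
#     "sidd.onnx",
#     [["DmlExecutionProvider"], ["CPUExecutionProvider"]],
# )
```

The returned provider list shows which configuration actually worked, which makes it easy to log when the GPU path was skipped.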

Urgency

Medium urgency. This is blocking a research task I'm currently working on and the deadline is coming up fast. I can work around the issue for now by using the CPU, but I need the GPU for performance reasons.

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

DirectML 1.14.1.0

@github-actions github-actions bot added the ep:DML issues related to the DirectML execution provider label Aug 1, 2024
@marovira
Contributor Author

marovira commented Aug 1, 2024

Additional Info

The issue can also be reproduced using C++ with ONNXRuntime and the DirectML provider. Through debugging, I've discovered that the issue is coming from the graph fusion system. Specifically, when it attempts to process /encoders.0/encoders.0.0/MatMul_2, an exception is thrown when trying to create a DML_OPERATOR_GEMM. However, I am unable to determine why the parameters are incorrect.
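One way to look more closely at a node like the one named above is to list the input ranks of every MatMul in the exported file. The sketch below assumes the `onnx` package and its shape inference can resolve the shapes in this model; `matmul_input_ranks` and `ranks_over_limit` are hypothetical helper names, and the default limit of 4 is taken from the DirectML debug-layer message quoted later in this thread.

```python
def ranks_over_limit(node_input_ranks, max_rank=4):
    """Keep only the nodes that have at least one input above max_rank.
    Unknown ranks (None) are ignored."""
    return {
        name: ranks
        for name, ranks in node_input_ranks.items()
        if any(r is not None and r > max_rank for r in ranks)
    }


def matmul_input_ranks(model_path):
    """Map each MatMul node name to the ranks of its inputs."""
    # Imported lazily so ranks_over_limit can be used without onnx installed.
    import onnx
    from onnx import shape_inference

    model = shape_inference.infer_shapes(onnx.load(model_path))
    graph = model.graph

    ranks = {}
    for info in list(graph.input) + list(graph.value_info) + list(graph.output):
        tensor_type = info.type.tensor_type
        if tensor_type.HasField("shape"):
            ranks[info.name] = len(tensor_type.shape.dim)
    for init in graph.initializer:
        ranks[init.name] = len(init.dims)

    return {
        (node.name or node.output[0]): [ranks.get(name) for name in node.input]
        for node in graph.node
        if node.op_type == "MatMul"
    }


# Example (requires the exported model):
# print(ranks_over_limit(matmul_input_ranks("sidd.onnx")))
```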

@fdwr
Contributor

fdwr commented Aug 1, 2024

the issue is coming from the graph fusion system

I wonder if specifying a lower GraphOptimizationLevel, such as ORT_ENABLE_BASIC, would mitigate the issue until it can be investigated?

@marovira
Contributor Author

marovira commented Aug 1, 2024

I wonder if specifying a lower GraphOptimizationLevel, such as ORT_ENABLE_BASIC, would mitigate the issue until it can be investigated?

I've just confirmed that setting the optimisation level to ORT_ENABLE_BASIC doesn't remove the error message:

2024-07-31 20:34:15.4335912 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_deb1d9a98dc3fb814563870e4f4b9f20>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime_pybind11_state.pyd!00007FF9979F1B92: (caller: 00007FF9979585ED) Exception(3) tid(1ca60) 80070057 The parameter is incorrect.

Interestingly, if I change the optimisation level to ORT_DISABLE_ALL, I get this output instead:

2024-07-31 20:36:05.2756227 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-31 20:36:05.2793056 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-07-31 20:36:07.4558489 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running MatMul node. Name:'/encoders.0/encoders.0.0/MatMul_2' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2468)\onnxruntime_pybind11_state.pyd!00007FF99799F21F: (caller: 00007FF9979A06DA) Exception(3) tid(259d4) 80070057 The parameter is incorrect.

That error message shows the node in which the exception is being thrown, which I mentioned in my previous message.
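For reference, the optimisation levels tried above can be set from Python roughly as follows. This is a minimal sketch: `make_session` and `LEVELS` are hypothetical names, and the `verbose` flag is assumed to be what the warning about "verbose output" refers to (it sets `log_severity_level` to 0).

```python
# Names of the members of onnxruntime.GraphOptimizationLevel.
LEVELS = (
    "ORT_DISABLE_ALL",
    "ORT_ENABLE_BASIC",
    "ORT_ENABLE_EXTENDED",
    "ORT_ENABLE_ALL",
)


def make_session(model_path, level="ORT_ENABLE_BASIC", verbose=False,
                 providers=("DmlExecutionProvider",)):
    """Create an InferenceSession with an explicit graph optimisation level."""
    if level not in LEVELS:
        raise ValueError(f"unknown optimisation level: {level!r}")

    # Imported lazily so the argument check works without ONNX Runtime.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.graph_optimization_level = getattr(ort.GraphOptimizationLevel, level)
    if verbose:
        options.log_severity_level = 0  # 0 = verbose logging
    return ort.InferenceSession(
        model_path, sess_options=options, providers=list(providers)
    )


# Example (requires onnxruntime-directml and the exported model):
# session = make_session("sidd.onnx", level="ORT_DISABLE_ALL", verbose=True)
```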

Contributor

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Aug 31, 2024
@marovira
Contributor Author

marovira commented Sep 1, 2024

@fdwr is there any more information I can provide that would help with diagnosing/fixing this?

@github-actions github-actions bot removed the stale issues that have not been addressed in a while; categorized by a bot label Sep 2, 2024
@fdwr
Contributor

fdwr commented Sep 6, 2024

@fdwr is there any more information I can provide that would help with diagnosing/fixing this?

Are there any DirectML debug layer messages, if you enable it?

#13330 (comment)
#15255 (comment)

@marovira
Contributor Author

marovira commented Sep 6, 2024

Here's the output with the debug layer enabled:

D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4. [ UNKNOWN ERROR #1: STRING_FROM_APPLICATION]
C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE1936B0.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
C:\__w\1\s\Debug\Product\DmlDeviceDebug.cpp(80)\DirectML.Debug.dll!00007FFB58B9EB2E: (caller: 00007FFB5CD2719F) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
] 
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE193B70.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(965)\onnxruntimed.dll!00007FFB5E9E1ABE: (caller: 00007FFB5CCB89BE) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
] [MLOperatorKernel<class Dml::DmlOperatorMatMul>::CreateInstance]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE194F90.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(1081)\onnxruntimed.dll!00007FFB5E98793B: (caller: 00007FFB5CB6070C) ReturnHr(2) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
] [MLOperatorKernelFactory::CreateKernel]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CB6D93F) Exception(3) tid(5f48c) 80070057 The parameter is incorrect.
    [Windows::AI::MachineLearning::Adapter::AbiCustomRegistry::RegisterOperatorKernel::<lambda_4e824286f9d658116fa9a3df675eaad5>::operator ()(kernelFactoryCapture->CreateKernel(kernelInfoWrapper.Get(), kernel.GetAddressOf()))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE195020.

@fdwr
Contributor

fdwr commented Sep 6, 2024

Oooh, so there's a 5D matmul in this model then?

D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4.
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.

Would you know if there's an upload on HuggingFace or elsewhere of the direct .onnx model?

I've not actually encountered a 5D matmul before in ONNX models, and DML_GEMM_OPERATOR_DESC currently only accepts 4D, requiring either DirectML.dll updates or some flattening to 4D of leading dimensions (if not broadcasted) before calling DirectML.
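The flattening idea can be illustrated at the shape level. The sketch below is not the ORT DML EP implementation; `flatten_matmul_shapes` is a hypothetical helper that folds the leading batch dimensions into one when both operands have the same rank and identical leading dimensions (so no broadcasting happens across the folded part), and returns None otherwise.

```python
from math import prod


def flatten_matmul_shapes(shape_a, shape_b, max_rank=4):
    """Fold leading batch dimensions so both MatMul operands fit in max_rank.

    Returns the (possibly unchanged) pair of shapes, or None when the
    leading dimensions would need broadcasting and cannot simply be folded.
    """
    rank = len(shape_a)
    if len(shape_b) != rank:
        return None  # differing ranks imply broadcasting; not handled here
    if rank <= max_rank:
        return tuple(shape_a), tuple(shape_b)

    fold = rank - max_rank + 1  # number of leading dims merged into one
    if tuple(shape_a[:fold]) != tuple(shape_b[:fold]):
        return None  # leading dims differ, so they would be broadcast

    def fold_shape(shape):
        return (prod(shape[:fold]),) + tuple(shape[fold:])

    return fold_shape(shape_a), fold_shape(shape_b)


# A 5D MatMul of [1, 2, 3, 4, 5] x [1, 2, 3, 5, 6] becomes a 4D one;
# the 4D result [2, 3, 4, 6] is then reshaped back to [1, 2, 3, 4, 6].
print(flatten_matmul_shapes((1, 2, 3, 4, 5), (1, 2, 3, 5, 6)))
```

Folding is safe in this case because the merged dimensions are pure batch dimensions: each batch index selects the same matrix pair before and after the reshape.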

@marovira
Contributor Author

marovira commented Sep 6, 2024

Would you know if there's an upload on HuggingFace or elsewhere of the direct .onnx model?

No, the authors only provide the PTH files. I'll see if I can post it somewhere so you can download the ONNX file.

Edit:
I've uploaded the ONNX file to Google Drive

@fdwr
Contributor

fdwr commented Sep 6, 2024

Edit: I've uploaded the ONNX file to Google Drive

  • Thanks - downloading now.
  • Downloaded.
  • I can repro it locally.


@marovira
Contributor Author

marovira commented Sep 6, 2024

Let me know if there's anything else I can do to help.

@fdwr
Contributor

fdwr commented Sep 7, 2024

Reduced to minimal repro, a single operator .onnx file: minimal-repro.zip


Opened bug for DirectML.dll. I'll see what the response is, but in the meantime, we should probably attempt to flatten the leading dimensions when >4D here https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorGemm.cpp#L36. Cheers TotK fan.

@john-dance

See also #21875
This same issue happens with the following models on a Windows on ARM machine.
https://aihub.qualcomm.com/mobile/models/ffnet_54s
https://aihub.qualcomm.com/models/esrgan
https://aihub.qualcomm.com/models/whisper_tiny_en
https://aihub.qualcomm.com/models/mediapipe_hand (MediaPipeHandLandmarkDetector)

@john-dance

Note: This seems to have been fixed after upgrading to DirectML.dll 1.15.1. I have verified that the failures I reported above now all work.

@marovira
Contributor Author

@fdwr out of interest: is there a way to track the bug that you opened for DirectML?
Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

@fdwr
Contributor

fdwr commented Oct 31, 2024

@fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

@marovira: It's internal, but I can confirm that a teammate is working on it and looking at the change now. So there shouldn't be a need to update the ORT DML EP once DML directly supports it.

@marovira
Contributor Author

@fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

@marovira: It's internal, but I can confirm that a teammate is working on it and looking at the change now. So there shouldn't be a need to update the ORT DML EP once DML directly supports it.

@fdwr That's great! Thanks for letting me know. Looking forward to when the fix is available.
