
[GPU][FP16][NVIDIA L4][SM89] Inference failed because of missing fp16 kernel with specific values of "s" and "d" #19259

Closed
claeyzre opened this issue Jan 24, 2024 · 4 comments
Labels
ep:CUDA (issues related to the CUDA execution provider)
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
platform:windows (issues related to the Windows platform)

Comments

@claeyzre

Describe the issue

While trying to run inference on a model on new NVIDIA hardware (the L4 graphics card), I encounter an error because of a missing kernel.

Status Message: D:\a\_work\1\s\onnxruntime\contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/fused_multihead_attention_v2.h:3189 onnxruntime::contrib::cuda::FusedMultiHeadAttentionXMMAKernelV2::run findIter != mFunctions.end() was false. Failed to find kernel: s=256 d=32 interleaved=0 forceUnroll=0 withRelativePositionBias=0 flash_attention=0 causal_mask=0
Was the plugin compiled on a compatible CUDA and SM version?
	 Compiled on CUDA 11080
	 Current SM version: 89

A simple search in the ORT repository shows there is indeed no kernel for that SM version (89) with the specified values of s and d. I can run this ONNX model just fine on my own GPU (which is not as new), as well as on the previous NVIDIA GPU in this line, the T4 card.

I understand that these kernels are generated, but I don't know how to do this, or where to request the generation, so that this model runs correctly. Would it be possible to generate them?

To reproduce

I'll try to provide a model shortly, but I am not sure a model is needed for this issue to be solved.

Urgency

No response

Platform

Windows

OS Version

11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

C#

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

11.8

github-actions bot added the ep:CUDA, model:transformer, and platform:windows labels on Jan 24, 2024
@tianleiwu
Contributor

SM 89 can also use the SM 80 kernels. For SM 80, there is an fp16 kernel for S=256, D=32. Let me try to fix it by changing the code.

@LoicDagnas

LoicDagnas commented Jan 29, 2024

Hello @tianleiwu,
I have the same problem on my side. Any news concerning this issue? 🙏🏼

@tianleiwu
Contributor

That fallback logic already exists in the code:

if (mSM == kSM_86 || mSM == kSM_89) {
  loadXMMAKernels(kSM_80);
}

I could not reproduce the issue with an RTX 4090 GPU, which is also SM89. The corresponding kernel can be loaded (see the attached screenshot).

I tried an ONNX model for all-MiniLM-L6-v2 with the Python API; the model has D=32, and it works well with sequence length S=256.
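
A minimal sketch of that kind of check, assuming a typical BERT-style fp16 ONNX export of all-MiniLM-L6-v2 (the file name and input names below are illustrative, not the exact files used here):

```python
import numpy as np
import onnxruntime as ort

# Illustrative file name for an fp16 export of all-MiniLM-L6-v2.
sess = ort.InferenceSession(
    "all-MiniLM-L6-v2_fp16.onnx",
    providers=["CUDAExecutionProvider"],
)

batch, seq_len = 1, 256  # S=256, the shape from the error message
feed = {
    "input_ids": np.ones((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch, seq_len), dtype=np.int64),
}
outputs = sess.run(None, feed)
print([o.shape for o in outputs])
```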

Did you see a message in the console like "skip loading trt fused attention kernel fmha_v2_fp16_256_32_sm80_kernel because no enough shared memory"?
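
If not, raising the session log verbosity should make such messages easier to spot; a minimal sketch assuming the Python API (the model path is illustrative):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE, the most detailed ONNX Runtime logging level

sess = ort.InferenceSession(
    "model_fp16.onnx",  # illustrative path to the fp16 model
    sess_options=so,
    providers=["CUDAExecutionProvider"],
)
```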

@claeyzre
Author

It turns out that we were not on 1.16.3 for the test but on an older version.
We proceeded with the upgrade and it works fine.
Sorry for the disturbance, and thank you very much for your time.
