
[GPU][FP16][NVIDIA L4][SM89] Inference failed because of missing fp16 kernel with specific values of "s" and "d" #19259

Closed
claeyzre opened this issue Jan 24, 2024 · 4 comments
Labels
ep:CUDA (issues related to the CUDA execution provider)
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
platform:windows (issues related to the Windows platform)

Comments

@claeyzre

Describe the issue

While trying to run inference on a model on new NVIDIA hardware (the L4 graphics card), I encounter an error because of a missing kernel.

Status Message: D:\a\_work\1\s\onnxruntime\contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/fused_multihead_attention_v2.h:3189 onnxruntime::contrib::cuda::FusedMultiHeadAttentionXMMAKernelV2::run findIter != mFunctions.end() was false. Failed to find kernel: s=256 d=32 interleaved=0 forceUnroll=0 withRelativePositionBias=0 flash_attention=0 causal_mask=0
Was the plugin compiled on a compatible CUDA and SM version?
	 Compiled on CUDA 11080
	 Current SM version: 89

A simple search in the ORT repository shows there is indeed no kernel for that SM version (89) with the specified values of s and d. I can run this ONNX model just fine on my own GPU (which is not as new), as well as on the previous NVIDIA GPU in this line, the T4 card.

I understand that these kernels are generated, but I don't know how to do this, or where to request the generation, so that this model runs correctly. Would it be possible to generate them?

To reproduce

I'll try to provide a model shortly, but I am not sure a model is needed for this issue to be solved.

Urgency

No response

Platform

Windows

OS Version

11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

C#

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

11.8

github-actions bot added the ep:CUDA, model:transformer, and platform:windows labels on Jan 24, 2024
@tianleiwu
Contributor

SM 89 can also use the SM 80 kernels. For SM 80, there is an fp16 kernel for S=256, D=32. Let me try to fix it by changing the code.

@LoicDagnas

LoicDagnas commented Jan 29, 2024

Hello @tianleiwu,
I have the same problem on my side. Any news concerning this issue? 🙏🏼

@tianleiwu
Contributor

That fallback logic already exists in the code:

if (mSM == kSM_86 || mSM == kSM_89) {
  loadXMMAKernels(kSM_80);
}

I could not reproduce the issue with an RTX 4090 GPU, which is also SM89. The corresponding kernel can be loaded (see the attached screenshot).

I tried an ONNX model for all-MiniLM-L6-v2 with the Python API; the model has D=32, and it works well with sequence length S=256.
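
A minimal sketch of that kind of check, assuming a typical BERT-style fp16 ONNX export of all-MiniLM-L6-v2 (the file name and input names below are illustrative, not the exact files used here):

```python
import numpy as np
import onnxruntime as ort

# Illustrative file name for an fp16 export of all-MiniLM-L6-v2.
sess = ort.InferenceSession(
    "all-MiniLM-L6-v2_fp16.onnx",
    providers=["CUDAExecutionProvider"],
)

batch, seq_len = 1, 256  # S=256, the shape from the error message
feed = {
    "input_ids": np.ones((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch, seq_len), dtype=np.int64),
}
outputs = sess.run(None, feed)
print([o.shape for o in outputs])
```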

Did you see a message in the console like "skip loading trt fused attention kernel fmha_v2_fp16_256_32_sm80_kernel because no enough shared memory"?
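
If not, raising the session log verbosity should make such messages easier to spot; a minimal sketch assuming the Python API (the model path is illustrative):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE, the most detailed ONNX Runtime logging level

sess = ort.InferenceSession(
    "model_fp16.onnx",  # illustrative path to the fp16 model
    sess_options=so,
    providers=["CUDAExecutionProvider"],
)
```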

@claeyzre
Author

It turns out that we were not on 1.16.3 for the test but on an older version.
We proceeded with the upgrade and it works fine.
Sorry for the disturbance, and thank you very much for your time.
