
DMMHA: add unit tests; fix CPU, CUDA kernel #22567

Merged
mindest merged 16 commits into main from linmin/dmmha_test on Nov 2, 2024
Conversation

mindest (Contributor) commented Oct 23, 2024

Description

Fixes:
(1) CPU kernel: apply scale before adding bias and mask, like other MHA ops.
(2) CPU kernel: correct the offset when appending past to present.
(3) CUDA kernel: apply the mask if provided; fix the output_qk offset.

Add DMMHA unit tests.
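
For context, a minimal sketch of fix (1) with hypothetical names (not the actual kernel code): the raw Q·K dot product is scaled first, then the attention bias is added, then the mask is applied, matching the other MHA ops.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Illustrative sketch only (hypothetical names, not the ORT kernel symbols):
// compute one row of attention scores in the corrected order used by other
// MHA ops -- scale the raw dot product first, then add bias, then mask.
void ComputeScoresRow(const float* q, const float* k, const float* attn_bias,
                      const int32_t* mask, int total_seq_len, int head_size,
                      float scale, std::vector<float>& scores) {
  scores.resize(total_seq_len);
  for (int t = 0; t < total_seq_len; ++t) {
    float dot = 0.0f;
    for (int d = 0; d < head_size; ++d) {
      dot += q[d] * k[t * head_size + d];
    }
    float s = dot * scale;                        // (1) scale first
    if (attn_bias != nullptr) s += attn_bias[t];  // (2) then add the attention bias
    if (mask != nullptr && mask[t] == 0) {        // (3) then mask out padded positions
      s = std::numeric_limits<float>::lowest();
    }
    scores[t] = s;
  }
}
```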

docs/ContribOperators.md (outdated review thread, resolved)
hariharans29 (Member) commented

When the PR is ready, could you please update the PR title and description to better reflect the problem and the fix? Thanks.

mindest changed the title from "[WIP] Add DMMHA unit tests with fix" to "DMMHA: add unit tests; fix CPU, CUDA kernel" on Oct 28, 2024
mindest (Contributor, Author) commented Oct 28, 2024

@hariharans29, is it true that for the cross-attention CUDA kernel the key layout is also reordered to [B, H, head_size/x, L, x],

// The layout of the cache is [B, H, head_size/x, L, x] with x == 4/8/16 for FP32/FP16/FP8. Since each thread
// owns x elements, we have to decompose the linear index into chunks of x values and the posi-
// tion of the thread in that chunk.

instead of BNSH?

.Input(1,
       "key",
       "Key with shape (batch_size, 1, hidden_size) for self attention "
       "or past_key with shape (batch_size, num_heads, kv_sequence_length, head_size) for cross attention",
       "T",
       OpSchema::Optional)
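
For illustration only (my own sketch, not code from the repository), the plain BNSH indexing and the chunked [B, H, head_size/x, L, x] layout differ as follows, where L is the cache's max sequence length and x is 4/8/16 for FP32/FP16/FP8:

```cpp
#include <cstddef>

// Element (b, h, s, d) in a BNSH-layout cache.
inline size_t IndexBNSH(size_t b, size_t h, size_t s, size_t d,
                        size_t num_heads, size_t L, size_t head_size) {
  return ((b * num_heads + h) * L + s) * head_size + d;
}

// Same element in the chunked [B, H, head_size/x, L, x] layout: the head
// dimension is decomposed into (chunk = d / x, offset = d % x) so that each
// thread can read x consecutive values for one time step.
inline size_t IndexChunked(size_t b, size_t h, size_t s, size_t d,
                           size_t num_heads, size_t L, size_t head_size, size_t x) {
  return (((b * num_heads + h) * (head_size / x) + d / x) * L + s) * x + d % x;
}
```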

@mindest mindest marked this pull request as ready for review October 28, 2024 17:48
mindest (Contributor, Author) commented Oct 29, 2024

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).

mindest (Contributor, Author) commented Oct 29, 2024

/azp run ONNX Runtime Web CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).

kunal-vaishnavi (Contributor) commented

@hariharans29, is it true that for the cross-attention CUDA kernel the key layout is also reordered to [B, H, head_size/x, L, x],

// The layout of the cache is [B, H, head_size/x, L, x] with x == 4/8/16 for FP32/FP16/FP8. Since each thread
// owns x elements, we have to decompose the linear index into chunks of x values and the posi-
// tion of the thread in that chunk.

instead of BNSH?

.Input(1,
       "key",
       "Key with shape (batch_size, 1, hidden_size) for self attention "
       "or past_key with shape (batch_size, num_heads, kv_sequence_length, head_size) for cross attention",
       "T",
       OpSchema::Optional)

For self-attention, parameters.k_cache = present_key_data = past_key_data since past_present_share_buffer = true. For cross-attention, parameters.k_cache = key_data.

  if (past_key == nullptr && present_key == nullptr) {
    if (attention_bias != nullptr) {
      return ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
                             "DecoderMaskedMultiHeadAttention does not support attention bias for cross-attention");
    }
    parameters.is_cross_attention = true;
    parameters.total_sequence_length = parameters.kv_sequence_length;
    parameters.max_sequence_length = parameters.kv_sequence_length;
    // parameters.k and parameters.v are nullptr
    parameters.k_cache = const_cast<T1*>(key->Data<T1>());
    parameters.v_cache = const_cast<T1*>(value->Data<T1>());
    parameters.k_bias = nullptr;
    parameters.v_bias = nullptr;
  } else {
    // Sanity check
    ORT_ENFORCE(past_present_share_buffer_);
    ORT_ENFORCE(past_key != nullptr && past_value != nullptr);
    auto* present_key_data = present_key->MutableData<T1>();
    auto* present_value_data = present_value->MutableData<T1>();
    auto* past_key_data = past_key->Data<T1>();
    auto* past_value_data = past_value->Data<T1>();
    // No production use-case will incur this copy cost as the implementation of
    // GreedySearch/BeamSearch is written in such a way that the past and present buffers
    // will be shared.
    // This is just to circumvent the OpTester's limitation of not being able to bind a specific
    // buffer to inputs/outputs.
    if (present_key_data != past_key_data) {
      CUDA_RETURN_IF_ERROR(cudaMemcpyAsync(present_key_data, past_key_data, past_key->SizeInBytes(),
                                           cudaMemcpyDeviceToDevice, cuda_stream));
    }
    if (present_value_data != past_value_data) {
      CUDA_RETURN_IF_ERROR(cudaMemcpyAsync(present_value_data, past_value_data, past_value->SizeInBytes(),
                                           cudaMemcpyDeviceToDevice, cuda_stream));
    }
    parameters.is_cross_attention = false;
    bool is_packed_qkv = (key == nullptr && value == nullptr);
    parameters.is_packed_qkv = is_packed_qkv;
    parameters.k = is_packed_qkv
                       ? const_cast<T1*>(query->Data<T1>() + parameters.hidden_size)
                       : const_cast<T1*>(key->Data<T1>());
    parameters.v = is_packed_qkv
                       ? const_cast<T1*>(query->Data<T1>() + 2 * static_cast<size_t>(parameters.hidden_size))
                       : const_cast<T1*>(value->Data<T1>());
    parameters.k_cache = present_key_data;
    parameters.v_cache = present_value_data;
  }

I believe parameters.k_cache should be reordered before the kernel is launched, so past_key is already reordered for self-attention and key is already reordered for cross-attention. This behavior would match the comments below.

#ifdef USE_CUDA
    // Here we only need to reorder the past key for self-attention and cross-attention.
    for (size_t i = 0; i < 2 * static_cast<size_t>(decoder_subgraph_.num_layers); ++i) {
      ORT_RETURN_IF_ERROR(reorder_past_state_func_(cuda_device_prop_,
                                                   *decoder_feeds[offset + 2 * i].GetMutable<Tensor>(),
                                                   beam_state.staging_for_past_state_reorder,
                                                   this->ort_stream_));
    }
    size_t cache_indir_input_offset = static_cast<size_t>(decoder_subgraph_.GetFirstPastInputIndex()) + 4 * static_cast<size_t>(decoder_subgraph_.num_layers) + 2;
    ORT_RETURN_IF_ERROR(init_cache_indir_func_(*decoder_feeds[cache_indir_input_offset].GetMutable<Tensor>(), this->ort_stream_));
#endif
  }
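
To make the reorder-before-launch point concrete, here is a rough host-side sketch of what such a repack amounts to (my own illustration with hypothetical names; the real reorder_past_state_func_ is a CUDA kernel with a different signature):

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: repack a BNSH-layout key cache into the chunked
// [B, H, head_size/x, L, x] layout that the decoding kernel reads.
void ReorderKeyCache(const std::vector<float>& bnsh, std::vector<float>& chunked,
                     size_t B, size_t H, size_t L, size_t head_size, size_t x) {
  chunked.resize(bnsh.size());
  for (size_t b = 0; b < B; ++b)
    for (size_t h = 0; h < H; ++h)
      for (size_t s = 0; s < L; ++s)
        for (size_t d = 0; d < head_size; ++d) {
          size_t src = ((b * H + h) * L + s) * head_size + d;                          // BNSH
          size_t dst = (((b * H + h) * (head_size / x) + d / x) * L + s) * x + d % x;  // chunked
          chunked[dst] = bnsh[src];
        }
}
```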

tianleiwu (Contributor) commented Oct 31, 2024

@kunal-vaishnavi, what is the reason this op needs cross attention? I think cross attention should be supported by MHA, and this op should be for decoding only. That would make the logic clearer.

mindest (Contributor, Author) commented Oct 31, 2024

To #22567 (comment): Makes sense to me; in the tests I also use the reordered key as input for cross attention. Maybe I should also add some comments in

.Input(1,
       "key",
       "Key with shape (batch_size, 1, hidden_size) for self attention "
       "or past_key with shape (batch_size, num_heads, kv_sequence_length, head_size) for cross attention",
       "T",
       OpSchema::Optional)

so that it is clear that the key in cross-attention is also reordered for the CUDA EP.

mindest (Contributor, Author) commented Oct 31, 2024

Thanks @tianleiwu @kunal-vaishnavi for the review! Is this PR ready to merge if I move the following changes to another PR?

  • Support float16 for out_qk.
  • Add comments explaining why it is sum_tlength + 1 instead of tlength (for cross-attn the total length is kv_seq_len, for self-attn it is past_len + 1); see the sketch after this list.
  • Update the schema comments of input 1 (key) for cross-attn.
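
A minimal sketch of the second point, using a hypothetical helper rather than the actual kernel code:

```cpp
// Hypothetical helper mirroring the comment above: the total number of key
// positions attended to in one decoding step.
int TotalLength(bool is_cross_attention, int kv_sequence_length, int past_sequence_length) {
  // Cross-attention: the encoder key/value are fixed, so the total length is kv_sequence_length.
  // Self-attention: the current token is appended to the past, so it is past_sequence_length + 1.
  return is_cross_attention ? kv_sequence_length : past_sequence_length + 1;
}
```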

kunal-vaishnavi (Contributor) commented

@kunal-vaishnavi, what is the reason this op needs cross attention? I think cross attention should be supported by MHA, and this op should be for decoding only. That would make the logic clearer.

Whisper uses alternating layers of self-attention and cross-attention during decoding.

mindest (Contributor, Author) commented Nov 2, 2024

Thanks @tianleiwu, @kunal-vaishnavi, @hariharans29!

mindest merged commit 4ffc1ff into main on Nov 2, 2024
74 checks passed
mindest deleted the linmin/dmmha_test branch on November 2, 2024 13:05
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request on Nov 19, 2024.
ankitm3k pushed three commits to intel/onnxruntime that referenced this pull request on Dec 11, 2024.