[CUDA] Fix performance bug in DecoderMaskedMultiheadAttention for BeamSearch #17613
Description
Minor change: move a bounds-checking `if` outside some loops, and hoist a global memory read out of an unrolled `for` loop. Reading it inside the loop is redundant because `ti` never changes across iterations. This looks like a case the compiler should optimize by moving the load out of the loop, but it doesn't seem to; instead it emits instructions for multiple redundant global memory reads.
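For illustration, here is a minimal, hedged CUDA sketch of the pattern; the kernel and variable names (`SumBefore`, `beam_indices`, `cache`, `ti`) are hypothetical placeholders, not the actual DecoderMaskedMultiheadAttention source:

```cuda
// Minimal sketch of the fix pattern, assuming hypothetical names
// (beam_indices, cache, ti); not the actual
// DecoderMaskedMultiheadAttention source.

// Before: the loop-invariant bounds check and global read sit inside the
// unrolled loop, so the compiler emits a redundant load per iteration.
template <int N>
__global__ void SumBefore(const int* __restrict__ beam_indices,
                          const float* __restrict__ cache,
                          float* out, int ti, int length) {
  float acc = 0.0f;
#pragma unroll
  for (int i = 0; i < N; ++i) {
    if (ti < length) {                    // invariant: ti never changes here
      const int beam = beam_indices[ti];  // redundant global read per iteration
      acc += cache[beam * N + i];
    }
  }
  out[threadIdx.x] = acc;
}

// After: hoist the check and the load; the loop body only touches data
// that actually varies with i, so there is a single global read of
// beam_indices[ti].
template <int N>
__global__ void SumAfter(const int* __restrict__ beam_indices,
                         const float* __restrict__ cache,
                         float* out, int ti, int length) {
  float acc = 0.0f;
  if (ti < length) {                      // checked once
    const int beam = beam_indices[ti];    // loaded once
#pragma unroll
    for (int i = 0; i < N; ++i) {
      acc += cache[beam * N + i];
    }
  }
  out[threadIdx.x] = acc;
}
```

Illustrative profiling from HuggingFace GPT-2: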
Before instructions:
There are more such redundant reads than shown in the snippet; they are interleaved with other independently executing instructions due to the compiler's rearrangement of instructions.
Before metrics:
After instructions:
We have one load instead of the many above (not pasted, for conciseness).
After metrics:
Commentary about metrics:
As you can see, the number of cycles has decreased, latency has decreased (from about 8.x microseconds to about 7.x microseconds), and memory throughput has gone down because we reduce the global memory reads manifold. This bug didn't impact perf much because the redundant reads mostly happen in parallel (once the loop is unrolled), and I think the kernel wasn't nearing the global memory bandwidth ceiling yet. In any case, the ultimate perf impact will vary from model to model (based on head size, number of layers, etc.).
Motivation and Context
Fixes a performance bug in DecoderMaskedMultiheadAttention (used by BeamSearch).