
[CUDA] Fix performance bug in DecoderMaskedMultiheadAttention for BeamSearch #17613

Merged

merged 1 commit into main from hari/main_opt on Sep 20, 2023

Conversation

hariharans29
Member

@hariharans29 hariharans29 commented Sep 19, 2023

Description

  1. Minor change: move a bounds-checking `if` outside some loops

  2. Hoist a global memory read out of an unrolled for loop. Reading it inside the loop is redundant because `ti` never changes within the loop. This looks like a case the compiler should optimize by moving the load out of the loop, but the compiler doesn't move it out and instead emits instructions for multiple redundant global memory reads

[Profiling screenshot: illustrative profiling from Hugging Face GPT-2]

Before instructions:

There are more such redundant reads than shown in the snippet; they are interleaved with other independently executing instructions due to the compiler's re-arrangement of instructions

[Profiling screenshot: redundant load instructions]

Before metrics:

[Profiling screenshot: metrics before the change]

After instructions:
We have one load instead of the many above (not pasted, for conciseness)

After metrics:
[Profiling screenshot: metrics after the change]

Commentary about metrics:
As you can see, the cycle count has decreased, latency has decreased (from about 8.x microseconds to 7.x microseconds), and memory throughput has gone down because we reduce the global memory reads many-fold. This bug didn't impact perf much because the redundant reads mostly happen in parallel (once the loop is unrolled) and I think the kernel wasn't nearing the global memory bandwidth ceiling yet. In any case, the ultimate perf impact will vary from model to model (based on head size and number of layers)

Motivation and Context

Fix perf bug in decoder masked multihead attention

@hariharans29 hariharans29 changed the title Fix performance bug in DecoderMaskedMultiheadAttention [CUDA] Fix performance bug in DecoderMaskedMultiheadAttention for BeamSearch Sep 19, 2023
@hariharans29 hariharans29 merged commit c65e892 into main Sep 20, 2023
86 checks passed
@hariharans29 hariharans29 deleted the hari/main_opt branch September 20, 2023 17:35
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024