Extend Attention Bias Broadcast Support #21710

Merged
14 commits merged into main from tlwu/mha_attn_bias on Aug 16, 2024

Conversation

tianleiwu (Contributor) commented on Aug 12, 2024

Description

Previously, MultiHeadAttention supported a relative position bias of shape [1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention supported only [1, N, S, T]. This PR extends the support to [1, N, S, T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for the CUDA and CPU EPs.
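
The broadcast semantics can be illustrated with a small NumPy sketch (an illustration only, not code from this repository); B is the batch size, N the number of heads, S the sequence length, and T the total sequence length, matching the shapes above:

```python
# Minimal NumPy sketch (illustration only): an attention bias with a size-1
# batch and/or head dimension broadcasts against QK^T scores of shape
# [B, N, S, T] before softmax.
import numpy as np

B, N, S, T = 2, 4, 3, 5
scores = np.random.rand(B, N, S, T).astype(np.float32)  # Q @ K^T / sqrt(d)

for bias_shape in [(1, N, S, T), (B, N, S, T), (B, 1, S, T), (1, 1, S, T)]:
    attention_bias = np.random.rand(*bias_shape).astype(np.float32)
    biased = scores + attention_bias            # size-1 dims broadcast to B and N
    probs = np.exp(biased - biased.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)  # softmax over T
    assert probs.shape == (B, N, S, T)
```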

  • Rename the "relative position bias" input to "attention bias", since it can also carry other kinds of bias such as ALiBi (Attention with Linear Biases) or an attention mask.
  • Update the unfused kernel to support broadcasting the 2nd dimension of the attention bias (see the indexing sketch after this list).
  • Update efficient attention to support broadcasting the 2nd dimension of the attention bias.
  • Update the operators (MultiHeadAttention, DecoderMaskedMultiHeadAttention, Attention, PackedAttention, PackedMultiHeadAttention) to support broadcasting the attention bias on the CUDA and CPU EPs.
  • Update ROCm, DML and WebGPU naming to be consistent. (Note that these EPs do not support broadcasting attention_bias for now.)
  • Add attention bias tests for MultiHeadAttention.
  • Update the operator documentation.
  • Update the benchmark script.
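
The following is a rough, hypothetical sketch of the index arithmetic a kernel needs once the bias batch and/or head dimension may be 1; the function and variable names are illustrative only and are not taken from the actual CUDA/CPU implementation:

```python
# Hypothetical sketch: compute the flat offset into an attention bias tensor of
# shape [B_bias, N_bias, S, T], where B_bias and N_bias are either 1 (broadcast)
# or the full batch size B / head count N.
def bias_offset(b, n, s, t, B_bias, N_bias, S, T):
    bb = b if B_bias > 1 else 0  # broadcast over batch when B_bias == 1
    nn = n if N_bias > 1 else 0  # broadcast over heads when N_bias == 1
    return ((bb * N_bias + nn) * S + s) * T + t
```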

Other changes:

  • Fix some checks in multihead-attention.ts
  • Add helper functions to dump tensors given dimensions.

Motivation and Context

tianleiwu marked this pull request as draft on August 12, 2024 19:32
tianleiwu marked this pull request as ready for review on August 15, 2024 15:09
tianleiwu requested a review from a team as a code owner on August 15, 2024 15:09
tianleiwu requested a review from wangyems on August 16, 2024 15:52
tianleiwu merged commit d79e3c5 into main on Aug 16, 2024
95 of 97 checks passed
tianleiwu deleted the tlwu/mha_attn_bias branch on August 16, 2024 22:40