Use Expand for Attention mask broadcasting instead of Concat #18159

PatriceVignola · 2023-10-30T08:07:45Z

Concatenating the same tensor with itself many times doesn't scale as well as simply broadcasting it with Expand. At least for DirectML, using Expand instead of Concat improves perf quite a bit, but I can't imagine Concat being faster than Expand in any implementation, since Expand can make more assumptions (e.g. it knows it only has to deal with a single tensor).

It doesn't affect the GQA path since that path completely gets rid of the mask anyway.
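To illustrate the difference between the two approaches, here is a minimal NumPy sketch. It is only an analogy and not the graph-building code touched by this PR: the mask layout `(batch, 1, q_len, kv_len)`, the `num_heads` parameter, and both helper names are assumptions made for the example. `np.concatenate` stands in for Concat and `np.broadcast_to` stands in for Expand, which follows NumPy-style broadcasting rules.

```python
import numpy as np


def broadcast_mask_concat(mask: np.ndarray, num_heads: int) -> np.ndarray:
    """Hypothetical helper: repeat the mask along the head axis by concatenation.

    The operator receives num_heads inputs that all happen to be the same tensor
    and has to copy each one into the output.
    """
    return np.concatenate([mask] * num_heads, axis=1)


def broadcast_mask_expand(mask: np.ndarray, num_heads: int) -> np.ndarray:
    """Hypothetical helper: broadcast the single mask along the head axis.

    There is only one input, and the size-1 head axis is stretched to num_heads.
    """
    batch, _, q_len, kv_len = mask.shape
    return np.broadcast_to(mask, (batch, num_heads, q_len, kv_len))


# Assumed example shape: batch=2, one shared head axis, 4x4 causal mask.
mask = np.tril(np.ones((2, 1, 4, 4), dtype=np.float32))

concat_result = broadcast_mask_concat(mask, num_heads=8)
expand_result = broadcast_mask_expand(mask, num_heads=8)
```

Both helpers produce the same values with shape `(2, 8, 4, 4)`. In NumPy the broadcast version is a view over the original buffer, whereas the concatenated version materializes every copy. How a given execution provider implements Expand may differ, so treat this as intuition for the argument above rather than a measurement.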

PatriceVignola added 2 commits October 29, 2023 04:13

Change concat for Expand in attention mask reshaping

e6cc581

Make initializer name unique

164d3a8

PatriceVignola requested review from tianleiwu and kunal-vaishnavi October 30, 2023 08:07

PatriceVignola closed this Oct 30, 2023

PatriceVignola deleted the user/pavignol/improve-attention-mask-broadcasting branch October 30, 2023 10:35

