I'm just curious whether your code here is using flash attention or not when mask is not None. My guess is that it falls back to memory-efficient attention instead, since PyTorch's flash attention kernel does not support an attention mask. In addition, if the memory-efficient kernel were used, half() would not have been needed when mask is not None.
Thank you!
++ I did some experiments. Even if sdp_flash is enabled, it is not executed when mask is not None. If we force PyTorch to use only the flash kernel, it spits out the error below, while the memory-efficient kernel does not.
/tmp/ipykernel_467687/3943656874.py:12: UserWarning: Memory efficient kernel not used because: (Triggered internally at /opt/conda/conda-bld/pytorch_1702400410390/work/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:367.)
y = torch.nn.functional.scaled_dot_product_attention(
/tmp/ipykernel_467687/3943656874.py:12: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at /opt/conda/conda-bld/pytorch_1702400410390/work/aten/src/ATen/native/transformers/sdp_utils_cpp.h:437.)
y = torch.nn.functional.scaled_dot_product_attention(
/tmp/ipykernel_467687/3943656874.py:12: UserWarning: Flash attention kernel not used because: (Triggered internally at /opt/conda/conda-bld/pytorch_1702400410390/work/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:369.)
y = torch.nn.functional.scaled_dot_product_attention(
/tmp/ipykernel_467687/3943656874.py:12: UserWarning: Both fused kernels do not support non-null attn_mask. (Triggered internally at /opt/conda/conda-bld/pytorch_1702400410390/work/aten/src/ATen/native/transformers/sdp_utils_cpp.h:261.)
y = torch.nn.functional.scaled_dot_product_attention(
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[34], line 12
10 kv_mask = (torch.rand(B, S) > 0.1).to(device)
11 x = [x.half() for x in [q, k, v]]
---> 12 y = attn(*x, q_mask, kv_mask)
File ~/miniconda3/envs/torch212/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/miniconda3/envs/torch212/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
Cell In[32], line 12, in TorchNativeAttention.forward(self, q, k, v, q_mask, kv_mask)
10 attn_mask = None
11 with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
---> 12 y = torch.nn.functional.scaled_dot_product_attention(
13 q, k, v, attn_mask=attn_mask, dropout_p=self.attn_dropout, is_causal=False
14 )
16 return y if attn_mask is None else y.nan_to_num()
RuntimeError: No available kernel. Aborting execution.
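For reference, here is roughly the cell I ran, reconstructed as a minimal sketch from the traceback above (the TorchNativeAttention wrapper and the shapes are from my notebook, not from this repo). Allowing only the flash kernel with a non-null attn_mask raises the error, while switching the flags to allow only the memory-efficient kernel runs fine:

```python
import torch
import torch.nn.functional as F

class TorchNativeAttention(torch.nn.Module):
    # Minimal stand-in for the wrapper in my notebook, not the repo's module.
    def __init__(self, attn_dropout: float = 0.0):
        super().__init__()
        self.attn_dropout = attn_dropout

    def forward(self, q, k, v, q_mask=None, kv_mask=None):
        attn_mask = None
        if q_mask is not None and kv_mask is not None:
            # (B, 1, S_q, S_kv) boolean mask: True = attend, False = masked out
            attn_mask = q_mask[:, None, :, None] & kv_mask[:, None, None, :]
        # Allow only the flash kernel; with a non-null attn_mask this raises
        # "RuntimeError: No available kernel. Aborting execution."
        # With enable_flash=False, enable_mem_efficient=True the same call succeeds.
        with torch.backends.cuda.sdp_kernel(
            enable_flash=True, enable_math=False, enable_mem_efficient=False
        ):
            y = F.scaled_dot_product_attention(
                q, k, v,
                attn_mask=attn_mask, dropout_p=self.attn_dropout, is_causal=False,
            )
        return y if attn_mask is None else y.nan_to_num()

device = "cuda"
B, H, S, D = 2, 8, 512, 64
attn = TorchNativeAttention().to(device)

q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))
q_mask = (torch.rand(B, S) > 0.1).to(device)
kv_mask = (torch.rand(B, S) > 0.1).to(device)

x = [t.half() for t in (q, k, v)]
y = attn(*x, q_mask, kv_mask)  # -> RuntimeError with the flash-only settings
```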
Hey @HJoonKwon! Damn, very good find, thank you! I guess this does matter in compiled forward, where we are padding inputs to static dimensions. We'd need to run the benchmarks, but maybe avoiding the call to half() could improve throughput then.
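Something like this could be a starting point for that benchmark (a rough sketch with made-up shapes, not the model's real forward; it just compares the SDPA call with and without the round-trip to half, restricted to the memory-efficient kernel since that is the one eligible with a non-null mask):

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

device = "cuda"
B, H, S, D = 2, 8, 1024, 64  # made-up shapes
attn_mask = torch.rand(B, 1, S, S, device=device) > 0.1  # boolean mask, True = attend

def time_path(cast_to_half: bool):
    # fp32 inputs, as they would come out of the rest of the network
    q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))
    stmt = (
        "F.scaled_dot_product_attention(q.half(), k.half(), v.half(), attn_mask=attn_mask).float()"
        if cast_to_half
        else "F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)"
    )
    # Restrict SDPA to the memory-efficient kernel for an apples-to-apples comparison.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=False, enable_mem_efficient=True
    ):
        return benchmark.Timer(
            stmt=stmt, globals=dict(F=F, q=q, k=k, v=v, attn_mask=attn_mask)
        ).blocked_autorange(min_run_time=1.0)

print("cast to half:", time_path(True))
print("stay in fp32:", time_path(False))
```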
On the topic of FlashAttention, you link to FlashAttention and not FlashAttention2 here.
Isn't the second version used? If not, why? It seems to be quite a bit faster.
Thanks for your great work!
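For anyone who wants to check on their side: which FlashAttention version you get through F.scaled_dot_product_attention depends on the installed PyTorch (if I remember correctly, FlashAttention-2 only landed in the SDPA flash backend around PyTorch 2.2). A quick way to see the version and which backends are enabled:

```python
import torch

print(torch.__version__)
# Enabled SDPA backends for this session (enabled != guaranteed to be
# selected for a given input; that still depends on dtype, mask, etc.).
print("flash:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:         ", torch.backends.cuda.math_sdp_enabled())
```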