-
Hi @BBuf, a flash_attention backend is great and we would love to have it.
cc @yzh119 for the bad case.
-
Hi @BBuf, it's not a fair comparison, because you use prefill attention on a page table (BatchPrefillWithPagedKVCache) for flashinfer while using ragged tensors (flash_attn_varlen_func) for flash_attention. BatchPrefillWithRaggedKVCache in flashinfer has the same semantics as, and a similar implementation to, flash_attn_varlen_func. I remember sglang uses both BatchPrefillWithRaggedKVCache and BatchPrefillWithPagedKVCache, but I'm not sure how sglang dispatches between the two implementations. @Ying1123 can you further clarify?
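For reference, here is a rough sketch of the two code paths being compared. The names follow the flash-attn 2.x and flashinfer Python APIs, but the exact flashinfer method names (begin_forward/forward vs. plan/run) and argument lists vary across releases, and the shapes are made up:

```python
import torch
import flashinfer
from flash_attn import flash_attn_varlen_func

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128  # Llama3-8B-like GQA shapes
# Two sequences (512 and 256 tokens) packed along the token axis ("ragged" layout).
qo_indptr = torch.tensor([0, 512, 768], dtype=torch.int32, device="cuda")
kv_indptr = qo_indptr.clone()
total_tokens = 768

q = torch.randn(total_tokens, num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(total_tokens, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn_like(k)

# flash-attn path: varlen (ragged) packed Q/K/V, no page table.
o_fa = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=qo_indptr, cu_seqlens_k=kv_indptr,
    max_seqlen_q=512, max_seqlen_k=512,
    causal=True,
)

# flashinfer path with the same semantics: ragged KV, no page table.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
ragged = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(workspace, "NHD")
ragged.begin_forward(qo_indptr, kv_indptr, num_qo_heads, num_kv_heads, head_dim)
o_fi = ragged.forward(q, k, v, causal=True)

# BatchPrefillWithPagedKVCacheWrapper, by contrast, reads K/V through a page
# table (block indices / last-page lengths), which is the extra indirection the
# original comparison put only on the flashinfer side.
```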
-
The test above was conducted in a custom-built PyTorch Docker container. After switching to an NGC Docker today, I no longer see a significant difference in computation time between flashinfer and flash_attention in this scenario; flashinfer was only about 10% slower. Below are the attention computation times from nsys for the three cases:
- vllm flash_attention: 81 + 52 = 133 us total
- sglang original: 43 + 104 = 147 us total
- sglang always ragged tensor: 71 + 59 = 130 us total

In summary, when deploying Llama3-8B on a single 4090 GPU at qps=8 on my dataset, the large gap I observed between flash attention and flashinfer was due to running in a manually compiled PyTorch Docker container. After switching to an NGC Docker, the gap became much smaller: flashinfer's attention computation was only about 10% slower than flash attention's. Furthermore, after making flashinfer always use ragged tensors, their attention computation times were nearly the same. Therefore, I personally feel there is no need to add a new flash attention backend. We can close this discussion and let users choose whether to always use ragged tensors. Based on the test results, the impact on throughput, TTFT, and TPOT is also small: TTFT decreased from 0.081s to 0.078s, while TPOT and throughput showed no significant change.
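If we do expose such a choice, the dispatch itself is tiny. Below is a purely illustrative sketch: none of the flag, class, or function names are real sglang options, and the wrapper `.forward(...)` calls only loosely follow the flashinfer wrapper API. It just shows the idea of letting users force the ragged path when there is no cached prefix to read through the page table.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical dispatch between the two flashinfer prefill wrappers.
@dataclass
class PrefillMetadata:
    ragged_wrapper: Any   # e.g. flashinfer.BatchPrefillWithRaggedKVCacheWrapper
    paged_wrapper: Any    # e.g. flashinfer.BatchPrefillWithPagedKVCacheWrapper
    paged_kv_cache: Any   # the paged KV pool tensor
    prefix_lens_sum: int  # total number of cached prefix tokens in the batch

def run_prefill_attention(q, k, v, meta: PrefillMetadata, mode: str = "auto"):
    """mode: "ragged", "paged", or "auto" (ragged when no prefix is cached)."""
    use_ragged = mode == "ragged" or (mode == "auto" and meta.prefix_lens_sum == 0)
    if use_ragged:
        # Same semantics as flash_attn_varlen_func: packed Q/K/V, no page table.
        return meta.ragged_wrapper.forward(q, k, v, causal=True)
    # Prefix tokens live in the paged KV pool, so read K/V through the page table.
    return meta.paged_wrapper.forward(q, meta.paged_kv_cache, causal=True)
```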
-
I tested the LLaMA3-8B model on a single 4090 GPU and found that flashinfer's attention computation is much slower than the flash attention library. I would like to ask whether sglang can support a flash attention backend.
sglang flashinfer: (profiler screenshot)
vllm flash_attention: (profiler screenshot)
Flashinfer is about two times slower than flash_attention here. Fortunately, we did not observe flashinfer being slower than flash attention on other models such as qwen2-72b, and sglang's overall throughput is much higher than vllm's.
I would like to know whether it is possible to support a flash_attention backend. I believe that on certain GPU architectures such as the 4090, or for certain shapes, FlashInfer has bad performance cases.
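To check whether a gap like this comes from the attention kernels themselves rather than from the surrounding framework or container/toolchain, one option is to time the two kernels in isolation with CUDA events. A minimal timing harness is sketched below; the iteration counts are arbitrary, and the commented usage assumes q/k/v tensors, cu_seqlens, and a flashinfer ragged wrapper have already been set up (those names are placeholders):

```python
import torch

def time_cuda(fn, iters: int = 100, warmup: int = 20) -> float:
    """Return the mean per-call latency of fn() in microseconds."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / iters  # elapsed_time is in ms

# Usage (q, k, v, cu_seqlens, and ragged_wrapper are assumed to exist already):
#   fa_us = time_cuda(lambda: flash_attn_varlen_func(
#       q, k, v, cu_seqlens, cu_seqlens, 512, 512, causal=True))
#   fi_us = time_cuda(lambda: ragged_wrapper.forward(q, k, v, causal=True))
#   print(f"flash-attn: {fa_us:.1f} us, flashinfer: {fi_us:.1f} us")
```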
I also gave it a try; the code is here:
But after starting the service, the program crashes with an illegal CUDA memory access after a few forward_decode calls. The logs are below. I'm not quite sure what happened, so I need help.
The serving command is: