[feature request] qwen model's query logn-scaling attn #836
Comments
I finished it like this: commit link. The C++ code implementing it: link
@Tlntin Hi, thank you for the reply. I am using trtllm release v0.7.0 and Qwen 7B 1.0. I added another function:

```cpp
inline __device__ float update_rotary_base_dynamic_ntk(
    const int kv_seq_len, const int max_positions, const int embed_dim, const float base, const float scale)
{
    const float ntk_alpha = exp2f(ceilf(log2f(1.f * kv_seq_len / max_positions) + 1.f)) - 1.f;
    return base * powf(ntk_alpha, embed_dim / (embed_dim - 2.f));
}

inline __device__ void update_rotary_base_n_scale(float& base, float& scale, RotaryScalingType const scale_type,
    const int rot_embed_dim, const int max_positions, const int seq_len)
{
    // Only update the base and/or scale if needed based on scale_type.
    if (scale_type == RotaryScalingType::kDYNAMIC)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kDYNAMIC_NTK_QWEN)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base_dynamic_ntk(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kLINEAR)
    {
        scale = 1.0f / scale;
    }
}
```

After looking at the code in the main branch, I am not sure whether my modification is right; there are some other places invoking these functions. Nevertheless, what you posted only covers the RoPE base update. logn attention is still missing; I have tested and found that removing logn-scaling hurts the performance of the Qwen agent. I found some commented-out code about logn scaling in your repository, but it does not look compatible with packed tensor mode. When paged attention is enabled, the qkv tensor's shape is [1, num_tokens, qkv_dim], is that right?
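For the logn part, here is a minimal CUDA sketch of what scaling the query slice of the packed QKV tensor could look like. This is not TensorRT-LLM code; the `positions` buffer (1-based position of each token within its own sequence), the kernel name, and the launch shape are assumptions made for illustration only.

```cpp
// Hypothetical standalone kernel: scale the query slice of a packed QKV tensor of
// shape [num_tokens, qkv_dim] by Qwen's logn factor log_{seq_len_trained}(pos).
// One block per token; q is assumed to occupy the first q_dim values of each row.
__global__ void applyLognScaling(float* qkv, const int* positions, const int num_tokens,
    const int q_dim, const int qkv_dim, const int seq_len_trained)
{
    const int token = blockIdx.x;
    if (token >= num_tokens)
    {
        return;
    }
    const int pos = positions[token]; // 1-based position of this token in its sequence (assumed input)
    if (pos <= seq_len_trained)
    {
        return; // logn scaling is the identity up to the training length
    }
    const float factor = logf(static_cast<float>(pos)) / logf(static_cast<float>(seq_len_trained));
    float* q_row = qkv + static_cast<size_t>(token) * qkv_dim;
    for (int i = threadIdx.x; i < q_dim; i += blockDim.x)
    {
        q_row[i] *= factor;
    }
}
```

If the packed layout really is [1, num_tokens, qkv_dim] as asked above, multiplying the first q_dim values of each row this way, right before the attention kernel consumes the QKV buffer, would be the simplest place to hook it in.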
It seems your C++ code may work better; it is more similar to the raw PyTorch code, which has two functions. I think the logn-scaling implementation may be difficult: my logn_scaling code can only be used without the GPT attention plugin, and it did not seem to work well, so I commented it out!
The trtllm implementation of Qwen does not support logn-scaling right now, which results in different inference results. @handoku I have the same question, do you have any update? Thanks
Sorry, no progress yet. Maybe making the trtllm team more aware of this painful problem and getting their help to solve it could save us.
@Tlntin have you ever tested TRT-LLM Qwen1 on long inputs? I found it produces empty output for inputs as long as 6K (smaller than 8K, the training length).
I tested it and it works well; you need to make the same changes as above.
Did you mean using the changes in your commit?
Yes.
It is supported as of today's update.
Qwen uses Qwen-style dynamic NTK and logn-scaling to generate better text for long-context input. The trtllm implementation of Qwen does not support logn-scaling right now, which results in low-quality outputs.

I would like to provide an implementation. However, it is a little difficult for me to understand `gpt_attention`. My naive thought is to multiply the `q` tensor with a `logn` tensor before calling `gpt_attention`. But every `seq_len_idx` value of the `q` tensor is needed for calculating `log_seq_len_trained(seq_len_idx)`. I don't know how to get the `seq_len_idx` values, especially in packed tensor mode. Would you please give some help on this? Is there a convenient way to achieve this (even in a dirty, hard-coded way)?
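For what it is worth, here is a host-side sketch of how the per-token factor could be derived from per-sequence lengths in packed (remove-padding) mode, assuming `seq_len_idx` counts from 1 within each sequence and `seq_len_trained` is the training context length. The function name and layout are illustrative only, not trtllm API.

```cpp
#include <cmath>
#include <vector>

// Build one logn factor per packed token so it can be multiplied into q before attention.
// Tokens of all sequences are assumed to be concatenated in order, with only the
// per-sequence lengths known (as in packed tensor mode).
std::vector<float> buildLognFactors(const std::vector<int>& seq_lengths, const int seq_len_trained)
{
    std::vector<float> factors;
    for (const int len : seq_lengths)
    {
        for (int seq_len_idx = 1; seq_len_idx <= len; ++seq_len_idx)
        {
            // Qwen's logn scaling: log_{seq_len_trained}(seq_len_idx), and 1 below the training length.
            factors.push_back(seq_len_idx > seq_len_trained
                    ? std::log(static_cast<float>(seq_len_idx)) / std::log(static_cast<float>(seq_len_trained))
                    : 1.0f);
        }
    }
    return factors;
}
```

The resulting vector has one entry per packed token, so broadcasting it over the head and channel dimensions of `q` would give the elementwise multiply described above.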