
[feature request] qwen model's query logn-scaling attn #836

Closed
handoku opened this issue Jan 8, 2024 · 11 comments
Labels: feature request
handoku commented Jan 8, 2024

Qwen uses Qwen-style dynamic NTK and logn-scaling to generate better text for long-context input.

The trtllm implementation of qwen does not support logn-scaling right now, which results in low-quality outputs.

I would like to provide an implementation. However, it's a little difficult for me to understand gpt_attention.

My naive idea is to multiply the q tensor by a logn tensor before calling gpt_attention. But the position index seq_len_idx of every query token is needed to compute log_{seq_len_trained}(seq_len_idx), and I don't know how to get seq_len_idx, especially in packed tensor mode.

Would you please give some help on this? Is there a convenient way to achieve this (even in a dirty, hard-coded way)?
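For context, the factor I want to apply, following my reading of Qwen's HF modeling code, is roughly the sketch below; logn_scale is just an illustrative name, pos is the 1-based absolute position of a query token, and seq_len_trained is the training context length (seq_length in Qwen's config):

// Sketch of the per-position logn factor: queries within the trained
// context length are left unscaled; past it, each query is multiplied
// by log base seq_len_trained of pos.
inline __device__ float logn_scale(const int pos, const int seq_len_trained)
{
    return pos > seq_len_trained
        ? logf(static_cast<float>(pos)) / logf(static_cast<float>(seq_len_trained))
        : 1.f;
}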

Tlntin (Contributor) commented Jan 8, 2024

I finished it like this: commit link

C++ implementation: link

handoku (Author) commented Jan 8, 2024

> I finished it like this: commit link
>
> C++ implementation: link

@Tlntin Hi, thank you for the reply.

I am using trtllm release v0.7.0 and Qwen-7B 1.0. I added another RotaryScalingType value and modified the update_rotary_base_n_scale function directly to support Qwen-style dynamic NTK, as follows:

// Qwen-style dynamic NTK: recompute the RoPE base from the ratio of the
// current KV sequence length to the trained context length:
//   ntk_alpha = 2^ceil(log2(kv_seq_len / max_positions) + 1) - 1
//   new_base  = base * ntk_alpha^(embed_dim / (embed_dim - 2))
// (scale is unused here; kept for signature parity with update_rotary_base)
inline __device__ float update_rotary_base_dynamic_ntk(
    const int kv_seq_len, const int max_positions, const int embed_dim, const float base, const float scale)
{
    const float ntk_alpha = exp2f(ceilf(log2f(1.f * kv_seq_len / max_positions) + 1.f)) - 1.f;
    return base * powf(ntk_alpha, embed_dim / (embed_dim - 2.f));
}

inline __device__ void update_rotary_base_n_scale(float& base, float& scale, RotaryScalingType const scale_type,
    const int rot_embed_dim, const int max_positions, const int seq_len)
{
    // only update the base and/or scale if needed based on scale_type
    if (scale_type == RotaryScalingType::kDYNAMIC)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kDYNAMIC_NTK_QWEN)
    {
        // Qwen-style dynamic NTK only kicks in past the trained context length
        if (seq_len > max_positions)
        {
            base = update_rotary_base_dynamic_ntk(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kLINEAR)
    {
        scale = 1.0f / scale;
    }
}

After seeing the code in the main branch, I am not sure whether my modification is right; there are some other places that invoke update_rotary_base_n_scale. Why was another update_dynamic_scaling_rotary entry added? That makes me more confused.

Nevertheless, what you posted looks like it only covers the RoPE base update; logn attention is still missing. I have tested and found that removing logn-scaling hurts the performance of the Qwen agent.

I found some commented-out code about logn scaling in your repository, but it does not look compatible with packed tensor mode. When paged attention is enabled, the qkv tensor's shape is [1, num_tokens, qkv_dim], is that right?
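For reference, here is a rough sketch of how the scaling might be applied in packed mode; this is only my guess, and cu_q_seqlens (per-sequence token offsets), the kernel name, and the launch shape are hypothetical assumptions, not existing TRT-LLM code:

// Hypothetical sketch: scale the Q part of a packed QKV buffer of shape
// [1, num_tokens, (num_q_heads + 2 * num_kv_heads) * head_size] by the Qwen
// logn factor. cu_q_seqlens holds the running token offsets per sequence
// (cu_q_seqlens[0] == 0); launched with one block per packed token.
template <typename T>
__global__ void applyLognScaling(T* qkv, const int* cu_q_seqlens, const int batch_size,
    const int num_q_heads, const int num_kv_heads, const int head_size, const int seq_len_trained)
{
    const int token_idx = blockIdx.x;
    // recover which sequence this packed token belongs to, and its position in it
    int b = 0;
    while (b < batch_size - 1 && cu_q_seqlens[b + 1] <= token_idx)
    {
        ++b;
    }
    const int pos = token_idx - cu_q_seqlens[b] + 1; // 1-based position in its sequence
    if (pos <= seq_len_trained)
    {
        return; // within the trained context length: no scaling
    }
    const float scale = logf(static_cast<float>(pos)) / logf(static_cast<float>(seq_len_trained));

    // only the leading num_q_heads * head_size elements of each token are Q
    const int q_dim = num_q_heads * head_size;
    const int qkv_dim = (num_q_heads + 2 * num_kv_heads) * head_size;
    T* q = qkv + static_cast<size_t>(token_idx) * qkv_dim;
    for (int i = threadIdx.x; i < q_dim; i += blockDim.x)
    {
        q[i] = static_cast<T>(static_cast<float>(q[i]) * scale);
    }
}

Something like applyLognScaling<<<num_tokens, 256, 0, stream>>>(...) would run before the context-phase attention consumes Q; whether that is the right hook point inside gpt_attention is exactly what I am unsure about.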

byshiue added the feature request label Jan 8, 2024
Tlntin (Contributor) commented Jan 9, 2024

It seems your C++ code may work better; it is more similar to the raw PyTorch code!
About logn_scale, I think you need to do a C++ implementation in the same way.
You need to change the plugin code:
TensorRT-LLM/cpp/tensorrt_llm/plugins/gptAttentionPlugin/gptAttentionPlugin.cpp at 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1 · NVIDIA/TensorRT-LLM (github.com)

There are two functions, enqueueContext and enqueueGeneration:

TensorRT-LLM/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp at 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1 · NVIDIA/TensorRT-LLM (github.com)

I think the logn scale implementation may have some difficulty.

My logn_scaling code can only be used without the GPT attention plugin, but it seems it does not work well, so I commented it out!
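If you change the plugin code, the generation phase (enqueueGeneration) looks simpler than the context phase: each step has one query token per sequence, and its position is just the past KV length plus one. A rough, hypothetical sketch along the same lines (the names are made up, not real TRT-LLM code):

// Hypothetical generation-phase variant: one query token per sequence,
// whose 1-based position is past_kv_lens[b] + 1; one block per sequence.
template <typename T>
__global__ void applyLognScalingGen(T* q, const int* past_kv_lens,
    const int num_q_heads, const int head_size, const int seq_len_trained)
{
    const int b = blockIdx.x;
    const int pos = past_kv_lens[b] + 1; // position of the newly generated token
    if (pos <= seq_len_trained)
    {
        return; // within the trained context length: no scaling
    }
    const float scale = logf(static_cast<float>(pos)) / logf(static_cast<float>(seq_len_trained));
    T* q_b = q + static_cast<size_t>(b) * num_q_heads * head_size;
    for (int i = threadIdx.x; i < num_q_heads * head_size; i += blockDim.x)
    {
        q_b[i] = static_cast<T>(static_cast<float>(q_b[i]) * scale);
    }
}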

Lzhang-hub commented

> The trtllm implementation of qwen does not support logn-scaling right now, which results in different inference results.

@handoku I have the same question. Do you have any updates? Thanks!

handoku (Author) commented Jan 24, 2024

> The trtllm implementation of qwen does not support logn-scaling right now, which results in different inference results.
>
> @handoku I have the same question. Do you have any updates? Thanks!

Sorry, no progress yet. Maybe making the trtllm team more aware of this painful problem so that they help solve it could save us.

Njuapp commented Sep 18, 2024

@Tlntin have you ever tested TRT-LLM Qwen1 on long input? I found that it produces empty output for inputs as long as 6K (smaller than 8K, the training length).

Tlntin (Contributor) commented Sep 18, 2024

> @Tlntin have you ever tested TRT-LLM Qwen1 on long input? I found that it produces empty output for inputs as long as 6K (smaller than 8K, the training length).

I tested it and it works well; you need to make the same changes as above.

Njuapp commented Sep 18, 2024

> > @Tlntin have you ever tested TRT-LLM Qwen1 on long input? I found that it produces empty output for inputs as long as 6K (smaller than 8K, the training length).
>
> I tested it and it works well; you need to make the same changes as above.

Did you mean using the changes in your commit?

Tlntin (Contributor) commented Sep 18, 2024

> Did you mean using the changes in your commit?

yes

Njuapp commented Dec 11, 2024

It is supported as of today's update.

Njuapp commented Dec 11, 2024

#2562

handoku closed this as completed Dec 11, 2024