
About the cached layernorm scale factors #696

Open
Meehaohao opened this issue Aug 7, 2024 · 2 comments

Comments


Meehaohao commented Aug 7, 2024

Question

About the `apply_ln_to_stack` function in the `ActivationCache.py` file: what is the meaning of the following sentence?
"The layernorm scale is global across the entire residual stream for each layer, batch element and position, which is why we need to use the cached scale factors rather than just applying a new LayerNorm."

That is, why do we need to use the cached scale factors (in layer_norm, the scale factors are the mean and std) from the original forward pass, rather than normalizing the variables directly in each layer?

neelnanda-io (Collaborator) commented Aug 7, 2024 via email


Because we want to use the scale factor of the FINAL residual stream to scale COMPONENTS of the residual stream, and you can't infer the final norm from partial components
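A rough sketch of this point (a toy example with made-up shapes and the standard LayerNorm scale formula, not actual TransformerLens code): the scale is computed from the full residual stream, so dividing each component by that shared cached scale keeps the decomposition additive, whereas recomputing a scale from a single component gives a different number.

```python
import torch

# Toy residual stream made of two components (e.g. embeddings + one head's output).
# Shapes are illustrative: [batch, pos, d_model].
comp_a = torch.randn(1, 4, 16)
comp_b = torch.randn(1, 4, 16)
residual = comp_a + comp_b

# LayerNorm scale of the FULL residual stream (the quantity the model actually
# uses at this point, and what gets cached as the per-position scale factor).
centered = residual - residual.mean(dim=-1, keepdim=True)
cached_scale = (centered.pow(2).mean(dim=-1, keepdim=True) + 1e-5).sqrt()

# Dividing each centered component by the shared cached scale keeps things linear:
# the scaled components still sum to the model's actual normalized residual stream.
a_scaled = (comp_a - comp_a.mean(dim=-1, keepdim=True)) / cached_scale
b_scaled = (comp_b - comp_b.mean(dim=-1, keepdim=True)) / cached_scale
full_normed = centered / cached_scale
print(torch.allclose(a_scaled + b_scaled, full_normed, atol=1e-5))  # True

# Recomputing a fresh scale from one component alone gives a different value,
# so "just applying a new LayerNorm" to a component puts it on the wrong scale.
a_centered = comp_a - comp_a.mean(dim=-1, keepdim=True)
own_scale = (a_centered.pow(2).mean(dim=-1, keepdim=True) + 1e-5).sqrt()
print(torch.allclose(cached_scale, own_scale))  # False in general
```

This is the same reason (as I understand it) that `apply_ln_to_stack` divides the stacked components by the cached scale activation rather than running a new LayerNorm on each component.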




Meehaohao (Author) commented

So, does the scale factor carry any particular meaning?
I think it is just a normalization operation, and it depends on the input sentence rather than on parameters the LLM has learned.
If so, maybe we could directly normalize the COMPONENTS of the residual stream (for LN, subtract the mean and divide by the standard deviation), which might also be reasonable?
