
About the cached layernorm scale factors #696

Open
Meehaohao opened this issue Aug 7, 2024 · 2 comments

Comments


Meehaohao commented Aug 7, 2024

Question

About the `apply_ln_to_stack` function in the `ActivationCache.py` file: what is the meaning of the following sentence?
"The layernorm scale is global across the entire residual stream for each layer, batch element and position, which is why we need to use the cached scale factors rather than just applying a new LayerNorm."

That is, why do we need to use the cached scale factors (in layer_norm, the scale factors are the mean and std) from the original forward pass, rather than normalizing the variables directly in each layer?

neelnanda-io (Collaborator) commented Aug 7, 2024 via email


Because we want to use the scale factor of the FINAL residual stream to scale COMPONENTS of the residual stream, and you can't infer the final norm from partial components
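A rough sketch of this point (a toy example with made-up shapes and the standard LayerNorm scale formula, not actual TransformerLens code): the scale is computed from the full residual stream, so dividing each component by that shared cached scale keeps the decomposition additive, whereas recomputing a scale from a single component gives a different number.

```python
import torch

# Toy residual stream made of two components (e.g. embeddings + one head's output).
# Shapes are illustrative: [batch, pos, d_model].
comp_a = torch.randn(1, 4, 16)
comp_b = torch.randn(1, 4, 16)
residual = comp_a + comp_b

# LayerNorm scale of the FULL residual stream (the quantity the model actually
# uses at this point, and what gets cached as the per-position scale factor).
centered = residual - residual.mean(dim=-1, keepdim=True)
cached_scale = (centered.pow(2).mean(dim=-1, keepdim=True) + 1e-5).sqrt()

# Dividing each centered component by the shared cached scale keeps things linear:
# the scaled components still sum to the model's actual normalized residual stream.
a_scaled = (comp_a - comp_a.mean(dim=-1, keepdim=True)) / cached_scale
b_scaled = (comp_b - comp_b.mean(dim=-1, keepdim=True)) / cached_scale
full_normed = centered / cached_scale
print(torch.allclose(a_scaled + b_scaled, full_normed, atol=1e-5))  # True

# Recomputing a fresh scale from one component alone gives a different value,
# so "just applying a new LayerNorm" to a component puts it on the wrong scale.
a_centered = comp_a - comp_a.mean(dim=-1, keepdim=True)
own_scale = (a_centered.pow(2).mean(dim=-1, keepdim=True) + 1e-5).sqrt()
print(torch.allclose(cached_scale, own_scale))  # False in general
```

This is the same reason (as I understand it) that `apply_ln_to_stack` divides the stacked components by the cached scale activation rather than running a new LayerNorm on each component.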




Meehaohao (Author) commented

So, does the scale factor carry any particular meaning?
I think it is just a normalization operation, and it depends on the input sentence rather than on parameters the LLM has learned.
If so, maybe we could directly normalize the COMPONENTS of the residual stream (for LN, subtract the mean and divide by the standard deviation), which might also be reasonable?
