About the cached layernorm scale factors #696
Comments
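Because we want to use the scale factor of the FINAL residual stream to
scale individual COMPONENTS of the residual stream, and you can't infer the
final norm from partial components: the norm of a sum is not determined by
the norms of its parts, so each component has to be divided by the same
cached scale that the full residual stream was divided by.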
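
To make the point above concrete, here is a minimal sketch in plain Python. It does not use TransformerLens itself; the helper names (`center`, `ln_scale`, `layer_norm`) and the three random "components" are assumptions made up for illustration, and the learned LayerNorm weight and bias are left out. It compares dividing each centered component by the scale of the full residual stream against applying a fresh LayerNorm to each component.

```python
import math
import random

def center(x):
    """Subtract the mean over the feature dimension."""
    mu = sum(x) / len(x)
    return [v - mu for v in x]

def ln_scale(x, eps=1e-5):
    """Scale used by LayerNorm: sqrt(mean(centered**2) + eps)."""
    c = center(x)
    return math.sqrt(sum(v * v for v in c) / len(c) + eps)

def layer_norm(x, eps=1e-5):
    """LayerNorm without the learned weight and bias."""
    s = ln_scale(x, eps)
    return [v / s for v in center(x)]

random.seed(0)
d_model = 8

# Hypothetical components written to the residual stream (e.g. embed, attn, MLP).
components = [
    [random.gauss(0, std) for _ in range(d_model)] for std in (0.5, 2.0, 1.0)
]
resid_final = [sum(vals) for vals in zip(*components)]

# The scale the forward pass would have computed (and cached) for the full stream.
cached_scale = ln_scale(resid_final)
target = layer_norm(resid_final)

# Centering is linear, so centered components divided by the cached scale
# add up to LayerNorm(resid_final).
with_cached = [[v / cached_scale for v in center(c)] for c in components]
sum_cached = [sum(vals) for vals in zip(*with_cached)]

# A fresh LayerNorm divides each component by its own scale instead,
# so the pieces no longer add up to the normalized full stream.
with_fresh = [layer_norm(c) for c in components]
sum_fresh = [sum(vals) for vals in zip(*with_fresh)]

err_cached = max(abs(a - b) for a, b in zip(sum_cached, target))
err_fresh = max(abs(a - b) for a, b in zip(sum_fresh, target))
print("max error with cached scale:", err_cached)
print("max error with fresh LayerNorm:", err_fresh)
```

With the cached scale, the error should be at the level of floating-point rounding, while the fresh-LayerNorm version should differ noticeably, because each component is rescaled by a different amount.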
…On Wed, 7 Aug 2024, 05:20 Mi Hao, ***@***.***> wrote:
Question
About the 'apply_ln_to_stack' function in the 'ActivationCache.py' file,
what is the meaning of:
"The layernorm scale is global across the entire residual stream for each
layer, batch element and position, which is why we need to use the cached
scale factors rather than just applying a new LayerNorm."?
That is, why do we need to use the cached scale factors (in layer_norm, the
scale factor is the mean and std) from the original forward pass, instead of
normalizing the variables directly?
So, does the scale factor have some meaning of its own?
Question
About the 'apply_ln_to_stack' function in the 'ActivationCache.py' file, what is the meaning of the following sentence?
"The layernorm scale is global across the entire residual stream for each layer, batch element and position, which is why we need to use the cached scale factors rather than just applying a new LayerNorm."
That is, why do we need to use the cached scale factors (in layer_norm, the scale factor is the mean and std) from the original forward pass, rather than normalizing the variables directly in different layers?