-
This is interesting. Have you tried this on other datasets?
-
Some implementations of self-attention use a modified version of softmax which has an extra logit, such as here.
The motivation is that when we use the regular softmax for each attention head, we are basically forcing every head to make a decision, even if it has no information to add to the output vector. The solution is to add 1 to the denominator of the softmax computation, which is equivalent to an extra virtual logit equal to 0. This allows some attention heads to be "quiet". This blog post explains the idea in some detail: https://www.evanmiller.org/attention-is-off-by-one.html.
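To make the equivalence concrete (using the $\mathrm{softmax}_1$ notation from that blog post), for logits $x_1, \dots, x_n$:

$$\mathrm{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_{j=1}^{n} e^{x_j}} = \frac{e^{x_i}}{e^{0} + \sum_{j=1}^{n} e^{x_j}}$$

so the attention weights no longer have to sum to 1, and if every logit is very negative a head can attend to (almost) nothing.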
In icefall, an implementation would look like the following:
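As a rough sketch of what such a function might look like in PyTorch (the name `softmax_one` and the max-shift stability trick are illustrative, not necessarily the actual icefall code):

```python
import torch


def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax with an extra implicit logit fixed at 0.

    Computes exp(x_i) / (1 + sum_j exp(x_j)), so the outputs may sum to
    less than 1 and an attention head is allowed to be "quiet".
    """
    # Shift by max(0, max(x)) so both the real logits and the virtual
    # zero logit stay numerically stable.
    x_max = x.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    exp_x = torch.exp(x - x_max)
    # exp(-x_max) is the virtual zero logit after the shift.
    return exp_x / (exp_x.sum(dim=dim, keepdim=True) + torch.exp(-x_max))
```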
And then it can be used to replace the regular softmax used in the zipformer self-attention.
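For illustration only, a generic scaled dot-product attention (not the actual zipformer code; shapes are hypothetical) would use it at the point where the attention weights are normalized:

```python
import math
import torch

# Hypothetical shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
# attn = scores.softmax(dim=-1)        # regular softmax
attn = softmax_one(scores, dim=-1)     # "quiet" softmax sketched above
out = torch.matmul(attn, v)            # (2, 4, 16, 32)
```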
I did some quick experiments with this change on TED-LIUM and got small improvements (I did not change anything else):