-
This is interesting. Have you tried this on other datasets?
-
Some implementations of self-attention use a modified version of softmax which has an extra logit, such as here.
The motivation is that when we use the regular softmax for each attention head, we are basically forcing every head to make a decision, even if it has no information to add to the output vector. The solution is to add 1 to the denominator of the softmax computation, which is equivalent to an extra virtual logit equal to 0. This allows some attention heads to be "quiet". This blog post explains the idea in some detail: https://www.evanmiller.org/attention-is-off-by-one.html.
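To make the equivalence concrete (using the $\mathrm{softmax}_1$ notation from that blog post), for logits $x_1, \dots, x_n$:

$$\mathrm{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_{j=1}^{n} e^{x_j}} = \frac{e^{x_i}}{e^{0} + \sum_{j=1}^{n} e^{x_j}}$$

so the attention weights no longer have to sum to 1, and if every logit is very negative a head can attend to (almost) nothing.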
In icefall, an implementation would look like the following:
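As a rough sketch of what such a function might look like in PyTorch (the name `softmax_one` and the max-shift stability trick are illustrative, not necessarily the actual icefall code):

```python
import torch


def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax with an extra implicit logit fixed at 0.

    Computes exp(x_i) / (1 + sum_j exp(x_j)), so the outputs may sum to
    less than 1 and an attention head is allowed to be "quiet".
    """
    # Shift by max(0, max(x)) so both the real logits and the virtual
    # zero logit stay numerically stable.
    x_max = x.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    exp_x = torch.exp(x - x_max)
    # exp(-x_max) is the virtual zero logit after the shift.
    return exp_x / (exp_x.sum(dim=dim, keepdim=True) + torch.exp(-x_max))
```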
And then it can be used to replace the regular softmax used in the zipformer self-attention.
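For illustration only, a generic scaled dot-product attention (not the actual zipformer code; shapes are hypothetical) would use it at the point where the attention weights are normalized:

```python
import math
import torch

# Hypothetical shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
# attn = scores.softmax(dim=-1)        # regular softmax
attn = softmax_one(scores, dim=-1)     # "quiet" softmax sketched above
out = torch.matmul(attn, v)            # (2, 4, 16, 32)
```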
I did some quick experiments with this change on TED-LIUM and got small improvements (I did not change anything else):