Hey @ivanvovk et al.

Thanks a lot for open-sourcing the model - it's working really well! I've been looking a bit through the code base, and I was surprised to see that the attention layer here:

Speech-Backbones/Grad-TTS/model/diffusion.py
Line 95 in b82fdd5

computes the softmax on the projected key values instead of computing it on the product of query and key.

Usually, I know self-attention as:

Value x Softmax(Query x Key^T / sqrt(d_k))

but it seems like here it is

(Value x Softmax(Key)) x Query

=> Is it similar to self-attention? Where does it come from?

Best,
Patrick
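For reference, here is a minimal sketch of the two computations being compared. This is illustrative single-head, 2D-tensor code, not the repository's implementation; all names and shapes are hypothetical:

```python
import torch

n, d = 128, 64                       # sequence length, head dimension (illustrative)
q, k, v = (torch.randn(n, d) for _ in range(3))

# Classical dot-product attention: an n x n score matrix,
# softmax taken over the key positions for each query.
scores = q @ k.T / d ** 0.5          # (n, n)
out_classic = scores.softmax(dim=-1) @ v

# What the layer in question computes instead: softmax over the key
# positions alone, folded into a small (d, d) context matrix that is
# then applied to the raw, unnormalized queries.
context = k.softmax(dim=0).T @ v     # (d, d); the n x n matrix is never formed
out_linear = q @ context             # (n, d)
```

The second form costs O(n·d²) instead of O(n²·d), which is the motivation for this style of attention.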
Hi, @patrickvonplaten! Sorry for the late reply, and thank you very much for pointing that issue out!

This is actually a form of the compute- and memory-efficient attention mechanism called Efficient Attention. Mathematically, it is claimed to be approximately equivalent to classical dot-product attention.

However, we noticed that we missed taking the softmax of the query vectors - our bad. That said, taking the softmax is just a form of normalization, so it is no surprise that the layer worked out of the box anyway.
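For illustration, a minimal sketch of the corrected computation, assuming the Efficient Attention formulation of Shen et al. (2021) that the reply refers to: softmax over the feature dimension of the queries and over the positions of the keys, again simplified to a single head with hypothetical names:

```python
import torch

def efficient_attention(q, k, v):
    # q, k: (n, d); v: (n, d_v). Linear in n: only a (d, d_v) context
    # matrix is materialized, never the n x n score matrix.
    q = q.softmax(dim=-1)   # normalize each query across its d features
    k = k.softmax(dim=0)    # normalize each key channel across the n positions
    context = k.T @ v       # (d, d_v) global context summary
    return q @ context      # (n, d_v)
```

With both normalizations in place, every output row is a convex combination of the value rows, mirroring the guarantee that the row-wise softmax provides in classical dot-product attention.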