About transpose processing in MultiHeadedAttention class. #118

Open
Tinghao-NTU opened this issue Nov 17, 2023 · 3 comments

@Tinghao-NTU

Below is the forward function of the MultiHeadedAttention class:

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)

I notice that query, key, and value are transposed (`lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)`) after passing through the linear layers. After the attention is computed, `x` is transposed back (`x.transpose(1, 2)`).

May I know why this is needed? Can we just use `lin(x).view(nbatches, -1, self.h, self.d_k)` and
`x = x.contiguous().view(nbatches, -1, self.h * self.d_k)`?

I deleted all the transposes and the result is different. So I am wondering which one is correct: the original version with the transposes, or the one without.

@BillyChen123

I have the same confusion about this code.

@gitfourteen

gitfourteen commented Mar 22, 2024

Note that the `-1` in `view` is inferred as the sequence length $N_{token}$ (the number of tokens, or time steps, of each sequence in the batch), and the attention scores of every head have the same shape, $N_{token} \times N_{token}$.

    def attention(query, key, value, mask=None, dropout=None):
        "Compute 'Scaled Dot Product Attention'"
        d_k = query.size(-1)
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        p_attn = scores.softmax(dim=-1)
        if dropout is not None:
            p_attn = dropout(p_attn)
        return torch.matmul(p_attn, value), p_attn
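
As a rough check of the shape claim, here is a minimal sketch that applies the same score expression to random tensors already laid out as `(batch, h, N_token, d_k)`. The sizes are made up for illustration and are not taken from the repository.

```python
import math
import torch

# Hypothetical sizes, chosen only for illustration.
batch, h, n_token, d_k = 2, 8, 5, 64

# Tensors in the layout produced by .view(...).transpose(1, 2).
query = torch.randn(batch, h, n_token, d_k)
key = torch.randn(batch, h, n_token, d_k)

# Same expression as in attention(): matmul works on the last two
# dimensions and treats (batch, h) as batch dimensions.
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
p_attn = scores.softmax(dim=-1)

print(scores.shape)  # torch.Size([2, 8, 5, 5])
```

Each of the 8 heads gets its own 5 x 5 matrix, and every row of `p_attn` is a distribution over the 5 tokens.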

@PangLuo

PangLuo commented Dec 4, 2024

> Can we just use `lin(x).view(nbatches, -1, self.h, self.d_k)` and
> `x = x.contiguous().view(nbatches, -1, self.h * self.d_k)`?

I don't think so.

With your suggestion, the resulting shape of query/key/value will be
`(batch_size, seq_len, h, d_k)`.

In the attention function we have
`scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)`.

`torch.matmul(query, key.transpose(-2, -1))` would then multiply
`(batch_size, seq_len, h, d_k)` by `(batch_size, seq_len, d_k, h)`. The shapes are compatible, but the result, of shape `(batch_size, seq_len, h, h)`, does not represent attention scores: it compares the heads of a single token with each other instead of comparing tokens.
The scores should be the product of `(batch_size, h, seq_len, d_k)` and `(batch_size, h, d_k, seq_len)`, which gives a `(batch_size, h, seq_len, seq_len)` matrix of token-to-token scores for each head.
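
To make the difference concrete, here is a small sketch comparing the two layouts. The sizes are hypothetical, and a random tensor `x` stands in for the output of `lin(x)`; for simplicity the same tensor is used as both query and key.

```python
import torch

torch.manual_seed(0)
batch, seq_len, h, d_k = 2, 5, 4, 8  # hypothetical sizes
x = torch.randn(batch, seq_len, h * d_k)  # stand-in for lin(x)

# Without transpose: (batch, seq_len, h, d_k).
q_no_t = x.view(batch, seq_len, h, d_k)
scores_no_t = torch.matmul(q_no_t, q_no_t.transpose(-2, -1))

# With transpose: (batch, h, seq_len, d_k).
q_t = x.view(batch, seq_len, h, d_k).transpose(1, 2)
scores_t = torch.matmul(q_t, q_t.transpose(-2, -1))

print(scores_no_t.shape)  # torch.Size([2, 5, 4, 4])  head-to-head per token
print(scores_t.shape)     # torch.Size([2, 4, 5, 5])  token-to-token per head

# With the transpose, entry [b, head, i, j] is the dot product of
# tokens i and j restricted to that head's slice of the features.
expected = torch.dot(x[0, 1, :d_k], x[0, 3, :d_k])  # head 0, tokens 1 and 3
print(torch.allclose(scores_t[0, 0, 1, 3], expected, atol=1e-5))

# Transposing back before the final view restores the original layout,
# so each token's heads end up concatenated along the last dimension.
merged = q_t.transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)
print(torch.equal(merged, x))
```

The version without the transpose still runs, which is why the mistake is easy to miss: only the meaning of the last two dimensions changes.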
