Why do we need an additional linear transformation after the MHA and before the MLP when the dimensions are the same?

(I understand that this is how the original Transformer implementation was written, but I took this operation to be there for dimensional consistency between sequential attention operations. It seems superfluous here, since the first linear layer in the MLP can already take linear combinations of the attention outputs.)
I think the point is that W^O can change the dimensionality if the concatenated head outputs are larger or smaller than the model dimension (i.e. if d_k or d_v were chosen to be a different size). The paper ultimately picked parameters that make W^O square, but keeping it as a separate projection may have made it easier for the authors to try out other parameter settings.
Referenced code: minGPT/mingpt/model.py, line 42 at commit 37baab7
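To make the dimensional argument concrete, here is a minimal multi-head attention sketch (illustrative only, not minGPT's actual code; `MultiHeadAttention`, `d_head`, and `w_o` are hypothetical names) in which the per-head size is decoupled from the model width. The output projection corresponding to W^O maps n_head * d_head back to d_model, and it only happens to be square in the special case n_head * d_head == d_model, which is the configuration minGPT uses.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative MHA where the head size is decoupled from d_model (hypothetical sketch)."""
    def __init__(self, d_model, n_head, d_head):
        super().__init__()
        self.n_head, self.d_head = n_head, d_head
        # query/key/value projections for all heads, batched together
        self.w_q = nn.Linear(d_model, n_head * d_head)
        self.w_k = nn.Linear(d_model, n_head * d_head)
        self.w_v = nn.Linear(d_model, n_head * d_head)
        # W^O: maps the concatenated heads back to d_model;
        # only square when n_head * d_head == d_model
        self.w_o = nn.Linear(n_head * d_head, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        # reshape to (B, n_head, T, d_head)
        q = self.w_q(x).view(B, T, self.n_head, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.n_head, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.n_head, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        att = att.softmax(dim=-1)
        y = att @ v                                                     # (B, n_head, T, d_head)
        y = y.transpose(1, 2).reshape(B, T, self.n_head * self.d_head)  # concatenate heads
        return self.w_o(y)                                              # project back to d_model

# e.g. d_model=128 with 4 heads of size 48: the concatenation is 192-dim, so W^O maps 192 -> 128
x = torch.randn(2, 10, 128)
print(MultiHeadAttention(128, n_head=4, d_head=48)(x).shape)  # torch.Size([2, 10, 128])
```

When n_head * d_head == d_model (as in minGPT), W^O is square and looks redundant with the first linear layer of the MLP, but it is what lets the general formulation support head sizes that do not tile the model width.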