
hackerllama/blog/posts/random_transformer/ #3

Open
utterances-bot opened this issue Jan 2, 2024 · 20 comments

Comments

@utterances-bot

hackerllama - The Random Transformer

Understand how transformers work by demystifying all the math behind them

https://osanseviero.github.io/hackerllama/blog/posts/random_transformer/


saeedzou commented Jan 2, 2024

Great content. I love these simplifications of complex topics such as transformers. I would like to see more of this.
One thing I would like to ask is whether the encoder-decoder attention code is correct.
Shouldn't step 6, encoder-decoder attention,

Z_encoder_decoder = multi_head_encoder_decoder_attention(
    E, Z_self_attention, WQs, WKs, WVs
)

be instead

Z_encoder_decoder = multi_head_encoder_decoder_attention(
    encoder_output, Z_self_attention, WQs, WKs, WVs
)

@osanseviero
Owner

Thanks! That's correct, I fixed it in 1a380ec
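
For context, a minimal single-head sketch of why the first argument matters: in encoder-decoder attention the keys and values are built from the encoder output, while the queries come from the decoder's self-attention output. This is illustrative only, not the post's implementation; the softmax helper is just a stand-in similar to the one used earlier in the post.

import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

def encoder_decoder_attention(encoder_output, decoder_hidden, WQ, WK, WV):
    # Queries come from the decoder; keys and values come from the encoder output.
    Q = decoder_hidden @ WQ
    K = encoder_output @ WK
    V = encoder_output @ WV
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

# Toy shapes just to show it runs
d = 4
encoder_output = np.random.randn(2, d)   # 2 source tokens
decoder_hidden = np.random.randn(2, d)   # 2 target tokens after decoder self-attention
WQ, WK, WV = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d, d)
print(encoder_decoder_attention(encoder_output, decoder_hidden, WQ, WK, WV).shape)  # (2, 4)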


ijbo commented Jan 3, 2024

Thanks for the post and the explanation.

Though I am still reading the article, my question is regarding this paragraph:
"The first step is to turn each input token into a vector using an embedding algorithm. This is a learned encoding. Usually we use a big vector size such as 512, but let’s do 4 for our example so we can keep the maths manageable."

Hello -> [1,2,3,4] World -> [2,3,4,5]

Question 1: What is the embedding algorithm used to get Hello -> [1,2,3,4] and World -> [2,3,4,5]?
Question 2: Or, to start with, are these values (Hello -> [1,2,3,4], World -> [2,3,4,5]) initialized randomly and then improved during training, as suggested by "This is a learned encoding"?
Question 3: As per the paper, what is their choice? And why did they not use word2vec or some other already good embedding algorithm, rather than learning it from scratch?

@edmundronald

@ijbo asks an interesting question. My answer, as of now with a novice's understanding, would be that in the case where one is creating a foundation model, the token ==> vector embedding function itself (and its canonical extension to the vocabulary) is one of the main results of the training process. We could use this page to discuss this.

@osanseviero
Owner

osanseviero commented Jan 4, 2024

Hey @ijbo and @edmundronald. I updated the first section by adding a step 0 (tokenization) and providing significantly more context in the embedding step. Please let me know if that helps clarify things!

Although we could use a pre-trained word embedding (such as word2vec), transformers learn the best embedding for their specific task/data as part of their training process. This will help obtain the best representation for the task!
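
To make "learned encoding" concrete, here is a minimal sketch of an embedding lookup: a randomly initialized table with one row per vocabulary entry, whose values are then adjusted during training along with the rest of the model. The vocabulary and dimensions below are illustrative, not the post's exact values.

import numpy as np

vocab = {"Hello": 0, "World": 1}   # toy vocabulary
d_model = 4                        # embedding size, 4 as in the post's example

# Starts random; during training these numbers are updated like any other weight.
embedding_table = np.random.randn(len(vocab), d_model)

def embed(tokens):
    # One row per token: the result has shape (sequence_length, d_model).
    return embedding_table[[vocab[t] for t in tokens]]

E = embed(["Hello", "World"])
print(E.shape)  # (2, 4)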

Thanks for the feedback!


ijbo commented Jan 4, 2024

Thanks for the update and the explanation.
Regarding the next section, Positional encoding:

Question 1: Can you please describe what d_model is and its significance in the denominator?
Question 2: What is the contextual meaning of pos / 10000^(2i/d_model) with respect to positional information?
Question 3: Does "i" in the formula, as per the above explanation, mean the position of the characters in "Hello"?

@simonrnss

Great doc -- really useful!
Got a quick question about the decoder -- your sequence variable gets re-assigned each time through the loop as just the embedding of the most recent token. But if I understand correctly, the decoder is always given multiple embeddings (with masks used to stop it attending to things in the future). Can you clarify, please? If the decoder is given more than one row at once, how does it convert its output into a single set of probabilities? Thanks!


Great article!

@osanseviero
Owner

Hi @simonrnss! Thanks for the question! I fixed a bug + improved the last section to be clearer. I recommend re-reading the "Generating the output sequence" section or looking at the commit


ijbo commented Jan 4, 2024

This explains the choice of the positional embedding function: https://youtu.be/1biZfFLPRSY


jbash commented Jan 4, 2024

The embedding above has no information about the position of the word in the sentence

Sure it does. It unambiguously encodes the exact position of every token. The first token is the first row, and the second token is the second row. You might want to explain at this point why that's not usable by the following steps.

@simonrnss

Thanks @osanseviero -- that matches my understanding better now. For anyone else looking at that section, it's also useful to look at the definition of the greedy_decode method in the annotated transformer (https://nlp.seas.harvard.edu/2018/04/03/attention.html)
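
For anyone skimming, a rough sketch of that greedy loop. The decode argument is a hypothetical stand-in for the decoder stack plus final linear/softmax; only the shapes matter here.

import numpy as np

def greedy_generate(decode, encoder_output, start_id, end_id, max_len=10):
    # The decoder always receives the whole prefix generated so far; a causal mask
    # (inside decode) keeps it from attending to future positions. Only the last
    # row of the decoder output is used to pick the next token.
    generated = [start_id]
    for _ in range(max_len):
        logits = decode(encoder_output, generated)  # shape: (len(generated), vocab_size)
        next_id = int(np.argmax(logits[-1]))        # last row = next-token scores
        generated.append(next_id)
        if next_id == end_id:
            break
    return generated

# Toy stand-in for the real decoder, just to show the loop runs:
toy_decode = lambda enc, prefix: np.random.randn(len(prefix), 5)
print(greedy_generate(toy_decode, encoder_output=None, start_id=0, end_id=4))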


It looks great, thank you! How about training? Can you please clarify that in another post, especially how K, Q, V are updated?

@Nikolaj-K

Spotted what is probably a typo, namely in the sentence
"These brings some interesting properties"

On that note, the sentence "The embedding above has no information about the position of the word in the sentence" is a bit strange, given that the matrix E reflects the order of the words. From E you would still know it, but it is not part of the matrix entry values themselves.

Finally, maybe you can tweak the page such that the section counting doesn't restart at each chapter.

Best.

@simonrnss


FWIW, I guess a wording like "The individual embeddings in the matrix contain no information about the position of the words in the sentence" would be clearer for that bit.

@osanseviero
Owner

Thanks all! I clarified that bit; thanks for the feedback/ideas

@nikilpatel94

Good, detailed, intuitive post. The most important part was doing it with simple NumPy; it is the simplest implementation I have seen so far.
However, I have a concern about the last part, "Generating the output sequence". With the random nature of everything and a small vocab size, the generated results seem random rather than 'translated'. If I add more irrelevant words to the vocab of either language, the generation becomes a bit more irrelevant, even with relatively larger embedding and matrix sizes. I faced this issue when I tried a different language pair.
Does this make sense?

@osanseviero
Owner

Thanks for the positive feedback @nikilpatel94

It makes sense that the output is random - we randomly initialized all weights. I chose a small vocabulary with cherry-picked words, and hence the probability of a relevant output is higher. But it's still random!

In practice, all model weights will be learned from a dataset. To do something like that, you can implement the whole network in PyTorch, as in https://nlp.seas.harvard.edu/annotated-transformer/, which offers a deeper dive into training code with a simple implementation. If you know a bit of PyTorch and have read this blog post, you should be able to jump into the annotated transformer.
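
If it helps, a minimal PyTorch sketch (not the post's code) of how WQ, WK, and WV get updated: they are ordinary learnable parameters, so gradients from whatever loss you use flow back into them and the optimizer adjusts them.

import torch
import torch.nn as nn

d_model = 4
WQ = nn.Linear(d_model, d_model, bias=False)  # learnable query projection
WK = nn.Linear(d_model, d_model, bias=False)  # learnable key projection
WV = nn.Linear(d_model, d_model, bias=False)  # learnable value projection
params = list(WQ.parameters()) + list(WK.parameters()) + list(WV.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(2, d_model)        # a toy "sentence" of 2 token embeddings
target = torch.randn(2, d_model)   # a made-up target, just to have a loss

Q, K, V = WQ(x), WK(x), WV(x)
weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)
out = weights @ V

loss = ((out - target) ** 2).mean()
loss.backward()    # gradients flow into WQ, WK, WV
optimizer.step()   # gradient descent updates them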


scores1

array([[4.67695573e-10, 1.00000000e+00],
       [1.11377182e-12, 1.00000000e+00]])

attention1 = scores1 @ V1
attention1

array([[7.99, 8.84, 6.84],
       [7.99, 8.84, 6.84]])

scores1 has a different shape from V1.
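
If it helps anyone reading along: scores1 and V1 do not need to have the same shape; matrix multiplication only requires the inner dimensions to match. A quick check below (V1's first row is made up; the second row is inferred from the output above, since almost all of the attention weight falls on it):

import numpy as np

scores1 = np.array([[4.67695573e-10, 1.0],
                    [1.11377182e-12, 1.0]])   # (2, 2): one weight per (query, key) pair

V1 = np.array([[1.0, 2.0, 3.0],               # made-up first value vector
               [7.99, 8.84, 6.84]])           # (2, 3): one value vector per token

attention1 = scores1 @ V1                     # (2, 2) @ (2, 3) -> (2, 3)
print(attention1)                             # both rows are approximately [7.99, 8.84, 6.84]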


I think this is a very good post. I don't understand the mathematical part yet, but I'm trying to understand it.
I left a comment because of a question and an error I ran into while executing the sample code.
In the Linear layer section you wrote

def linear(x, W, b):
    return np.dot(x, W) + b

and

x = linear([1, 0, 1, 0], np.random.randn(4, 10), np.random.randn(10))

but in this case I get an error in softmax(x):

AxisError                                 Traceback (most recent call last)
Cell In[48], line 1
----> 1 softmax(x)

Cell In[10], line 2, in softmax(x)
      1 def softmax(x):
----> 2     return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

File ~/venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:2313, in sum(a, axis, dtype, out, keepdims, initial, where)
   2310 return out
   2311 return res
-> 2313 return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
   2314     initial=initial, where=where)

File ~/venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:88, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     85 else:
     86     return reduction(axis=axis, out=out, **passkwargs)
---> 88 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

AxisError: axis 1 is out of bounds for array of dimension 1
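
The error comes from the input being 1-D: linear([1, 0, 1, 0], ...) returns a vector with a single axis, so np.sum(..., axis=1) has nothing to reduce over. One way around it (a sketch, not the post's code) is to pass a 2-D batch of one row, or to sum over axis=-1:

import numpy as np

def softmax(x):
    # axis=-1 works for both a 1-D vector and a 2-D (batch, features) array.
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

def linear(x, W, b):
    return np.dot(x, W) + b

# Option 1: keep the input 2-D (a batch with one row).
x = linear(np.array([[1, 0, 1, 0]]), np.random.randn(4, 10), np.random.randn(10))
print(softmax(x).shape)  # (1, 10)

# Option 2: keep the 1-D input and rely on axis=-1 above.
x = linear([1, 0, 1, 0], np.random.randn(4, 10), np.random.randn(10))
print(softmax(x).shape)  # (10,)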
