hackerllama/blog/posts/random_transformer/ #3
Comments
Great content. I love these simplifications of complex topics such as transformers. I would like to see more of this. One suggestion:

Z_encoder_decoder = multi_head_encoder_decoder_attention(
    E, Z_self_attention, WQs, WKs, WVs
)

should instead be

Z_encoder_decoder = multi_head_encoder_decoder_attention(
    encoder_output, Z_self_attention, WQs, WKs, WVs
)
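For readers skimming the thread, here is a minimal single-head sketch of what such an encoder-decoder (cross-)attention call computes, assuming queries come from the decoder's self-attention output and keys/values from the encoder output. The function body is illustrative, not the post's actual implementation (the multi-head version presumably repeats this per head over WQs/WKs/WVs and concatenates):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(encoder_output, Z_self_attention, WQ, WK, WV):
    # Queries come from the decoder side; keys and values come from the encoder output.
    Q = Z_self_attention @ WQ
    K = encoder_output @ WK
    V = encoder_output @ WV
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V
```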
Thanks! That's correct, I fixed it in 1a380ec
Thanks for the post and explanation. Though I am still reading the article, my question is about this part: Hello -> [1,2,3,4], World -> [2,3,4,5]. Question 1: what is the embedding algorithm used to get Hello -> [1,2,3,4] and World -> [2,3,4,5]?
Ijbo asks an interesting question. My answer (as of now, with a novice's understanding) would be that, in the case where one is creating a foundation model, the token ==> vector embedding function itself (and its canonical extension to the vocabulary) is one of the main results of the training process. We could use this page to discuss this.
Hey @ijbo and @edmundronald. I updated the first section by adding a step 0 (tokenization) and providing significantly more context in the embedding step. Please let me know if that helps clarify things! Although we could use a pre-trained word embedding (such as word2vec), transformers learn the best embedding for their specific task/data as part of their training process. This helps obtain the best representation for the task! Thanks for the feedback!
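To make the tokenization and embedding steps concrete, here is a minimal sketch of an embedding lookup. The vocabulary and the row values are made up to match the toy example; in a real transformer the table is randomly initialized and its entries are learned during training:

```python
import numpy as np

# Step 0 (tokenization): map each token to an integer id. Toy vocabulary for illustration.
vocab = {"Hello": 0, "World": 1}

# Step 1 (embedding): one d_model-dimensional row per token id.
embedding_table = np.array([
    [1.0, 2.0, 3.0, 4.0],   # "Hello"
    [2.0, 3.0, 4.0, 5.0],   # "World"
])

tokens = ["Hello", "World"]
ids = [vocab[t] for t in tokens]
E = embedding_table[ids]    # shape (seq_len, d_model) = (2, 4)
print(E)
```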
Thanks for the update and explanation. Question 1: can you please describe what d_model is, and its significance in the denominator?
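For other readers: d_model is the dimensionality of the token embeddings (4 in the toy example above). Which denominator the question refers to is an assumption on my part, but in the original Transformer, d_model (or the per-head key dimension d_k derived from it) appears in a denominator in two standard places, shown here for reference:

```latex
% Sinusoidal positional encoding: d_model is the embedding dimension
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right)

% Scaled dot-product attention: d_k is the key dimension (d_model split across heads)
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```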
Great doc -- really useful!

Great article!
Hi @simonrnss! Thanks for the question! I fixed a bug and improved the last section to be clearer. I recommend re-reading the "Generating the output sequence" section or looking at the commit
This explains the choice of the positional embedding function: https://youtu.be/1biZfFLPRSY
Sure it does. It unambiguously encodes the exact position of every token. The first token is the first row, and the second token is the second row. You might want to explain at this point why that's not usable by the following steps. |
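For anyone following along, here is a minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer paper, which I assume is the function being discussed; adding it to E bakes each token's position into the embedding values themselves, rather than leaving it only in the row order:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]             # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates               # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

E = np.array([[1, 2, 3, 4], [2, 3, 4, 5]], dtype=float)  # embeddings from the toy example
E_with_position = E + positional_encoding(*E.shape)      # position is now part of the values
```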
Thanks @osanseviero -- that matches my understanding better now. For anyone else looking at that section, it's also useful to look at the definition of the |
It looks great, thank you! How about training? Could you clarify that in another post, especially how K, Q, and V are updated?
Spotted what is probably a typo in one of the sentences. Also, the sentence "The embedding above has no information about the position of the word in the sentence" is a bit strange, given that the matrix E reflects the order of the words: from E you'd still know the position, it's just not part of the matrix entry values themselves. Finally, maybe you can tweak the page so that the section numbering doesn't restart at each chapter. Best.
FWIW, I guess a wording like "The individual embeddings in the matrix contain no information about the position of the words in the sentence" would be clearer for that bit. |
Thanks all! I clarified that bit; thanks for the feedback/ideas |
Good, detailed, intuitive post. The most important part was doing it with plain NumPy; it's the simplest implementation I have seen so far.
Thanks for the positive feedback @nikilpatel94! It makes sense that the output is random: we randomly initialized all the weights. I chose a small vocabulary with cherry-picked words, so the probability of a relevant output is higher, but it's still random. In practice, all model weights are learned from a dataset. To do something like that, you can implement the whole network in PyTorch, as in https://nlp.seas.harvard.edu/annotated-transformer/, for a deeper dive into training code with a simple implementation. If you know a bit of PyTorch and have read this blog post, you should be able to jump into the annotated transformer.
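As a tiny illustration of that point (the vocabulary and numbers here are made up, not from the post): with randomly initialized weights, the final projection produces an essentially arbitrary distribution over the vocabulary, whatever the input is:

```python
import numpy as np

np.random.seed(0)
vocab = ["hello", "world", "how", "are", "you"]   # toy vocabulary, made up
d_model = 4

W_out = np.random.randn(d_model, len(vocab))      # randomly initialized output projection
decoder_state = np.random.randn(d_model)          # stand-in for the decoder's final hidden state

logits = decoder_state @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))           # arbitrary probabilities until the weights are trained
```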
scores1
array([[4.67695573e-10, 1.00000000e+00], ...

attention1 = scores1 @ V1
array([[7.99, 8.84, 6.84], ...

scores1 has a different shape from V1.
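In case the shapes trip anyone else up, a small sketch with assumed toy dimensions: scores1 and V1 are not supposed to have the same shape. scores1 is (seq_len, seq_len) while V1 is (seq_len, d_v), and the matrix product only needs the inner dimensions to agree, giving attention1 the shape (seq_len, d_v):

```python
import numpy as np

np.random.seed(0)
seq_len, d_v = 2, 3                           # assumed toy sizes for a 2-token example
scores1 = np.array([[4.67695573e-10, 1.0],    # (seq_len, seq_len): one attention weight per token pair
                    [0.5, 0.5]])              # second row is made up for illustration
V1 = np.random.randn(seq_len, d_v)            # (seq_len, d_v): one value vector per token

attention1 = scores1 @ V1                     # (2, 2) @ (2, 3) -> (2, 3)
print(attention1.shape)                       # (2, 3)
```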
I think this is a very good post. I don't understand the mathematical part yet, but I'm trying to.
hackerllama - The Random Transformer
Understand how transformers work by demystifying all the math behind them
https://osanseviero.github.io/hackerllama/blog/posts/random_transformer/