
hackerllama/blog/posts/random_transformer/ #3

Open
utterances-bot opened this issue Jan 2, 2024 · 20 comments

Comments

@utterances-bot

hackerllama - The Random Transformer

Understand how transformers work by demystifying all the math behind them

https://osanseviero.github.io/hackerllama/blog/posts/random_transformer/


saeedzou commented Jan 2, 2024

Great content. I love these simplifications of complex topics such as transformers. I would like to see more of this.
One thing I would like to ask is whether the encoder-decoder attention code is correct.
Shouldn't step 6, encoder-decoder attention,

Z_encoder_decoder = multi_head_encoder_decoder_attention(
    E, Z_self_attention, WQs, WKs, WVs
)

be instead

Z_encoder_decoder = multi_head_encoder_decoder_attention(
    encoder_output, Z_self_attention, WQs, WKs, WVs
)

@osanseviero
Owner

Thanks! That's correct, I fixed it in 1a380ec
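
For context, a minimal single-head sketch of why the first argument matters: in encoder-decoder attention the keys and values are built from the encoder output, while the queries come from the decoder's self-attention output. This is illustrative only, not the post's implementation; the softmax helper is just a stand-in similar to the one used earlier in the post.

import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

def encoder_decoder_attention(encoder_output, decoder_hidden, WQ, WK, WV):
    # Queries come from the decoder; keys and values come from the encoder output.
    Q = decoder_hidden @ WQ
    K = encoder_output @ WK
    V = encoder_output @ WV
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

# Toy shapes just to show it runs
d = 4
encoder_output = np.random.randn(2, d)   # 2 source tokens
decoder_hidden = np.random.randn(2, d)   # 2 target tokens after decoder self-attention
WQ, WK, WV = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d, d)
print(encoder_decoder_attention(encoder_output, decoder_hidden, WQ, WK, WV).shape)  # (2, 4)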


ijbo commented Jan 3, 2024

Thanks for the post and the explanation.

Though I am still reading the article, my question is regarding this paragraph:
"The first step is to turn each input token into a vector using an embedding algorithm. This is a learned encoding. Usually we use a big vector size such as 512, but let’s do 4 for our example so we can keep the maths manageable."

Hello -> [1,2,3,4] World -> [2,3,4,5]

Question 1: What is the embedding algorithm used to get Hello -> [1,2,3,4] and World -> [2,3,4,5]?
Question 2: Or, to start with, are these values (Hello -> [1,2,3,4], World -> [2,3,4,5]) initialized randomly and then improved during training, as suggested by "This is a learned encoding"?
Question 3: As per the paper, what is their choice? And why did they not use word2vec or some other already good embedding algorithm, rather than learning it from scratch?

@edmundronald

@ijbo asks an interesting question. My answer, as of now with a novice's understanding, would be that in the case where one is creating a foundation model, the token ==> vector embedding function itself (and its canonical extension to the vocabulary) is one of the main results of the training process. We could use this page to discuss this.

@osanseviero
Owner

osanseviero commented Jan 4, 2024

Hey @ijbo and @edmundronald. I updated the first section by adding a step 0 (tokenization) and providing significantly more context in the embedding step. Please let me know if that helps clarify things!

Although we could use a pre-trained word embedding (such as word2vec), transformers learn the best embedding for their specific task/data as part of their training process. This will help obtain the best representation for the task!
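
To make "learned encoding" concrete, here is a minimal sketch of an embedding lookup: a randomly initialized table with one row per vocabulary entry, whose values are then adjusted during training along with the rest of the model. The vocabulary and dimensions below are illustrative, not the post's exact values.

import numpy as np

vocab = {"Hello": 0, "World": 1}   # toy vocabulary
d_model = 4                        # embedding size, 4 as in the post's example

# Starts random; during training these numbers are updated like any other weight.
embedding_table = np.random.randn(len(vocab), d_model)

def embed(tokens):
    # One row per token: the result has shape (sequence_length, d_model).
    return embedding_table[[vocab[t] for t in tokens]]

E = embed(["Hello", "World"])
print(E.shape)  # (2, 4)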

Thanks for the feedback!


ijbo commented Jan 4, 2024

Thanks for the update and the explanation.
Regarding the next section, Positional encoding:

Question 1: Can you please describe what d_model is and its significance in the denominator?
Question 2: What is the contextual meaning of pos / 10000^(2i/d_model) with respect to positional information?
Question 3: Does "i" in the formula, as per the above explanation, mean the position of the characters in "Hello"?

@simonrnss

Great doc -- really useful!
Got a quick question about the decoder -- your sequence variable gets re-assigned each time through the loop as just the embedding of the most recent token. But if I understand correctly, the decoder is always given multiple embeddings (with masks used to stop it attending to things in the future). Can you clarify, please? If the decoder is given more than one row at once, how does it convert its output into a single set of probabilities? Thanks!


Great article!

@osanseviero
Owner

Hi @simonrnss! Thanks for the question! I fixed a bug + improved the last section to be clearer. I recommend re-reading the "Generating the output sequence" section or looking at the commit


ijbo commented Jan 4, 2024

This explains the choice of the positional embedding function: https://youtu.be/1biZfFLPRSY


jbash commented Jan 4, 2024

The embedding above has no information about the position of the word in the sentence

Sure it does. It unambiguously encodes the exact position of every token. The first token is the first row, and the second token is the second row. You might want to explain at this point why that's not usable by the following steps.

@simonrnss

Thanks @osanseviero -- that matches my understanding better now. For anyone else looking at that section, it's also useful to look at the definition of the greedy_decode method in the annotated transformer (https://nlp.seas.harvard.edu/2018/04/03/attention.html)
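
For anyone skimming, a rough sketch of that greedy loop. The decode argument is a hypothetical stand-in for the decoder stack plus final linear/softmax; only the shapes matter here.

import numpy as np

def greedy_generate(decode, encoder_output, start_id, end_id, max_len=10):
    # The decoder always receives the whole prefix generated so far; a causal mask
    # (inside decode) keeps it from attending to future positions. Only the last
    # row of the decoder output is used to pick the next token.
    generated = [start_id]
    for _ in range(max_len):
        logits = decode(encoder_output, generated)  # shape: (len(generated), vocab_size)
        next_id = int(np.argmax(logits[-1]))        # last row = next-token scores
        generated.append(next_id)
        if next_id == end_id:
            break
    return generated

# Toy stand-in for the real decoder, just to show the loop runs:
toy_decode = lambda enc, prefix: np.random.randn(len(prefix), 5)
print(greedy_generate(toy_decode, encoder_output=None, start_id=0, end_id=4))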


It looks great, thank you! How about training? Can you please clarify that in another post, especially how K, Q, V are updated?

@Nikolaj-K

Spotted what is probably a typo, namely in the sentence
"These brings some interesting properties"

On that note, the sentence "The embedding above has no information about the position of the word in the sentence" is a bit strange, given that the matrix E reflects the order of the words. From E you would still know it, but it is not part of the matrix entry values themselves.

Finally, maybe you can tweak the page such that the section counting doesn't restart at each chapter.

Best.

@simonrnss


FWIW, I guess a wording like "The individual embeddings in the matrix contain no information about the position of the words in the sentence" would be clearer for that bit.

@osanseviero
Owner

Thanks all! I clarified that bit; thanks for the feedback/ideas

@nikilpatel94

Good, detailed, intuitive post. The most important part was doing it with simple NumPy; it is the simplest implementation I have seen so far.
However, I have a concern about the last part, "Generating the output sequence". With the random nature of everything and a small vocab size, the generated results seem random rather than 'translated'. If I add more irrelevant words to the vocab of either language, the generation becomes a bit more irrelevant, even with relatively larger embedding and matrix sizes. I faced this issue when I tried a different language pair.
Does this make sense?

@osanseviero
Owner

Thanks for the positive feedback @nikilpatel94

It makes sense that the output is random - we randomly initialized all weights. I chose a small vocabulary with cherry-picked words, and hence the probability of a relevant output is higher. But it's still random!

In practice, all model weights will be learned from a dataset. To do something like that, you can implement the whole network in PyTorch, as in https://nlp.seas.harvard.edu/annotated-transformer/, which offers a deeper dive into training code with a simple implementation. If you know a bit of PyTorch and have read this blog post, you should be able to jump into the annotated transformer.
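
If it helps, a minimal PyTorch sketch (not the post's code) of how WQ, WK, and WV get updated: they are ordinary learnable parameters, so gradients from whatever loss you use flow back into them and the optimizer adjusts them.

import torch
import torch.nn as nn

d_model = 4
WQ = nn.Linear(d_model, d_model, bias=False)  # learnable query projection
WK = nn.Linear(d_model, d_model, bias=False)  # learnable key projection
WV = nn.Linear(d_model, d_model, bias=False)  # learnable value projection
params = list(WQ.parameters()) + list(WK.parameters()) + list(WV.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(2, d_model)        # a toy "sentence" of 2 token embeddings
target = torch.randn(2, d_model)   # a made-up target, just to have a loss

Q, K, V = WQ(x), WK(x), WV(x)
weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)
out = weights @ V

loss = ((out - target) ** 2).mean()
loss.backward()    # gradients flow into WQ, WK, WV
optimizer.step()   # gradient descent updates them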


scores1

array([[4.67695573e-10, 1.00000000e+00],
       [1.11377182e-12, 1.00000000e+00]])

attention1 = scores1 @ V1
attention1

array([[7.99, 8.84, 6.84],
       [7.99, 8.84, 6.84]])

scores1 has a different shape from V1.
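
If it helps anyone reading along: scores1 and V1 do not need to have the same shape; matrix multiplication only requires the inner dimensions to match. A quick check below (V1's first row is made up; the second row is inferred from the output above, since almost all of the attention weight falls on it):

import numpy as np

scores1 = np.array([[4.67695573e-10, 1.0],
                    [1.11377182e-12, 1.0]])   # (2, 2): one weight per (query, key) pair

V1 = np.array([[1.0, 2.0, 3.0],               # made-up first value vector
               [7.99, 8.84, 6.84]])           # (2, 3): one value vector per token

attention1 = scores1 @ V1                     # (2, 2) @ (2, 3) -> (2, 3)
print(attention1)                             # both rows are approximately [7.99, 8.84, 6.84]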


I think this is a very good post. I don't understand the mathematical part yet, but I'm trying to understand it.
I left a comment because of a question and an error I ran into while executing the sample code.
In the Linear layer section you wrote

def linear(x, W, b):
    return np.dot(x, W) + b

and

x = linear([1, 0, 1, 0], np.random.randn(4, 10), np.random.randn(10))

but in this case I get an error in softmax(x):

AxisError                                 Traceback (most recent call last)
Cell In[48], line 1
----> 1 softmax(x)

Cell In[10], line 2, in softmax(x)
      1 def softmax(x):
----> 2     return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

File ~/venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:2313, in sum(a, axis, dtype, out, keepdims, initial, where)
   2310 return out
   2311 return res
-> 2313 return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
   2314     initial=initial, where=where)

File ~/venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:88, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     85 else:
     86     return reduction(axis=axis, out=out, **passkwargs)
---> 88 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

AxisError: axis 1 is out of bounds for array of dimension 1
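
The error comes from the input being 1-D: linear([1, 0, 1, 0], ...) returns a vector with a single axis, so np.sum(..., axis=1) has nothing to reduce over. One way around it (a sketch, not the post's code) is to pass a 2-D batch of one row, or to sum over axis=-1:

import numpy as np

def softmax(x):
    # axis=-1 works for both a 1-D vector and a 2-D (batch, features) array.
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

def linear(x, W, b):
    return np.dot(x, W) + b

# Option 1: keep the input 2-D (a batch with one row).
x = linear(np.array([[1, 0, 1, 0]]), np.random.randn(4, 10), np.random.randn(10))
print(softmax(x).shape)  # (1, 10)

# Option 2: keep the 1-D input and rely on axis=-1 above.
x = linear([1, 0, 1, 0], np.random.randn(4, 10), np.random.randn(10))
print(softmax(x).shape)  # (10,)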
