Why should the position encoding change along the dimension of the word embedding?
Shouldn't the entire embedding just be multiplied element-wise by a single constant?
Consider the sentence "john, went, to, the, hallway". Doesn't it suffice to multiply "john" element-wise by a small constant, say 0.1, and the last word, "hallway", by a larger one?
I am trying to understand the reason behind varying the weight of the position encoding along the dimensions of a word embedding.
deepaksuresh changed the title from "Can't position embedding be like temporal embedding" to "Can't position encoding be like temporal encoding" on Feb 24, 2019.
Temporal embeddings: these are added to each sentence representation to preserve the ordering of the sentences in the story. They play the same role as the "position embeddings" used in Transformer models.
Position encoding: this is a simple trick to preserve the ordering of words within a sentence. Because we encode a sentence as a bag of words, anything like "adding an embedding" will not work (the word vectors are summed, so an added position vector would contribute the same offset regardless of which word sits at which position), which is why we use a multiplication instead. But multiplying by a scalar is not a good idea either, because you cannot distinguish `0.5*room + 0.5*room` from `1.0*room`. In addition, a word multiplied by 0.1 is less likely to affect the output, so such a scalar becomes an unnecessary bias in the model.
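For concreteness, here is a minimal NumPy sketch of a per-dimension position-encoding scheme of the kind described in the End-to-End Memory Networks paper, `l_kj = (1 - j/J) - (k/d)(1 - 2j/J)`, contrasted with a single scalar weight per position. The toy vocabulary, random embeddings, and variable names are illustrative assumptions, not this repo's actual code.

```python
import numpy as np

def position_encoding(sentence_len: int, embed_dim: int) -> np.ndarray:
    """Weight matrix L[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J),
    with 1-indexed word position j and embedding dimension k
    (the scheme described in the End-to-End Memory Networks paper)."""
    J, d = sentence_len, embed_dim
    j = np.arange(1, J + 1)[:, None]  # word positions, shape (J, 1)
    k = np.arange(1, d + 1)[None, :]  # embedding dims,  shape (1, d)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

# Toy embeddings for "john went to the hallway" (random, for illustration only).
rng = np.random.default_rng(0)
words = ["john", "went", "to", "the", "hallway"]
E = np.stack([rng.normal(size=8) for _ in words])    # (J, d)

# Bag-of-words sentence representation with position encoding:
# weight every word element-wise by its position row, then sum.
L = position_encoding(len(words), E.shape[1])        # (J, d)
m = (L * E).sum(axis=0)

# Reversing the word order changes the representation, so ordering survives the sum.
m_reversed = (L * E[::-1]).sum(axis=0)
print(np.allclose(m, m_reversed))                    # False

# With a single scalar per position the encoding is degenerate:
# 0.5*room + 0.5*room collapses to exactly 1.0*room.
room = rng.normal(size=8)
print(np.allclose(0.5 * room + 0.5 * room, 1.0 * room))  # True

# Temporal embeddings are a separate, additive term per memory slot:
# m_i = (L * E_i).sum(axis=0) + T_A[i], with T_A learned and indexed by the
# sentence's position in the story (a sketch of the idea, not this repo's code).
```

The point of the sketch is that the weight varies with the embedding dimension `k` as well as the position `j`, so different orderings of the same words produce different sums, whereas a single scalar per position cannot achieve this without also shrinking some words' contributions toward zero.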