Replies: 2 comments
-
I have trained LoFTR without the pos-embedding, and I find that the performance degrades.
-
I think positional encoding is a key component of LoFTR, which gives it the ability to generate position-aware features and motion-consistent matching results. I agree that the positional encodings of an image pair are different at the very beginning, but they eventually converge to the same point after several cross-attention layers.
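One rough way to sanity-check this claim would be to track how similar the features at corresponding locations become across cross-attention layers. The skeleton below uses a random `nn.MultiheadAttention` and random encodings purely to show the measurement; to actually test the claim you would run the trained LoFTR layers instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Skeleton diagnostic (not from the LoFTR repo): two copies of the same feature
# map get different positional encodings, then pass through a few cross-attention
# layers; we check whether features at corresponding indices drift back together.
dim, heads, L = 256, 8, 60 * 80
attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for trained layers

feat = torch.randn(1, L, dim)
feat0, feat1 = feat.clone(), feat.clone()                    # identical content: 1-to-1 correspondence
pos0, pos1 = torch.randn(1, L, dim), torch.randn(1, L, dim)  # different encodings per "image"
feat0, feat1 = feat0 + pos0, feat1 + pos1

with torch.no_grad():
    for layer in range(4):
        # cross-attention in both directions (LoFTR's self/cross interleaving omitted)
        feat0 = feat0 + attn(feat0, feat1, feat1, need_weights=False)[0]
        feat1 = feat1 + attn(feat1, feat0, feat0, need_weights=False)[0]
        sim = F.cosine_similarity(feat0, feat1, dim=-1).mean()
        print(f"layer {layer}: mean cosine similarity at corresponding points = {sim.item():.3f}")
```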
-
I am looking into the transformer encoder in LoFTR, and I currently think that the position embedding seems to drag the performance down.
In the paper, the ablation study shows that applying the pos-embedding at every encoder layer hurts performance, and I have also run some experiments which show that the pos-embedding doesn't give the network more power.
I looked into the implementation of the pos-embedding. Unlike its usage in other research work (object detection such as DETR, or COTR for feature matching), the position embedding in LoFTR is added to the query, key, and value alike.
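To make that difference concrete, here is a minimal sketch of the two conventions, using a standard `nn.MultiheadAttention` instead of LoFTR's linear attention and made-up shapes; it is only an illustration, not the actual LoFTR or DETR code.

```python
import torch
import torch.nn as nn

def loftr_style_attention(attn: nn.MultiheadAttention, feat, pos):
    # LoFTR-style: the positional encoding is added to the feature map once,
    # so query, key, AND value all carry positional information.
    x = feat + pos
    return attn(x, x, x, need_weights=False)[0]

def detr_style_attention(attn: nn.MultiheadAttention, feat, pos):
    # DETR-style: the positional encoding is added only to query and key;
    # the value stays position-free, so the output aggregates pure content features.
    q = k = feat + pos
    return attn(q, k, feat, need_weights=False)[0]

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
feat = torch.randn(2, 60 * 80, 256)  # e.g. a 60x80 coarse feature map, flattened
pos = torch.randn(2, 60 * 80, 256)   # stand-in for the 2D sinusoidal encoding
print(loftr_style_attention(attn, feat, pos).shape)  # torch.Size([2, 4800, 256])
print(detr_style_attention(attn, feat, pos).shape)   # torch.Size([2, 4800, 256])
```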
Based on the points above, I want to share my thoughts on the position embedding in LoFTR, and I hope someone can leave comments.
First, I think the position embedding in LoFTR only makes sense for self-attention and becomes a drawback in cross-attention. Suppose a feature point in the left image is building a relation with its correspondence in the right image: it will get confused, because the position embeddings at these two points are totally different in most cases (their pixel coordinates are different). The good case, I think, would be the opposite: when a feature point in the left image attends to its correspondence, it should be more confident if the position embeddings at the two points are almost the same. In other words, the position embedding currently acts more like noise on the image features, which would explain why adding the position embedding at every encoder layer strangely drags the performance down.
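To illustrate the idea, a hypothetical variant could keep the pos-embedding in self-attention but drop it in cross-attention, roughly like this (standard attention modules, not the shipped LoFTR layers; injecting the encoding only into queries and keys of self-attention is my own choice here):

```python
import torch
import torch.nn as nn

class SelfCrossBlock(nn.Module):
    """Hypothetical variant: positional encoding used in self-attention only
    and dropped in cross-attention, following the reasoning above."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat0, feat1, pos0, pos1):
        # Self-attention: positions help each image relate its own locations.
        feat0 = feat0 + self.self_attn(feat0 + pos0, feat0 + pos0, feat0, need_weights=False)[0]
        feat1 = feat1 + self.self_attn(feat1 + pos1, feat1 + pos1, feat1, need_weights=False)[0]
        # Cross-attention: positions are omitted, since corresponding points in the
        # two images generally sit at different pixel coordinates.
        feat0 = feat0 + self.cross_attn(feat0, feat1, feat1, need_weights=False)[0]
        feat1 = feat1 + self.cross_attn(feat1, feat0, feat0, need_weights=False)[0]
        return feat0, feat1
```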
Second, about where a position embedding can work well: based on the papers I have read, I think it works better with a transformer decoder. In a transformer decoder, a set of queries embedded with a query positional encoding builds relations with the memory output by the transformer encoder, and that query pos serves the same purpose as the position embedding here. The positional information enables each query to attend to the targeted parts of the memory and generate more meaningful information.
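For reference, a rough DETR-style decoder step would look something like this (simplified and hypothetical; in the real DETR the encoder's spatial encoding is also added to the memory keys):

```python
import torch
import torch.nn as nn

# Learned query embeddings act as the positional information, steering each
# query toward targeted parts of the memory.
num_queries, dim = 100, 256
query_pos = nn.Embedding(num_queries, dim)   # learned "query pos"
tgt = torch.zeros(1, num_queries, dim)       # decoder queries start from zeros
memory = torch.randn(1, 60 * 80, dim)        # encoder output (the "memory")

cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
# The query pos is added to the queries only; the values remain plain memory features.
q = tgt + query_pos.weight.unsqueeze(0)
out, _ = cross_attn(q, memory, memory, need_weights=False)
print(out.shape)  # torch.Size([1, 100, 256])
```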
So I wonder whether anyone has tried to train LoFTR without any pos-embedding to see how that affects performance, or whether anyone can give an explanation for the phenomenon above.