Replies: 2 comments
-
I have trained LoFTR without the pos-embedding, and I find that the performance degrades.
-
I think positional encoding is a key component of LoFTR, which gives it the ability to generate position-aware features and motion-consistent matching results. I agree that the positional encodings of an image pair are different at the very beginning, but they eventually converge to the same point after several cross-attention layers.
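One rough way to sanity-check this claim would be to track how similar the features at corresponding locations become across cross-attention layers. The skeleton below uses a random `nn.MultiheadAttention` and random encodings purely to show the measurement; to actually test the claim you would run the trained LoFTR layers instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Skeleton diagnostic (not from the LoFTR repo): two copies of the same feature
# map get different positional encodings, then pass through a few cross-attention
# layers; we check whether features at corresponding indices drift back together.
dim, heads, L = 256, 8, 60 * 80
attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for trained layers

feat = torch.randn(1, L, dim)
feat0, feat1 = feat.clone(), feat.clone()                    # identical content: 1-to-1 correspondence
pos0, pos1 = torch.randn(1, L, dim), torch.randn(1, L, dim)  # different encodings per "image"
feat0, feat1 = feat0 + pos0, feat1 + pos1

with torch.no_grad():
    for layer in range(4):
        # cross-attention in both directions (LoFTR's self/cross interleaving omitted)
        feat0 = feat0 + attn(feat0, feat1, feat1, need_weights=False)[0]
        feat1 = feat1 + attn(feat1, feat0, feat0, need_weights=False)[0]
        sim = F.cosine_similarity(feat0, feat1, dim=-1).mean()
        print(f"layer {layer}: mean cosine similarity at corresponding points = {sim.item():.3f}")
```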
-
I am looking into the transformer encoder in LoFTR, and I currently think that the position embedding seems to drag the performance down.
In the paper, the ablation study shows that applying the pos-embedding at every encoder layer hurts performance, and I have also run some experiments which show that the pos-embedding doesn't give the network more power.
I looked into the implementation of the pos-embedding. Unlike its usage in other research work (object detection such as DETR, or COTR for feature matching), the position embedding in LoFTR is added to the query, key, and value alike.
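To make that difference concrete, here is a minimal sketch of the two conventions, using a standard `nn.MultiheadAttention` instead of LoFTR's linear attention and made-up shapes; it is only an illustration, not the actual LoFTR or DETR code.

```python
import torch
import torch.nn as nn

def loftr_style_attention(attn: nn.MultiheadAttention, feat, pos):
    # LoFTR-style: the positional encoding is added to the feature map once,
    # so query, key, AND value all carry positional information.
    x = feat + pos
    return attn(x, x, x, need_weights=False)[0]

def detr_style_attention(attn: nn.MultiheadAttention, feat, pos):
    # DETR-style: the positional encoding is added only to query and key;
    # the value stays position-free, so the output aggregates pure content features.
    q = k = feat + pos
    return attn(q, k, feat, need_weights=False)[0]

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
feat = torch.randn(2, 60 * 80, 256)  # e.g. a 60x80 coarse feature map, flattened
pos = torch.randn(2, 60 * 80, 256)   # stand-in for the 2D sinusoidal encoding
print(loftr_style_attention(attn, feat, pos).shape)  # torch.Size([2, 4800, 256])
print(detr_style_attention(attn, feat, pos).shape)   # torch.Size([2, 4800, 256])
```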
Based on the points above, I want to share my thoughts on the position embedding in LoFTR, and I hope someone can leave comments.
First, I think the position embedding in LoFTR only makes sense for self-attention and becomes a drawback in cross-attention. Suppose a feature point in the left image is building a relation with its correspondence in the right image: it will get confused, because the position embeddings at these two points are totally different in most cases (their pixel coordinates are different). The good case, I think, would be the opposite: when a feature point in the left image attends to its correspondence, it should be more confident if the position embeddings at the two points are almost the same. In other words, the position embedding currently acts more like noise on the image features, which would explain why adding the position embedding at every encoder layer strangely drags the performance down.
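To illustrate the idea, a hypothetical variant could keep the pos-embedding in self-attention but drop it in cross-attention, roughly like this (standard attention modules, not the shipped LoFTR layers; injecting the encoding only into queries and keys of self-attention is my own choice here):

```python
import torch
import torch.nn as nn

class SelfCrossBlock(nn.Module):
    """Hypothetical variant: positional encoding used in self-attention only
    and dropped in cross-attention, following the reasoning above."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat0, feat1, pos0, pos1):
        # Self-attention: positions help each image relate its own locations.
        feat0 = feat0 + self.self_attn(feat0 + pos0, feat0 + pos0, feat0, need_weights=False)[0]
        feat1 = feat1 + self.self_attn(feat1 + pos1, feat1 + pos1, feat1, need_weights=False)[0]
        # Cross-attention: positions are omitted, since corresponding points in the
        # two images generally sit at different pixel coordinates.
        feat0 = feat0 + self.cross_attn(feat0, feat1, feat1, need_weights=False)[0]
        feat1 = feat1 + self.cross_attn(feat1, feat0, feat0, need_weights=False)[0]
        return feat0, feat1
```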
Second, about where a position embedding can work well: based on the papers I have read, I think it works better with a transformer decoder. In a transformer decoder, a set of queries embedded with a query positional encoding builds relations with the memory output by the transformer encoder, and that query pos serves the same purpose as the position embedding here. The positional information enables each query to attend to the targeted parts of the memory and generate more meaningful information.
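For reference, a rough DETR-style decoder step would look something like this (simplified and hypothetical; in the real DETR the encoder's spatial encoding is also added to the memory keys):

```python
import torch
import torch.nn as nn

# Learned query embeddings act as the positional information, steering each
# query toward targeted parts of the memory.
num_queries, dim = 100, 256
query_pos = nn.Embedding(num_queries, dim)   # learned "query pos"
tgt = torch.zeros(1, num_queries, dim)       # decoder queries start from zeros
memory = torch.randn(1, 60 * 80, dim)        # encoder output (the "memory")

cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
# The query pos is added to the queries only; the values remain plain memory features.
q = tgt + query_pos.weight.unsqueeze(0)
out, _ = cross_attn(q, memory, memory, need_weights=False)
print(out.shape)  # torch.Size([1, 100, 256])
```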
So I wonder whether anyone has tried to train LoFTR without any pos-embedding to see how that affects performance, or whether anyone can give an explanation for the phenomenon above.