-
This comes from the resolution of the coarse matching: a coarse match is computed once every 8 pixels. The input to the Vision Transformer is another reason, because the token size is reduced. This is why it is recommended to use an image size that is divisible by 8.
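If it helps, here is a minimal sketch of picking the nearest smaller size that is divisible by 8 before resizing or cropping an image. The helper names `round_down_to_multiple` and `divisible_size` are hypothetical and are not part of the model's code; the factor of 8 is taken from the coarse matching stride mentioned above.

```python
from typing import Tuple


def round_down_to_multiple(size: int, factor: int = 8) -> int:
    """Return the largest multiple of `factor` that is <= size.

    Sizes smaller than `factor` are raised to `factor` so the result is never 0.
    """
    return max(factor, (size // factor) * factor)


def divisible_size(height: int, width: int, factor: int = 8) -> Tuple[int, int]:
    """Hypothetical helper: (height, width) adjusted to multiples of `factor`."""
    return (
        round_down_to_multiple(height, factor),
        round_down_to_multiple(width, factor),
    )


if __name__ == "__main__":
    # A 481 x 642 image would be resized or cropped to 480 x 640.
    print(divisible_size(481, 642))
```

Rounding down and then cropping keeps the pixel scale unchanged; resizing to the new size instead changes the scale slightly, so which one to use depends on the application.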
-
The model seems to run fine even when the image size is not divisible by 8. Why, then, is it stated that the size should be divisible by 8?