Hello, thank you for your work and for sharing the code.
After reading your paper, I have some questions regarding the implementation of the visual prompt encoder.
From previous issues, I understand that a visual prompt is generated per class in each batch. This seems to imply that the MSDeformAttentionLayer must run multiple times, in proportion to the number of classes per image and, in turn, across the entire batch.
For example, in a batch with two images, where image 1 contains classes a, b, c and image 2 contains classes d, e, f, we would need to execute MSDeformAttentionLayer a total of 2 (images) * 6 (one prompt per class) times.
When the batch size or the number of classes per batch increases, this approach seems to demand a substantial amount of memory, which could make it hard to scale. Alternatively, it could be computed with a for loop, but that would also take a long time.
Could you clarify how you handle this demand?
Thank you!
Indeed, we use a for loop for this: the deformable attention layer is called once for each category of each image in the batch. While this may sound expensive, we found that this step accounts for only a small share of the training time, and the model is still fast to train overall.
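For anyone trying to picture the loop structure, here is a minimal sketch of what "one attention call per (image, class) prompt" could look like. This is not the repository's actual code: the `PromptCrossAttention` module, the `encode_visual_prompts` helper, and the use of plain `nn.MultiheadAttention` in place of the real `MSDeformAttentionLayer` are all assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn


class PromptCrossAttention(nn.Module):
    """Stand-in for MSDeformAttentionLayer (assumption): a single prompt query
    attends to the flattened image features. The real layer uses multi-scale
    deformable attention; ordinary multi-head attention is used here only so
    the sketch runs without custom CUDA ops."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query, memory, memory)  # (1, 1, dim)
        return out


def encode_visual_prompts(image_feats, prompt_queries, layer):
    """Run the attention layer once per (image, class) pair, as described above.

    image_feats:    list of B tensors, each (N_i, dim) -- flattened features
    prompt_queries: list of B dicts {class_id: (1, dim) prompt query}
    returns:        list of B dicts {class_id: (dim,) prompt embedding}
    """
    outputs = []
    for feats, queries in zip(image_feats, prompt_queries):
        memory = feats.unsqueeze(0)                  # (1, N_i, dim)
        per_image = {}
        for cls_id, q in queries.items():            # one call per class
            q = q.unsqueeze(0)                       # (1, 1, dim)
            per_image[cls_id] = layer(q, memory).squeeze(0).squeeze(0)
        outputs.append(per_image)
    return outputs


if __name__ == "__main__":
    dim = 256
    layer = PromptCrossAttention(dim)
    # two images with three classes each, mirroring the example in the question
    image_feats = [torch.randn(1000, dim), torch.randn(1200, dim)]
    prompt_queries = [
        {c: torch.randn(1, dim) for c in ("a", "b", "c")},
        {c: torch.randn(1, dim) for c in ("d", "e", "f")},
    ]
    prompts = encode_visual_prompts(image_feats, prompt_queries, layer)
    print([{k: v.shape for k, v in p.items()} for p in prompts])
```

Since each call processes a single query against one image's features, the peak memory per call stays small; the cost of the loop grows linearly with the number of (image, class) pairs rather than multiplicatively with batch size.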