How can Visual Prompt Encoder handle a number of classes in batch? #89

Open
nijkah opened this issue Nov 6, 2024 · 1 comment

nijkah commented Nov 6, 2024

Hello, thank you for your work and for sharing the code.

After reading your paper, I have some questions regarding the implementation of the visual prompt encoder.

From previous issues, I understand that a visual prompt is generated per class in each batch. This seems to imply that the MSDeformAttentionLayer must run multiple times, proportional to the number of classes per image and then across the entire batch.

For example, in a batch with two images, image 1 with classes (a, b, c) and image 2 with classes (d, e, f), we would need to execute MSDeformAttentionLayer a total of 2 (images) × 6 (class prompts) = 12 times.
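
To make the count concrete, here is a small sketch of two ways the example could be counted. The `batch` dictionary and both function names are hypothetical and used only for illustration; which reading matches the actual implementation is part of what I am asking.

```python
# Hypothetical batch from the example above: image name -> classes it contains.
batch = {
    "image_1": ["a", "b", "c"],
    "image_2": ["d", "e", "f"],
}


def count_calls_all_pairs(batch):
    """Every image is paired with every class in the batch (the 2 * 6 reading)."""
    all_classes = {c for classes in batch.values() for c in classes}
    return len(batch) * len(all_classes)


def count_calls_own_classes(batch):
    """Each image only encodes prompts for the classes it contains."""
    return sum(len(classes) for classes in batch.values())


print(count_calls_all_pairs(batch))    # 12
print(count_calls_own_classes(batch))  # 6
```

Under either reading, the number of calls grows with both the batch size and the number of classes.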

When the batch size or the number of classes per batch increases, this approach seems to demand a substantial amount of memory, which could make it hard to scale. Alternatively, it could be computed with a for loop, but that would also take a long time.

Could you clarify how you handle this demand?

Thank you!

@Mountchicken
Collaborator

Hi @nijkah
Sorry for the late reply.

Indeed, we use a for loop for this. For each category of each image in each batch, the deformable attention layer is called once. While this may sound slow, we found that this part of training is not particularly time-consuming, and the model is still fast to train overall.
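
A minimal sketch of the loop structure described above, in plain Python. This is not the repository's code: `mean_box_encoder` is a hypothetical stand-in for the deformable attention layer, and the box format and function names are assumptions made so the example runs without dependencies.

```python
from typing import Callable, Dict, List, Sequence

Box = Sequence[float]      # assumed prompt format: (x1, y1, x2, y2)
Embedding = List[float]


def mean_box_encoder(image_features, boxes: Sequence[Box]) -> Embedding:
    """Stand-in for the real prompt encoder (NOT deformable attention).

    It ignores the image features and simply averages the box coordinates.
    """
    n = len(boxes)
    return [sum(box[i] for box in boxes) / n for i in range(4)]


def encode_visual_prompts(
    batch_features: Sequence,
    batch_prompts: Sequence[Dict[str, Sequence[Box]]],
    encoder: Callable = mean_box_encoder,
) -> List[Dict[str, Embedding]]:
    """Call the encoder once per (image, category) pair."""
    outputs = []
    for features, prompts in zip(batch_features, batch_prompts):
        per_image = {}
        for category, boxes in prompts.items():
            # One encoder call for this category of this image.
            per_image[category] = encoder(features, boxes)
        outputs.append(per_image)
    return outputs


# Usage with placeholder features and two images of prompts.
batch_features = [None, None]
batch_prompts = [
    {"a": [(0, 0, 10, 10), (10, 10, 20, 20)], "b": [(5, 5, 15, 15)]},
    {"d": [(1, 2, 3, 4)]},
]
encoded = encode_visual_prompts(batch_features, batch_prompts)
print(encoded[0]["a"])  # [5.0, 5.0, 15.0, 15.0]
```

Because each call handles only one category of one image, peak memory is bounded by a single call, and the cost is paid in sequential calls instead.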
