Cannot understand choice of mm_hidden_size 1024 #123

Open
jzyee opened this issue Aug 13, 2024 · 0 comments
jzyee commented Aug 13, 2024

I'm trying to understand how the spatial and temporal features fit into the projection layer. According to the config file on Hugging Face, mm_hidden_size is set to 1024.

Hugging Face link: https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1/blob/main/config.json

From what I understand, 100 frames are sampled from each video, and the CLIP encoder outputs a 1024-dimensional feature vector for each patch of each frame. Taking the temporal mean (averaging over frames) gives an array of shape (number of patches, 1024), and taking the spatial mean of each frame (averaging over its patches) gives an array of shape (100, 1024), one vector per frame. Does this mean the input to the projection layer has shape (number of patches + 100, 1024)?
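The pooling described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the repository's code: the 256 patches per frame (a 16×16 grid, as produced by CLIP ViT-L/14 at 224 px), the function name pool_video_features, and the order in which the two pooled sets are concatenated are all illustrative assumptions.

```python
import numpy as np

def pool_video_features(frame_features: np.ndarray) -> np.ndarray:
    """Hypothetical helper. frame_features: (num_frames, num_patches, dim)."""
    # Temporal pooling: average each patch position over all frames -> (num_patches, dim)
    temporal = frame_features.mean(axis=0)
    # Spatial pooling: average all patches within each frame -> (num_frames, dim)
    spatial = frame_features.mean(axis=1)
    # Concatenate along the token axis -> (num_patches + num_frames, dim)
    # (this order is an assumption made for illustration)
    return np.concatenate([temporal, spatial], axis=0)

# Assumed sizes: 100 frames, 256 patches per frame, 1024-d CLIP features.
features = np.zeros((100, 256, 1024), dtype=np.float32)
video_tokens = pool_video_features(features)
print(video_tokens.shape)  # (356, 1024)
```

Under these assumptions the pooled result is a set of 256 + 100 = 356 tokens, each still 1024-dimensional.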

I don't understand how a projection layer with an input size of 1024 can accept an input of this shape.
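One way to check the shapes is a minimal NumPy sketch, assuming the projector is a single linear layer that maps mm_hidden_size (1024) features into a language model with a hidden size of 4096. The random weights, the 4096 output size, and the name project are illustrative assumptions, not taken from the repository. A linear layer only constrains the last dimension of its input, so it is applied to each token independently and the number of tokens is not fixed.

```python
import numpy as np

MM_HIDDEN_SIZE = 1024   # mm_hidden_size from config.json (vision feature width)
LLM_HIDDEN_SIZE = 4096  # assumed hidden size of the 7B language model

rng = np.random.default_rng(0)
# Hypothetical weights for a single linear projector: (in_features, out_features)
weight = rng.standard_normal((MM_HIDDEN_SIZE, LLM_HIDDEN_SIZE), dtype=np.float32) * 0.02
bias = np.zeros(LLM_HIDDEN_SIZE, dtype=np.float32)

def project(tokens: np.ndarray) -> np.ndarray:
    """Map (num_tokens, MM_HIDDEN_SIZE) -> (num_tokens, LLM_HIDDEN_SIZE)."""
    # Only the feature (last) axis has to match the projector's input size.
    if tokens.shape[-1] != MM_HIDDEN_SIZE:
        raise ValueError(f"expected last dim {MM_HIDDEN_SIZE}, got {tokens.shape[-1]}")
    # The matmul contracts the last axis, so any number of tokens works.
    return tokens @ weight + bias

# e.g. 256 temporally pooled + 100 spatially pooled tokens
video_tokens = rng.standard_normal((356, MM_HIDDEN_SIZE), dtype=np.float32)
projected = project(video_tokens)
print(projected.shape)  # (356, 4096)
```

In this sketch the 1024 in the config describes the width of each token, not how many tokens the layer receives.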

jzyee changed the title from "Cannot find config file" to "Cannot understand choice of mm_hidden_size 1024" on Aug 13, 2024.