You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Trying to understand how the spatial and temporal features fit into the projection layer. Based on the config file used to assign the mm.hidden_size on huggingface, it is 1024.
From what I understand, the frames are sampled at 100 frames and the clip encoder outputs a vector of 1024. A temporal mean will result in a vector of (number of patches, 1024) and a spatial mean of each frame will result in a (100(vector which size is the number of frames), 1024) does this mean the input shape of the projection layer is (num of patches + 100, 1024)?
I don't understand how the projection layer of 1024 accepts this size
The text was updated successfully, but these errors were encountered:
jzyee
changed the title
Cannot find config file
Cannot understand choice of mm_hidden_size 1024
Aug 13, 2024
Trying to understand how the spatial and temporal features fit into the projection layer. Based on the config file used to assign the mm.hidden_size on huggingface, it is 1024.
huggingface link: https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1/blob/main/config.json
From what I understand, the frames are sampled at 100 frames and the clip encoder outputs a vector of 1024. A temporal mean will result in a vector of (number of patches, 1024) and a spatial mean of each frame will result in a (100(vector which size is the number of frames), 1024) does this mean the input shape of the projection layer is (num of patches + 100, 1024)?
I don't understand how the projection layer of 1024 accepts this size
The text was updated successfully, but these errors were encountered: