Cannot understand choice of mm_hidden_size 1024 #123

jzyee · 2024-08-13T19:12:41Z

I am trying to understand how the spatial and temporal features fit into the projection layer. According to the config file on Hugging Face, `mm_hidden_size` is set to 1024.

Hugging Face link: https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1/blob/main/config.json

From what I understand, 100 frames are sampled from each video and the CLIP encoder outputs a 1024-dimensional vector for each patch. A temporal mean (averaging over the frames) gives features of shape (number of patches, 1024), and a spatial mean (averaging over the patches of each frame) gives features of shape (100, 1024), one row per frame. Does this mean the input to the projection layer has shape (number of patches + 100, 1024)?

I don't understand how a projection layer with an input size of 1024 accepts an input of this shape.
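
The shapes described above can be checked with a small NumPy sketch. This is only an illustration under stated assumptions, not the repository's actual code: 256 patches per frame (what a ViT-L/14 encoder would give at 224x224), an output width of 4096, the temporal-then-spatial concatenation order, and the helper names `pooled_tokens` and `project` are all hypothetical. If the projection is a plain linear layer, it acts only on the last dimension, so a hidden size of 1024 would constrain the feature width and not the number of tokens.

```python
import numpy as np

# Sizes from the question: 100 frames, 1024-dim CLIP features.
# N_PATCHES and OUT_DIM are assumptions made for illustration only.
N_FRAMES, N_PATCHES, HIDDEN = 100, 256, 1024
OUT_DIM = 4096  # hypothetical width of the language model embeddings


def pooled_tokens(features: np.ndarray) -> np.ndarray:
    """Pool (frames, patches, hidden) features into (patches + frames, hidden)."""
    temporal = features.mean(axis=0)  # average over frames  -> (patches, hidden)
    spatial = features.mean(axis=1)   # average over patches -> (frames, hidden)
    # Concatenation order is an assumption; it does not affect the shape.
    return np.concatenate([temporal, spatial], axis=0)


def project(tokens: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linear projection applied independently to every token (row)."""
    return tokens @ weight + bias


rng = np.random.default_rng(0)
features = rng.random((N_FRAMES, N_PATCHES, HIDDEN), dtype=np.float32)
weight = rng.random((HIDDEN, OUT_DIM), dtype=np.float32)
bias = np.zeros(OUT_DIM, dtype=np.float32)

tokens = pooled_tokens(features)
projected = project(tokens, weight, bias)

print(tokens.shape)     # (N_PATCHES + N_FRAMES, HIDDEN)
print(projected.shape)  # (N_PATCHES + N_FRAMES, OUT_DIM)
```

Under these assumptions the projection input is a sequence of `N_PATCHES + N_FRAMES` tokens, each of width 1024, and the layer maps every token separately.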

jzyee changed the title ~~Cannot find config file~~ Cannot understand choice of mm_hidden_size 1024 Aug 13, 2024
