I got this question via email and am reposting it here with the author's permission.
Thanks for your question.
If you open the feature archives, you will observe that all features have different lengths because the videos are of different durations. The features are (Tv × 1024) for I3D and (Ta × 256) for VGGish, so 1024-d and 256-d are not the temporal dimensions.

During proposal-generation training we need full videos. Therefore, the features are either used as-is or trimmed to `max_feature_len` (800 for audio and 300 for visual, which is temporally equivalent).
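For illustration, here is a minimal sketch of that trimming step. The names (`trim_to_max_len`, `MAX_FEATURE_LEN`) are hypothetical, not the repo's identifiers; the sketch only assumes the features arrive as NumPy arrays of shape (T, d):

```python
import numpy as np

# Assumed limits from the reply: 800 audio steps and 300 visual steps
# cover roughly the same amount of video time.
MAX_FEATURE_LEN = {'audio': 800, 'visual': 300}

def trim_to_max_len(stack: np.ndarray, modality: str) -> np.ndarray:
    """Return the stack as-is if it is short enough; otherwise keep
    only the first max_feature_len time steps."""
    return stack[:MAX_FEATURE_LEN[modality]]  # no-op when T <= limit

# Example: a video with Tv = 450 visual steps of 1024-d I3D features
visual = np.random.rand(450, 1024).astype(np.float32)
print(trim_to_max_len(visual, 'visual').shape)  # (300, 1024)
```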
During the training of the captioning module, by contrast, we trim the set of features according to the ground-truth time step. This pipeline is defined here: BMT/datasets/load_features.py, line 46 in 18b5ee1.

To form a batch from features of different lengths, see BMT/datasets/captioning_dataset.py, lines 256 to 258 in 18b5ee1.
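The ground-truth trimming can be pictured as below. This is a sketch, not the code at load_features.py line 46; it assumes the features are sampled uniformly over the video, so a segment's start/end seconds map to feature indices proportionally:

```python
import math
import numpy as np

def trim_to_segment(stack, start_sec, end_sec, duration_sec):
    """Keep only the feature rows that fall inside the ground-truth
    [start_sec, end_sec] segment of a duration_sec-long video."""
    T = len(stack)
    start_idx = int(T * start_sec / duration_sec)
    end_idx = math.ceil(T * end_sec / duration_sec)
    return stack[start_idx:end_idx]

# Example: 100 audio steps (Ta x 256) over a 96-second video; the
# caption's ground-truth segment is 10.0-20.0 s.
audio = np.random.rand(100, 256).astype(np.float32)
print(trim_to_segment(audio, 10.0, 20.0, 96.0).shape)  # (11, 256)
```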
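Since the trimmed stacks still have different lengths, batching requires padding. The actual collation at captioning_dataset.py lines 256 to 258 may differ; here is a sketch using PyTorch's `pad_sequence`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three clips with different numbers of visual steps, all 1024-d
lengths = torch.tensor([120, 45, 300])
feats = [torch.rand(int(T), 1024) for T in lengths]

# Zero-pad along the temporal axis: (batch, max_T, 1024)
batch = pad_sequence(feats, batch_first=True, padding_value=0.0)

# Boolean mask marking real (non-padded) positions
mask = torch.arange(batch.size(1))[None, :] < lengths[:, None]

print(batch.shape)  # torch.Size([3, 300, 1024])
print(mask.shape)   # torch.Size([3, 300])
```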