I got this question via email and am reposting it here with the author's permission.
Thanks for your question.
If you open the feature archives, you will observe that all features have different lengths because the videos are of different durations. The features are (Tv × 1024) for I3D and (Ta × 256) for VGGish, so 1024-d and 256-d are not the temporal dimensions.

During proposal-generation training we need full videos. Therefore, the features are either used as-is or trimmed to `max_feature_len` (800 for audio and 300 for visual, which is temporally equivalent).
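For illustration, here is a minimal sketch of that trimming step. The names (`trim_to_max_len`, `MAX_FEATURE_LEN`) are hypothetical, not the repo's identifiers; the sketch only assumes the features arrive as NumPy arrays of shape (T, d):

```python
import numpy as np

# Assumed limits from the reply: 800 audio steps and 300 visual steps
# cover roughly the same amount of video time.
MAX_FEATURE_LEN = {'audio': 800, 'visual': 300}

def trim_to_max_len(stack: np.ndarray, modality: str) -> np.ndarray:
    """Return the stack as-is if it is short enough; otherwise keep
    only the first max_feature_len time steps."""
    return stack[:MAX_FEATURE_LEN[modality]]  # no-op when T <= limit

# Example: a video with Tv = 450 visual steps of 1024-d I3D features
visual = np.random.rand(450, 1024).astype(np.float32)
print(trim_to_max_len(visual, 'visual').shape)  # (300, 1024)
```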
During the training of the captioning module, by contrast, we trim the set of features according to the ground-truth time step. This pipeline is defined here: BMT/datasets/load_features.py, line 46 in 18b5ee1.

To form a batch from features of different lengths, see BMT/datasets/captioning_dataset.py, lines 256 to 258 in 18b5ee1.
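The ground-truth trimming can be pictured as below. This is a sketch, not the code at load_features.py line 46; it assumes the features are sampled uniformly over the video, so a segment's start/end seconds map to feature indices proportionally:

```python
import math
import numpy as np

def trim_to_segment(stack, start_sec, end_sec, duration_sec):
    """Keep only the feature rows that fall inside the ground-truth
    [start_sec, end_sec] segment of a duration_sec-long video."""
    T = len(stack)
    start_idx = int(T * start_sec / duration_sec)
    end_idx = math.ceil(T * end_sec / duration_sec)
    return stack[start_idx:end_idx]

# Example: 100 audio steps (Ta x 256) over a 96-second video; the
# caption's ground-truth segment is 10.0-20.0 s.
audio = np.random.rand(100, 256).astype(np.float32)
print(trim_to_segment(audio, 10.0, 20.0, 96.0).shape)  # (11, 256)
```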
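Since the trimmed stacks still have different lengths, batching requires padding. The actual collation at captioning_dataset.py lines 256 to 258 may differ; here is a sketch using PyTorch's `pad_sequence`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three clips with different numbers of visual steps, all 1024-d
lengths = torch.tensor([120, 45, 300])
feats = [torch.rand(int(T), 1024) for T in lengths]

# Zero-pad along the temporal axis: (batch, max_T, 1024)
batch = pad_sequence(feats, batch_first=True, padding_value=0.0)

# Boolean mask marking real (non-padded) positions
mask = torch.arange(batch.size(1))[None, :] < lengths[:, None]

print(batch.shape)  # torch.Size([3, 300, 1024])
print(mask.shape)   # torch.Size([3, 300])
```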