Discussion on training issues I have encountered #8
Comments
I did use pretrained weights from ViT. |
Hi, I also ran into the problem of low training accuracy. I train the model using pretrained weights from ViT, but the accuracy is lower than a baseline model (ViT backbone only, without time attention). I train on 4 V100 GPUs, so the batch size can be set to 4x8=32. I randomly initialize the weights of the model that are not contained in ViT (HWT position embedding, time attention). |
I got lower train/val performance too. |
@zmy1116 perhaps you could share/PR the code you used for initializing with pre-trained ViT? |
https://arxiv.org/pdf/2103.15691.pdf |
Yes, I understand that. However, initializing TimeSformer with ViT weights does require some additional code, primarily for mapping the pre-trained weight names to the TimeSformer weight names. If you've already done this, it would be nice of you to share so we can avoid reinventing the wheel. I'm doing it right now, and will post what I do if you don't want to share. |
please do |
Hm, posts an issue requesting help from others on an open-source project, but won't contribute/share code themselves. Fascinating! |
"Won't contribute" seems harsh; I did share some of my findings. I shared where the original ViT weights are, I read multiple papers, and I shared the Google paper and pointed out the section on weight initialization. I would think these count for something. I would like to think I provided more information than most people who post issues in this repo. |
@mckinziebrandon |
I'm still very new to the ML space, but would like to contribute as well. I'd be happy to contribute some documentation if we could hop on a Discord/Slack chat and discuss. Really interested in seeing how this model performs and how I can apply it to some videos. Thanks! |
With multiple rounds of changes and testing, I am able to reproduce similar (not better) results on Kinetics700_2020 with the video transformer. I did the following:
After 30 epochs I'm getting ~62% accuracy on Kinetics700_2020 with multi-view testing. My best model on this dataset (with X3D-M) was ~63.4% with multi-view testing. I don't think it's great, but I don't see any results for this dataset online. The closest public model I could find is from SenseTime's lab on K700; they get 64% with multi-view testing. So I would say that with the video transformer I can get a reasonable model, and the training time is around 30h on an 8-GPU machine, which I find very interesting. |
@zmy1116 Thanks for sharing the training configuration. I did the same thing, finding a good learning rate with warm-up. In my setting, I use lr=0.01 and warm up for 2 epochs, model 3 (TimeSformer). But on Kinetics400 I can't get reasonable results. I will test the configuration you shared. |
Also, I used the tubelet embedding as the paper suggested; I use a tubelet of size ... No, I didn't try on K400... I realized a long time ago that since the papers are evaluated on K400, I should do these experiments on K400... but it's a huge pain to download the dataset (need to launch multiple machines to download / rotate proxies so I don't get banned / verify corrupted downloads, etc.)... |
Thanks to @zmy1116 for sharing. I will share the results on Kinetics400 after testing this configuration. |
Thanks @zmy1116. I was able to load the pretrained ViT weights into TimeSformer with the following modifications.
I used the following regex mapping to go from the ViT weight names to the TimeSformer names:

```python
mapping = {
    'cls_token': 'timesformer.cls_token',
    r'patch_embed\.(.*)': r'timesformer.patch_embed.\1',
    r'blocks\.(\d+).norm1\.(.*)': r'timesformer.layers.\1.1.norm.\2',
    r'blocks\.(\d+).norm2\.(.*)': r'timesformer.layers.\1.2.norm.\2',
    r'blocks\.(\d+).attn\.qkv\.weight': r'timesformer.layers.\1.1.fn.to_qkv.weight',
    r'blocks\.(\d+).attn\.proj\.(.*)': r'timesformer.layers.\1.1.fn.to_out.0.\2',
    r'blocks\.(\d+).mlp\.fc1\.(.*)': r'timesformer.layers.\1.2.fn.net.0.\2',
    r'blocks\.(\d+).mlp\.fc2\.(.*)': r'timesformer.layers.\1.2.fn.net.3.\2',
}
```

The pretrained model I used was obtained through timm:

```python
vit_base_patch32_224_in21k = timm.create_model(
    'vit_base_patch32_224_in21k',
    pretrained=True)
```

Finally, I also tried initializing the temporal attention submodule's weights to zeros, as recommended by the ViViT paper:

```python
def zero(m):
    if hasattr(m, 'weight') and m.weight is not None:
        nn.init.zeros_(m.weight)
    if hasattr(m, 'bias') and m.bias is not None:
        nn.init.zeros_(m.bias)

for layer in self.timesformer.layers:
    prenorm_temporal_attn: nn.Module = layer[0]
    prenorm_temporal_attn.apply(zero)
```

Note: I'm using an internal framework, so a full copy/paste of my code wouldn't make sense to anyone, but the above description is everything I've tried so far. Still need to tweak/debug more, though. After 80 epochs I'm still only getting about 55% validation accuracy on Kinetics-400 (@Hanqer), compared to the 40% I was getting without using pretrained ViT weights. Also, FWIW, I am able to overfit the training data quite easily (no surprise there) and reach nearly 100 percent training accuracy with enough epochs. |
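(Editor's note: in case it helps, here is a minimal sketch of how a regex mapping like the one above could be applied to rename the timm state-dict keys before loading. The `mapping` dict is the one from the comment; `timesformer_model` is a placeholder name for the TimeSformer instance, not code from this repo. `strict=False` lets the weights that have no ViT counterpart, such as the temporal attention and time position embeddings, keep their random or zero initialization.)

```python
import re


def remap_vit_to_timesformer(vit_state_dict, mapping):
    """Rename ViT parameter names according to the regex mapping above.

    Names with no matching pattern (e.g. the classification head) are dropped.
    """
    remapped = {}
    for name, tensor in vit_state_dict.items():
        for pattern, replacement in mapping.items():
            if re.fullmatch(pattern, name):
                remapped[re.sub(pattern, replacement, name)] = tensor
                break
    return remapped


# Usage sketch (`timesformer_model` is a placeholder for the TimeSformer instance):
# vit = timm.create_model('vit_base_patch32_224_in21k', pretrained=True)
# state = remap_vit_to_timesformer(vit.state_dict(), mapping)
# missing, unexpected = timesformer_model.load_state_dict(state, strict=False)
```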
@zmy1116 have you tried using AdamW instead of SGD? I'm currently trying variants of
For the data, I'm doing the same processing as X3D_M.yaml (it appears your configs are set up like this). My typical GPU setup is 2 machines with 8 V100 GPUs each. |
I was not able to get good training with Adam/AdamW... but I have not tested many configurations because of limited resources. |
One problem I am trying to figure out is test-time inference. With X3D/SlowFast/I3D, they do the 30 crops:
Even though they train on 224x224, at inference they can still do 256x256 because it's a conv net, and with 3 crops they can cover the entire image. Now with the transformer, I train with 224x224 and we can't feed in 256x256 inputs at inference time (unless we modify the network). So for the spatial crop I can:
You wouldn't think it matters, but to my surprise the second method actually produces better results, even though we cover more space with the first method. I guess it's because during training I resize the shorter side to between 256 and 320, so the model is more used to a certain resolution? Alternatively, I'm thinking about whether I should modify the network at test time so the model can take 256x256 inputs:
This is the first time I've worked on transformers... I think intuitively it makes sense, but I'm not sure... |
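(Editor's note: for reference, a minimal sketch of the usual multi-view score aggregation discussed here, e.g. 10 temporal clips x 3 spatial crops in the SlowFast/X3D-style evaluation. `model` and the pre-cropped `views` are placeholders, not this repo's actual evaluation code.)

```python
import torch


@torch.no_grad()
def multi_view_predict(model, views):
    """Average softmax scores over all spatial/temporal views of one video.

    `views` is an iterable of clip tensors shaped (C, T, H, W); in the
    30-view scheme this would be 10 temporal clips x 3 spatial crops.
    """
    model.eval()
    scores = []
    for clip in views:
        logits = model(clip.unsqueeze(0))      # add batch dimension
        scores.append(logits.softmax(dim=-1))
    return torch.stack(scores).mean(dim=0)     # (1, num_classes)
```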
It looks like in the original vision transformer, a dropout is applied immediately after the embedding.
In this implementation we don't do that. I'm probably going down a rabbit hole... I don't think it really matters, but since I can't produce better results, I'm looking at everything that may cause the discrepancy. |
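(Editor's note: for what it's worth, in timm's `VisionTransformer` the dropout module `pos_drop` is applied right after the class token and position embedding are added, roughly `x = self.pos_drop(x + self.pos_embed)`. A small sketch to confirm the module exists; module names follow timm and the drop rate value here is arbitrary.)

```python
import timm

# Build a ViT with a non-zero dropout rate and inspect the embedding dropout.
vit = timm.create_model('vit_base_patch32_224', pretrained=False, drop_rate=0.1)
print(vit.pos_drop)  # Dropout(p=0.1, inplace=False), applied after cls token + pos. embedding
```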
Another maybe irrelevant comment: I'm starting to examine the attention weights.
Some examples of spatial attentions: [attention visualizations omitted]
Some examples of temporal attentions (lighter frames represent high-attention frames): [attention visualizations omitted] |
@zmy1116 Hi, as for the three crops you mentioned above: I think, eventually, the ConvNet also takes 224x224 input in the training phase. So could you forward the transformer with:
And for 256x256 input to the transformer, you actually need to modify the model, especially the positional embedding (it needs interpolation to fit the different number of tokens). |
I tried to run the transformer trained on 224x224 with 256x256 inputs by extending the number of positional embeddings (so instead of 14x14 tokens I now have 16x16, i.e. one extra ring of positional embeddings around the edge, which I interpolate with nearest-neighbor values). The performance is slightly worse (-0.5%) and the computation time increased a lot. |
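(Editor's note: a sketch of the kind of positional-embedding resizing being discussed, interpolating the 14x14 grid of spatial position embeddings to 16x16. Bicubic is used here, though bilinear or nearest-neighbor, as above, are also possible; the function name and defaults are illustrative, not from this repo.)

```python
import torch
import torch.nn.functional as F


def resize_spatial_pos_embed(pos_embed, old_grid=14, new_grid=16):
    """Interpolate a (1, 1 + old_grid**2, D) position embedding to a new grid size.

    The class-token embedding (index 0) is kept as-is; only the spatial grid
    of position embeddings is interpolated.
    """
    cls_pos, grid_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = grid_pos.shape[-1]
    grid_pos = grid_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid_pos = F.interpolate(grid_pos, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
    grid_pos = grid_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, grid_pos], dim=1)   # (1, 1 + new_grid**2, D)
```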
Interesting attention plots @zmy1116. I've been tracking plots of gradient norms and the associated weight histograms, and I've observed something odd that appears to be caused by zeroing out the temporal attention block during initialization (as recommended for ViViT model 3): the gradients are extremely low and the weights simply do not budge from their initialized values. For example, here is the histogram (x-axis is time) of the layer norm and qkv weights: [weight histograms omitted] I've also tried initializing only the attention weights themselves with zeros (qkv and to_out), but that does not resolve the issue. I'm now able to get quite good validation accuracy on Kinetics-400 compared to before (~70%), but it appears the model is basically ignoring the temporal attention block. Have you looked at your weight/gradient histograms @zmy1116? Also note that my 70% validation accuracy is single-crop, with no aggregation. |
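(Editor's note: if anyone wants to reproduce this kind of check, here is a small sketch for dumping per-parameter gradient norms right after `loss.backward()`. The `'temporal'` substring filter is only an assumption about how the temporal-attention parameters are named in a given implementation.)

```python
def grad_norms(model, substring='temporal'):
    """Return gradient L2 norms for parameters whose name contains `substring`."""
    norms = {}
    for name, p in model.named_parameters():
        if substring in name and p.grad is not None:
            norms[name] = p.grad.detach().norm().item()
    return norms


# Example: call right after loss.backward(), before optimizer.step():
# print(grad_norms(model, substring='temporal'))
```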
For moving to 256x256, have you seen this snippet in the timm repo for ViT @zmy1116? https://github.com/rwightman/pytorch-image-models/blob/779107b693010934ac87c8cecbeb65796e218488/timm/models/vision_transformer.py#L386 |
@mckinziebrandon, ehmmmmm, below I list, per weight, the distribution (in absolute value) of the temporal-related weights... Indeed, they are definitely smaller than the spatial-related weights... |
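(Editor's note: for anyone who wants the same statistic, a compact sketch of per-parameter mean absolute weight values; again, the `'temporal'`/`'spatial'` name filters are assumptions about how the modules are named.)

```python
def mean_abs_weights(model, keyword):
    """Mean absolute value of each parameter whose name contains `keyword`."""
    return {name: p.detach().abs().mean().item()
            for name, p in model.named_parameters() if keyword in name}


# e.g. compare mean_abs_weights(model, 'temporal') vs mean_abs_weights(model, 'spatial')
```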
Ah, is this TimeSformer or model 2 of ViViT? I'm confused by your naming distinction of "temporal_encoder_layers" and "spatial_encoder_layers". Mine is just the single TimeSformer model, and the |
This is model 2, as this is the one I was able to get good results with... I have never succeeded with TimeSformer. |
Gotcha. Well in any case, interesting that we're observing something similar for both models. Hmmm. |
For anybody who's interested, you can find the official implementation of the paper in this repo: https://github.com/facebookresearch/TimeSformer |
num_frames = 32 — Could you tell me what the relationship between sample_rate and target_fps is? What is the sampling strategy? I am very confused about the specific sampling process. Why not apply the sampling strategy from TSN (Temporal Segment Networks)? |
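(Editor's note: not an authoritative answer, but in PySlowFast-style data configs the usual convention is that `num_frames` frames are taken every `sample_rate` frames from a video resampled to `target_fps`, so a clip spans roughly num_frames * sample_rate / target_fps seconds. A toy sketch under that assumption; the sample_rate and target_fps values below are made up.)

```python
num_frames = 32
sample_rate = 2   # take every 2nd frame (assumed value)
target_fps = 30   # frame rate the video is resampled to (assumed value)

clip_seconds = num_frames * sample_rate / target_fps            # ~2.13 s of video per clip
frame_indices = [i * sample_rate for i in range(num_frames)]    # indices within the resampled clip
```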
Thank you for the implementation of the paper. This is the first time I'm dealing with a transformer model. I tried to train on the Kinetics700 dataset using this model, and I just want to share some of the issues I have encountered:
The paper suggests that the model works better with pretrained weights. Although this is a direct extension of the image transformer and most of the vision transformer's weights should apply directly, there are 2 places that are different:
Since it is the first time I'm dealing with a Transformer, I wanted to reproduce what the paper claimed, so I started with the "original" basic vision transformer setup:
With this setup, on a V100 GPU we can only squeeze in 4 videos (4x8x3x224x224) for training, even with torch.amp. This means that if I'm doing an experiment on a p3x8 machine with 4 V100 GPUs (~$12/h normally), it would take 39 days to do 300 epochs. Of course it may not need to train for 300 epochs, but intuitively, training with batch size = 16 is usually not optimal. So alternatively, I tried a new model with 6 heads and 8 blocks. Now I can fit 16 videos per GPU, so in total batch size = 64. The model started to train smoothly, but then the training error increased after 7-8 epochs. The training accuracy peaked around 55% and I didn't even bother to run validation because I know it's clearly not working. Below is the relevant configuration I was using:
So these are the issues I have encountered for now. I want to share them because hopefully some of you are actually working with video models and maybe we can have a discussion. I think my next thing to try is probably to increase the depth.
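(Editor's note: since torch.amp came up above, here is a minimal mixed-precision training-step sketch using standard torch.cuda.amp; `model`, `loader`, `optimizer`, `criterion`, and `scaler` are generic placeholders, not this repo's actual training loop.)

```python
import torch


def train_one_epoch_amp(model, loader, optimizer, criterion, scaler):
    """One epoch of mixed-precision training with torch.cuda.amp."""
    model.train()
    for clips, labels in loader:                   # clips: (B, C, T, H, W)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            logits = model(clips.cuda(non_blocking=True))
            loss = criterion(logits, labels.cuda(non_blocking=True))
        scaler.scale(loss).backward()              # scale loss to avoid fp16 underflow
        scaler.step(optimizer)                     # unscales gradients, then steps
        scaler.update()


# scaler = torch.cuda.amp.GradScaler()  # create once, reuse across epochs
```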
Regards