diff --git a/MODEL_ZOO.md b/MODEL_ZOO.md
index 9426dfa..fe8f014 100644
--- a/MODEL_ZOO.md
+++ b/MODEL_ZOO.md
@@ -1,25 +1,41 @@
 # VideoMAE Model Zoo
 
+**Note**
+
+- `#Frame = #input_frame x #clip x #crop` (e.g., 16x5x3 means 16 input frames per clip, 5 temporal clips, and 3 spatial crops, i.e., 15 views per video at test time).
+- `#input_frame` means how many frames are fed into the model during the test phase.
+- `#crop` means **spatial crops** (e.g., 3 for left/right/center crops).
+- `#clip` means **temporal clips** (e.g., 5 means temporally sampling five clips with different start indices).
+
+The official PyTorch VideoMAE results are obtained by fine-tuning with I3D dense sampling on Kinetics-400 and TSN uniform sampling on Something-Something V2, respectively. An illustrative sketch of this multi-view evaluation is given at the end of this file.
+
 ### Kinetics-400
 
-| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
-| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
-| VideoMAE | ***no*** | ViT-S | 1600 | 16x5x3 | ? | ? | 79.0 | 93.8 |
-| VideoMAE | ***no*** | ViT-B | 1600 | 16x5x3 | ? | ? | 81.5 | 95.1 |
-| VideoMAE | ***no*** | ViT-L | 1600 | 16x5x3 | ? | ? | 85.2 | 96.8 |
-| VideoMAE | ***no*** | ViT-H | 1600 | 16x5x3 | ? | ? | 86.6 | 97.1 |
+For Kinetics-400, VideoMAE is pre-trained for **1600** epochs without **any extra data**.
+
+| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
+| :------: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
+| ViT-S | 16x5x3 | checkpoint | checkpoint | 79.0 | 93.8 |
+| ViT-B | 16x5x3 | checkpoint | checkpoint | 81.5 | 95.1 |
+| ViT-L | 16x5x3 | checkpoint | checkpoint | 85.2 | 96.8 |
+| ViT-H | 16x5x3 | checkpoint | checkpoint | 86.6 | 97.1 |
 
 ### Something-Something V2
 
-| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
-| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
-| VideoMAE | ***no*** | ViT-S | 2400 | 16x2x3 | ? | ? | 66.8 | 90.3 |
-| VideoMAE | ***no*** | ViT-B | 2400 | 16x2x3 | ? | ? | 70.8 | 92.4 |
+For Something-Something V2, VideoMAE is pre-trained for **2400** epochs without **any extra data**.
+
+| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
+| :------: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
+| ViT-S | 16x2x3 | ? | ? | 66.8 | 90.3 |
+| ViT-B | 16x2x3 | ? | ? | 70.8 | 92.4 |
 
 ### UCF101
 
-| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
-| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
-| VideoMAE | ***no*** | ViT-B | 3200 | 16x5x3 | ? | ? | 91.3 | 98.5 |
+For UCF101, VideoMAE is pre-trained for **3200** epochs without **any extra data**.
+
+| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
+| :---: | :-----: | :----: | :----: | :---: | :---: |
+| ViT-B | 16x5x3 | ? | ? | 91.3 | 98.5 |
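+
+### Multi-view evaluation (illustrative sketch)
+
+The snippet below is only a sketch of how the `#clip x #crop` views of a single video can be aggregated at test time; it is not the evaluation code of this repository. `multi_view_predict` and the dummy classifier are hypothetical names used for illustration, and `model` stands for any video classifier that maps a `(N, C, T, H, W)` tensor to class logits.
+
+```python
+import torch
+
+
+@torch.no_grad()
+def multi_view_predict(model, views):
+    """Average predictions over all temporal clips x spatial crops of one video.
+
+    views: tensor of shape (num_clips * num_crops, C, T, H, W),
+           e.g. 5 clips x 3 crops = 15 views of 16 frames each for 16x5x3.
+    """
+    model.eval()
+    logits = model(views)                  # (num_views, num_classes)
+    probs = torch.softmax(logits, dim=-1)  # per-view class probabilities
+    return probs.mean(dim=0)               # average over all views
+
+
+if __name__ == "__main__":
+    num_clips, num_crops, num_classes = 5, 3, 400
+    # 15 views of one video: 3 channels, 16 frames, 224x224 crops.
+    views = torch.randn(num_clips * num_crops, 3, 16, 224, 224)
+    # Tiny stand-in classifier, NOT a real VideoMAE backbone.
+    dummy_model = torch.nn.Sequential(
+        torch.nn.AdaptiveAvgPool3d(1),  # collapse T, H, W -> (N, 3, 1, 1, 1)
+        torch.nn.Flatten(),             # -> (N, 3)
+        torch.nn.Linear(3, num_classes),
+    )
+    final_probs = multi_view_predict(dummy_model, views)
+    print(final_probs.argmax().item())  # predicted class index
+```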