update model_zoo: add info
innat committed Oct 6, 2023
1 parent 794e772 commit 5a6b500
Showing 1 changed file with 29 additions and 13 deletions.
42 changes: 29 additions & 13 deletions MODEL_ZOO.md
@@ -1,25 +1,41 @@
# VideoMAE Model Zoo

**Note**

- `#Frame = #input_frame x #clip x #crop`.
- `#input_frame` is the number of frames fed to the model per clip during the test phase.
- `#crop` is the number of **spatial crops** (e.g., 3 for left/right/center crops).
- `#clip` is the number of **temporal clips** (e.g., 5 means five clips are sampled repeatedly with different start indices).
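
The `#Frame` notation above can be unpacked with a few lines of Python. This is only an illustrative sketch: `FrameSpec` and `parse_frame_spec` are hypothetical helper names, not part of the VideoMAE code base.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for a '#Frame' entry such as '16x5x3'."""

    input_frames: int  # frames fed to the model per clip
    clips: int         # temporal clips per video
    crops: int         # spatial crops per clip

    @property
    def views(self) -> int:
        # Number of forward passes per video at test time.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # '#Frame = #input_frame x #clip x #crop'
        return self.input_frames * self.clips * self.crops


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string like '16x5x3' into its three factors."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected 'AxBxC', got {spec!r}")
    return FrameSpec(*(int(p) for p in parts))


k400 = parse_frame_spec("16x5x3")
print(k400.views, k400.total_frames)  # 15 240
```

For example, `16x2x3` corresponds to 6 views and 96 frames per video.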

The tables below list the official results of the torch VideoMAE models fine-tuned with I3D dense sampling on Kinetics-400 and with TSN uniform sampling on Something-Something V2.
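
The difference between I3D dense sampling and TSN uniform sampling can be sketched as frame-index selection. The code below is a rough illustration under stated assumptions, not the VideoMAE data loader: the function names, the `stride=4` default, and the evenly spaced clip start positions are all hypothetical choices made for the example.

```python
def dense_clip_indices(num_video_frames, num_frames=16, stride=4, num_clips=5):
    """I3D-style dense sampling (sketch).

    Each clip takes `num_frames` frames spaced `stride` apart, so it covers
    a short contiguous window. Clip start positions are spread evenly over
    the video (an assumption made for this example).
    """
    span = (num_frames - 1) * stride + 1           # frames covered by one clip
    max_start = max(num_video_frames - span, 0)    # last valid start index
    clips = []
    for c in range(num_clips):
        if num_clips == 1:
            start = max_start // 2                 # single clip: centre it
        else:
            start = round(c * max_start / (num_clips - 1))
        # Clamp indices so short videos repeat their last frame.
        clips.append([min(start + i * stride, num_video_frames - 1)
                      for i in range(num_frames)])
    return clips


def uniform_segment_indices(num_video_frames, num_frames=16):
    """TSN-style uniform sampling (sketch).

    The video is split into `num_frames` equal segments and the centre
    frame of each segment is taken, so one clip spans the whole video.
    """
    seg = num_video_frames / num_frames
    return [min(int(seg * (i + 0.5)), num_video_frames - 1)
            for i in range(num_frames)]


dense = dense_clip_indices(300)
uniform = uniform_segment_indices(160)
print(dense[0][:3], dense[-1][-1])  # [0, 4, 8] 299
print(uniform[:3])                  # [5, 15, 25]
```

Dense sampling keeps fine temporal detail within a short window, which is why several clips per video are used, whereas uniform sampling covers the whole video with a single clip. How multiple temporal clips are offset under uniform sampling is not modelled here.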


### Kinetics-400

| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
| VideoMAE | ***no*** | ViT-S | 1600 | 16x5x3 | ? | ? | 79.0 | 93.8 |
| VideoMAE | ***no*** | ViT-B | 1600 | 16x5x3 | ? | ? | 81.5 | 95.1 |
| VideoMAE | ***no*** | ViT-L | 1600 | 16x5x3 | ? | ? | 85.2 | 96.8 |
| VideoMAE | ***no*** | ViT-H | 1600 | 16x5x3 | ? | ? | 86.6 | 97.1 |
For Kinetics-400, VideoMAE is trained for around **1600** epochs without **any extra data**.

| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| :------: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
| ViT-S | 16x5x3 | checkpoint | checkpoint | 79.0 | 93.8 |
| ViT-B | 16x5x3 | checkpoint | checkpoint | 81.5 | 95.1 |
| ViT-L | 16x5x3 | checkpoint | checkpoint | 85.2 | 96.8 |
| ViT-H | 16x5x3 | checkpoint | checkpoint | 86.6 | 97.1 |


### Something-Something V2

| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
| VideoMAE | ***no*** | ViT-S | 2400 | 16x2x3 | ? | ? | 66.8 | 90.3 |
| VideoMAE | ***no*** | ViT-B | 2400 | 16x2x3 | ? | ? | 70.8 | 92.4 |
For SSv2, VideoMAE is trained for around **2400** epochs without **any extra data**.

| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| :------: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
| ViT-S | 16x2x3 | ? | ? | 66.8 | 90.3 |
| ViT-B | 16x2x3 | ? | ? | 70.8 | 92.4 |


### UCF101

| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
| VideoMAE | ***no*** | ViT-B | 3200 | 16x5x3 | ? | ? | 91.3 | 98.5 |
For UCF101, VideoMAE is trained for around **3200** epochs without **any extra data**.

| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| :---: | :-----: | :----: | :----: | :---: | :---: |
| ViT-B | 16x5x3 | ? | ? | 91.3 | 98.5 |
