diff --git a/MODEL_ZOO.md b/MODEL_ZOO.md
index 9426dfa..fe8f014 100644
--- a/MODEL_ZOO.md
+++ b/MODEL_ZOO.md
@@ -1,25 +1,41 @@
 # VideoMAE Model Zoo
 
+**Note**
+
+- `#Frame = #input_frame x #clip x #crop` (e.g., 16x5x3 means 16 input frames per clip, 5 temporal clips, and 3 spatial crops, i.e., 15 views per video at test time).
+- `#input_frame` means how many frames are fed into the model during the test phase.
+- `#crop` means **spatial crops** (e.g., 3 for left/right/center crops).
+- `#clip` means **temporal clips** (e.g., 5 means temporally sampling five clips with different start indices).
+
+The official PyTorch VideoMAE results are obtained by fine-tuning with I3D dense sampling on Kinetics-400 and TSN uniform sampling on Something-Something V2, respectively. An illustrative sketch of this multi-view evaluation is given at the end of this file.
+
 ### Kinetics-400
 
-| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
-| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
-| VideoMAE | ***no*** | ViT-S | 1600 | 16x5x3 | ? | ? | 79.0 | 93.8 |
-| VideoMAE | ***no*** | ViT-B | 1600 | 16x5x3 | ? | ? | 81.5 | 95.1 |
-| VideoMAE | ***no*** | ViT-L | 1600 | 16x5x3 | ? | ? | 85.2 | 96.8 |
-| VideoMAE | ***no*** | ViT-H | 1600 | 16x5x3 | ? | ? | 86.6 | 97.1 |
+For Kinetics-400, VideoMAE is pre-trained for **1600** epochs without **any extra data**.
+
+| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
+| :------: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
+| ViT-S | 16x5x3 | checkpoint | checkpoint | 79.0 | 93.8 |
+| ViT-B | 16x5x3 | checkpoint | checkpoint | 81.5 | 95.1 |
+| ViT-L | 16x5x3 | checkpoint | checkpoint | 85.2 | 96.8 |
+| ViT-H | 16x5x3 | checkpoint | checkpoint | 86.6 | 97.1 |
 
 ### Something-Something V2
 
-| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
-| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
-| VideoMAE | ***no*** | ViT-S | 2400 | 16x2x3 | ? | ? | 66.8 | 90.3 |
-| VideoMAE | ***no*** | ViT-B | 2400 | 16x2x3 | ? | ? | 70.8 | 92.4 |
+For Something-Something V2, VideoMAE is pre-trained for **2400** epochs without **any extra data**.
+
+| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
+| :------: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
+| ViT-S | 16x2x3 | ? | ? | 66.8 | 90.3 |
+| ViT-B | 16x2x3 | ? | ? | 70.8 | 92.4 |
 
 ### UCF101
 
-| Method | Extra Data | Backbone | Epoch | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
-| :------: | :--------: | :------: | :---: | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: | :---: | :---: |
-| VideoMAE | ***no*** | ViT-B | 3200 | 16x5x3 | ? | ? | 91.3 | 98.5 |
+For UCF101, VideoMAE is pre-trained for **3200** epochs without **any extra data**.
+
+| Backbone | \#Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
+| :---: | :-----: | :----: | :----: | :---: | :---: |
+| ViT-B | 16x5x3 | ? | ? | 91.3 | 98.5 |
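+
+### Multi-view evaluation (illustrative sketch)
+
+The snippet below is only a sketch of how the `#clip x #crop` views of a single video can be aggregated at test time; it is not the evaluation code of this repository. `multi_view_predict` and the dummy classifier are hypothetical names used for illustration, and `model` stands for any video classifier that maps a `(N, C, T, H, W)` tensor to class logits.
+
+```python
+import torch
+
+
+@torch.no_grad()
+def multi_view_predict(model, views):
+    """Average predictions over all temporal clips x spatial crops of one video.
+
+    views: tensor of shape (num_clips * num_crops, C, T, H, W),
+           e.g. 5 clips x 3 crops = 15 views of 16 frames each for 16x5x3.
+    """
+    model.eval()
+    logits = model(views)                  # (num_views, num_classes)
+    probs = torch.softmax(logits, dim=-1)  # per-view class probabilities
+    return probs.mean(dim=0)               # average over all views
+
+
+if __name__ == "__main__":
+    num_clips, num_crops, num_classes = 5, 3, 400
+    # 15 views of one video: 3 channels, 16 frames, 224x224 crops.
+    views = torch.randn(num_clips * num_crops, 3, 16, 224, 224)
+    # Tiny stand-in classifier, NOT a real VideoMAE backbone.
+    dummy_model = torch.nn.Sequential(
+        torch.nn.AdaptiveAvgPool3d(1),  # collapse T, H, W -> (N, 3, 1, 1, 1)
+        torch.nn.Flatten(),             # -> (N, 3)
+        torch.nn.Linear(3, num_classes),
+    )
+    final_probs = multi_view_predict(dummy_model, views)
+    print(final_probs.argmax().item())  # predicted class index
+```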