Dafeng Wei, Ye Tian, Liqing Wei, Hong Zhong, Siqian Chen, Shiliang Pu, Hongtao Lu (Corresponding authors)
Video data differ from static images mainly in the temporal dimension. Many video action recognition networks therefore adopt two-stream models that learn spatial and temporal information separately and fuse them to further improve performance. We propose a cross-modality dual attention fusion module, named CMDA, to explicitly exchange spatial-temporal information between the two pathways of the two-stream SlowFast networks. In addition, considering the computational complexity of these heavy models and the low accuracy of existing lightweight models, we propose several efficient two-stream SlowFast networks built on well-designed efficient 2D networks such as GhostNet and ShuffleNetV2. Experiments demonstrate that our fusion module CMDA improves the performance of SlowFast, and that our efficient two-stream models achieve a consistent increase in accuracy with little overhead in FLOPs. Our code and pre-trained models will be made available at https://github.com/weidafeng/Efficient-SlowFast
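The exact CMDA definition is in the paper and code; purely as an illustration of the idea, here is a minimal PyTorch sketch in which each pathway is re-weighted by a channel-attention signal computed from the other before fusion. All names and shapes below are hypothetical, not the repository's implementation.

```python
import torch
import torch.nn as nn

class DualAttentionFusionSketch(nn.Module):
    """Illustrative only (not the exact CMDA module): each pathway is
    gated by channel attention derived from the other pathway."""

    def __init__(self, slow_channels, fast_channels):
        super().__init__()
        self.fast_to_slow = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                               # squeeze fast features
            nn.Conv3d(fast_channels, slow_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.slow_to_fast = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                               # squeeze slow features
            nn.Conv3d(slow_channels, fast_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, slow, fast):
        # slow: (N, C_s, T_s, H, W), fast: (N, C_f, T_f, H, W)
        slow_out = slow * self.fast_to_slow(fast)  # temporal cues gate the spatial path
        fast_out = fast * self.slow_to_fast(slow)  # spatial cues gate the temporal path
        return slow_out, fast_out

# Shapes mimicking SlowFast with ALPHA=8, BETA_INV=8:
slow, fast = torch.randn(2, 64, 4, 56, 56), torch.randn(2, 8, 32, 56, 56)
s, f = DualAttentionFusionSketch(64, 8)(slow, fast)
print(s.shape, f.shape)  # shapes unchanged; only the features are re-weighted
```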
Please note that the codebase is rather rough; I have been particularly busy with work recently and will refactor the code when I have time. If you have any questions, please feel free to contact me.
You can follow the official installation tutorials of SlowFast and Efficient-3DCNNs, or use the one-click installation script I have prepared, wdf_install_slowfast.sh (recommended).
Here I list the directory structure of my environment; you can either place files in the same layout or modify the installation script yourself.
$ tree -L 1 /data1 # my root directory
/data1
├── config_slowfast # packages used to install slowfast
├── Efficient-3DCNNs # efficient 3d baseline
└── SlowFast_vis_0709 # slowfast main library
$ tree -L 2 /data1/SlowFast_vis_0709/ # root directory of the SlowFast
/data1/SlowFast_vis_0709/
├── SlowFast
├── build
├── CODE_OF_CONDUCT.md
├── configs # configs of each model, include Jester and Kinetics
├── CONTRIBUTING.md
├── demo # video demo, 1) input a video, 2) select a model, 3) predict and output a result video
├── GETTING_STARTED.md
├── INSTALL.md # official install tutorial
├── LICENSE
├── linter.sh
├── MODEL_ZOO.md
├── projects
├── README.md
├── setup.cfg
├── setup.py
├── slowfast # main code
├── slowfast.egg-info
├── tools
├── wdf_all_run_scripts # scripts used to train on AI-PLATFORM
├── wdf_install_slowfast.sh # wdf's install script (recommended)
└── wdf_visualization # grad-cam visualization
$ tree -L 1 /data1/Efficient-3DCNNs/ # root directory of Efficient-3DCNNs (baseline)
/data1/Efficient-3DCNNs/
├── annotation_Jester
├── annotation_Kinetics # here I provide the annotations of kinetics-400 (not kinetics-600)
├── annotation_UCF101
├── calculate_FLOP.py
├── dataset.py
├── datasets
├── LICENSE
├── main.py
├── mean.py
├── model.py
├── models
├── opts.py
├── __pycache__
├── README.md
├── results-mobilenetv2-w1 # wdf trained models on kinetics-400
├── results-shufflenetv2-w025 # wdf trained models on kinetics-400
├── results-shufflenet-w2 # wdf trained models on kinetics-400
├── results-shufflev2-w1 # wdf trained models on kinetics-400
├── results-shufflev2-w2 # wdf trained models on kinetics-400
├── run-jester.sh
├── run-kinetics.sh
├── script # scripts used to train (recommended reading)
├── spatial_transforms.py
├── speed_gpu.py
├── target_transforms.py
├── temporal_transforms.py
├── test_models.py
├── test.py
├── thop
├── train.py
├── utils
├── utils.py
└── validation.py
Then just run:
$ bash wdf_install_slowfast.sh
We released both our pre-trained models and the baseline models compared in the paper.
Model Name | Hyper-Parameters | Checkpoints |
---|---|---|
SlowFastDualAttention | Same as SlowFast, including ALPHA, BETA_INV, etc. | BaiduYun (Password: kqqd) |
SlowFastShuffleNet | Width=[1.0, 1.5, 2.0], Groups=[1, 3] | BaiduYun (Password: kqqd) |
SlowFastShuffleNetV2 | Width=[0.25, 0.5, 1.0, 1.5, 2.0] | BaiduYun (Password: kqqd) |
SlowFastMobileNetV2 | Width=[0.5, 0.7, 1.0, 2.0] | BaiduYun (Password: kqqd) |
SlowFastGhostNet | Width=[1.0, 1.5, 2.0] | BaiduYun (Password: kqqd) |
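The released `.pyth` files are ordinary PyTorch checkpoints; to sanity-check a download before training or testing, you can inspect one directly. The `model_state` key below is how the official SlowFast codebase saves weights, but the fallback covers bare state dicts:

```python
import torch

# Example filename; point this at a checkpoint downloaded from the table above.
ckpt = torch.load("checkpoint_epoch_00100.pyth", map_location="cpu")
print(ckpt.keys())  # typically: epoch, model_state, optimizer_state, cfg

state = ckpt.get("model_state", ckpt)  # fall back if the file is a bare state dict
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```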
- Video classification:
- Kinetics-400
- Jester-20bn-v1
- Video detection: (untested, because we do not have these datasets)
- AVA
- Charades
- build the model
- prepare the dataset (specify the video paths and labels; there is no need to extract and store frames in advance; see the annotation sketch after this list)
- train or test
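For reference, the SlowFast Kinetics loader reads plain text annotation files (train.csv, val.csv, test.csv) where, by default, each line holds a video path and an integer label separated by a space. A minimal sketch for generating them; the directory layout and file names here are assumptions:

```python
import os

# Hypothetical layout: /data/kinetics/train/<class_name>/<video>.mp4
root = "/data/kinetics/train"
classes = sorted(os.listdir(root))                      # class name -> integer label
label_of = {name: idx for idx, name in enumerate(classes)}

with open("train.csv", "w") as f:
    for name in classes:
        class_dir = os.path.join(root, name)
        for video in sorted(os.listdir(class_dir)):
            # Each line: "<path_to_video> <label>", separated by a space.
            f.write(f"{os.path.join(class_dir, video)} {label_of[name]}\n")
```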
To train our efficient dual-attention SlowFast networks, taking SlowFastShuffleNetV2 as an example, you only need to provide a config YAML file:
/data1/SlowFast_vis_0709/SlowFast$ python tools/run_net.py --cfg configs/Kinetics/SLOWFAST_SHUFFLENETV2_8x8_R50_stepwise_multigrid.yaml
The config YAML file specifies all hyper-parameters. To test a pre-trained model, override the relevant options on the command line:
/data1/SlowFast_vis_0709/SlowFast$ python tools/run_net.py --cfg configs/Kinetics/SLOWFAST_SHUFFLENETV2_8x8_R50_stepwise_multigrid.yaml \
TRAIN.ENABLE False \
TEST.ENABLE True \
TEST.CHECKPOINT_FILE_PATH /path/to/your/pretrained_model.pth
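If you prefer to drive this from Python rather than the command line, the same overrides can be applied to the config object. A sketch assuming the codebase's get_cfg helper; the configs are yacs/fvcore CfgNodes, so merge_from_file and merge_from_list are available:

```python
from slowfast.config.defaults import get_cfg

cfg = get_cfg()  # start from the defaults
cfg.merge_from_file("configs/Kinetics/SLOWFAST_SHUFFLENETV2_8x8_R50_stepwise_multigrid.yaml")
# The same overrides as the command line above:
cfg.merge_from_list([
    "TRAIN.ENABLE", False,
    "TEST.ENABLE", True,
    "TEST.CHECKPOINT_FILE_PATH", "/path/to/your/pretrained_model.pth",
])
print(cfg.TEST.CHECKPOINT_FILE_PATH)
```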
We follow the idea of Grad-CAM to visualize the saliency maps of the SlowFast networks. wdf_visualization contains the code.
Usage:
wdf_visualization$ python gradcam_video.py --help
usage: gradcam_video.py [-h] [--root_path ROOT_PATH] [--video_path VIDEO_PATH]
[--target_layer {s4_fuse,s5,s6,s5_fuse,s6_fuse,s8}]
[--yaml_cfg YAML_CFG]
[--checkpoint_pth CHECKPOINT_PTH]
Configs of Grad-CAM visualization.
optional arguments:
-h, --help show this help message and exit
--root_path ROOT_PATH
root path of video
--video_path VIDEO_PATH
video path
--target_layer {s4_fuse,s5,s6,s5_fuse,s6_fuse,s8}
specify the layer to visualization, it should be the
last layer name
--yaml_cfg YAML_CFG yaml cfg file path
--checkpoint_pth CHECKPOINT_PTH
checkpoint file path
For example:
python gradcam_video.py \
--yaml_cfg ../configs/Jester/SLOWFAST_MOBILENETV2_8x8_R50_stepwise_multigrid.yaml \
--checkpoint_pth /data1/ADAS/Jester_SlowFastMoibleNetV2_W1/checkpoints/checkpoint_epoch_00100.pyth \
--target_layer 's8' \
--root_path /root/ \
--video_path 20376.mp4
The target_layer to use for each model:

model_name | target_layer |
---|---|
ghostnet | s5 |
mobilenetv2 | s8 |
shufflenet | s4_fuse |
shufflenetv2 | s4_fuse |
dual | s5 |
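For readers who want the mechanics rather than the script, here is a minimal, generic Grad-CAM sketch (not the gradcam_video.py implementation): hook the target layer, weight its activations by the spatially and temporally averaged gradients of the class score, and apply ReLU. It assumes a single-pathway model with a plain (N, C, T, H, W) input; the SlowFast models here take a [slow, fast] list, so the forward call would need adapting.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, clip, target_layer, class_idx=None):
    """Minimal Grad-CAM sketch. clip: (N, C, T, H, W) -> saliency (N, T', H', W')."""
    acts, grads = {}, {}
    fh = target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    scores = model(clip)                                   # (N, num_classes)
    if class_idx is None:
        idx = scores.argmax(dim=1)                         # explain the top prediction
    else:
        idx = torch.full((scores.size(0),), class_idx,
                         dtype=torch.long, device=scores.device)
    scores.gather(1, idx.unsqueeze(1)).sum().backward()
    fh.remove(); bh.remove()

    a, g = acts["v"], grads["v"]                           # both (N, C', T', H', W')
    weights = g.mean(dim=(2, 3, 4), keepdim=True)          # channel importance
    cam = F.relu((weights * a).sum(dim=1))                 # weighted sum over channels
    return cam / (cam.amax(dim=(1, 2, 3), keepdim=True) + 1e-8)  # normalize to [0, 1]
```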
Here you can specify the pre-trained model and a video path to run prediction and generate a result.mp4 video. The demo also outputs the FPS of the model. To configure it:
- Set `TRAIN.ENABLE False` and `TEST.ENABLE False`.
- Specify `DEMO.DATA_SOURCE` as the input video path, e.g. `/data/my_video.mp4`.
- Specify `DEMO.OUTPUT_FILE` as `""` to display the predicted video, or as `/path/to/result.mp4` to save the result video in MP4 format.
python tools/run_net.py --cfg configs/Kinetics/C2D_8x8_R50.yaml TRAIN.ENABLE False TEST.ENABLE False TRAIN.CHECKPOINT_FILE_PATH /data1/SlowFast/checkpoints/checkpoint_epoch_00050.pyth
# SHUFFLENET W2 G3
python tools/run_net.py --cfg demo/Jester/SLOWFAST_SHUFFLENET_8x8_R50_stepwise_multigrid.yaml TEST.CHECKPOINT_FILE_PATH "/data1/ADAS/JESTER_SlowFastShuffle_W2_G3/checkpoints/checkpoint_epoch_00100.pyth" DEMO.OUTPUT_FILE /root/fps_result.mp4
# SHUFFLENETV2 W2
python tools/run_net.py --cfg demo/Jester/SLOWFAST_SHUFFLENETV2_8x8_R50_stepwise_multigrid.yaml TEST.CHECKPOINT_FILE_PATH "/data1/ADAS/JESTER_SlowFastShuffleV2_W2/checkpoints/checkpoint_epoch_00100.pyth" DEMO.OUTPUT_FILE /root/fps_result.mp4
# MOBILENETV2 W1
python tools/run_net.py --cfg demo/Jester/SLOWFAST_MOBILENETV2_8x8_R50_stepwise_multigrid.yaml TEST.CHECKPOINT_FILE_PATH "/data1/ADAS/Jester_SlowFastMoibleNetV2_W1_New/checkpoints/checkpoint_epoch_00080.pyth" DEMO.OUTPUT_FILE /root/fps_result.mp4
# GHOSTNET W1
python tools/run_net.py --cfg demo/Jester/SLOWFAST_GHOSTNET_8x8_R50_stepwise_multigrid.yaml TEST.CHECKPOINT_FILE_PATH "/data1/ADAS/Jester_SlowFastGhostNet_W1/checkpoints/checkpoint_epoch_00023.pyth" DEMO.OUTPUT_FILE /root/fps_result.mp4
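If you want an FPS number independent of the demo, a simple standalone timing sketch (the model handle and clip shapes are placeholders; SlowFast-style models take a [slow, fast] pair):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, inputs, warmup=10, iters=50):
    """Average clips per second over `iters` forward passes."""
    model.eval()
    for _ in range(warmup):                 # warm up kernels and caches
        model(inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()            # make GPU timing honest
    start = time.time()
    for _ in range(iters):
        model(inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return iters / (time.time() - start)

# Example shapes for an 8x8 SlowFast-style input (batch of 1):
# slow = torch.randn(1, 3, 8, 224, 224); fast = torch.randn(1, 3, 64, 224, 224)
# print(measure_fps(model, [slow, fast]))
```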
Below we report the accuracy of our pre-trained models, with the corresponding baseline models compared in the paper shown in parentheses.
Kinetics-400
Model Name | Hyper-Parameters | Acc (baseline) | Download |
---|---|---|---|
SlowFastShuffleNetV2 | Width=0.25 | 28.79 (24.11) | |
SlowFastShuffleNetV2 | Width=1.0 | 38.54 (47.26) | |
SlowFastShuffleNetV2 | Width=2.0 | 48.00 (54.22) | |
SlowFastShuffleNet | Width=2.0, Groups=3 | 53.84 (51.06) | |
SlowFastShuffleNet | Width=2.0, Groups=1 | 54.99 (50.19) | |
SlowFastMobileNetV2 | Width=1.0 | 48.12 (38.54) | |
SlowFastGhostNet | Width=1.0 | 46.03 | |
Jester-20BN
Model Name | Hyper-Parameters | Acc (baseline) | Download |
---|---|---|---|
SlowFastShuffleNetV2 | Width=2.0 | (93.71) | |
SlowFastShuffleNet | Width=2.0, Groups=3 | (93.54) | |
SlowFastMobileNetV2 | Width=1.0 | (94.59) | |