This project explores whether human pose information can help action recognition. For the complete report, see here.
We implemented three different models:
- attention-pooling method (with ResNet101 backbone)
- C3D method
- two-stream method
First, we explored whether the use of human pose information would help in action recognition tasks.
Pose information is injected in two ways: in the C3D architecture, the background is weakened using the mask information, and in the attention-pooling method, joint-position heat-maps are added directly to the top-down attention.
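The sketch below illustrates these two injection points in PyTorch. It is a minimal illustration, not the exact code from this repository; the function names, tensor shapes, the blending factor `alpha`, and the way the heat-map is fused with the top-down attention (`beta`, softmax normalization) are assumptions made for the example.

```python
import torch


def weaken_background(frames, masks, alpha=0.5):
    """Attenuate non-person pixels before feeding clips to C3D.

    frames: (B, 3, T, H, W) RGB clip; masks: (B, 1, T, H, W) person mask in [0, 1].
    alpha controls how strongly the background is suppressed (illustrative value).
    """
    return frames * (masks + alpha * (1.0 - masks))


def pose_guided_attention(features, top_down_attn, joint_heatmap, beta=1.0):
    """Fuse joint heat-maps into the top-down attention map, then pool features.

    features:       (B, C, H, W) backbone feature maps (e.g. from ResNet101)
    top_down_attn:  (B, 1, H, W) top-down attention map
    joint_heatmap:  (B, 1, H, W) summed joint heat-maps resized to the feature grid
    """
    attn = top_down_attn + beta * joint_heatmap                    # inject pose evidence
    weights = torch.softmax(attn.flatten(2), dim=-1).view_as(attn)  # normalize over H*W
    return (features * weights).sum(dim=(2, 3))                    # attention-weighted pooling
```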
The results from both methods show a slight improvement when human pose information is used:
With and without pose information in the C3D architecture:
With and without pose information in the attention-pooling architecture:
Since human pose information does help with action recognition, the next step is to figure out the proper way of using it.
We modified the two-stream architecture and replaced the optical-flow stream with VGG16. The intuition is that VGG16 lets us start from a network pre-trained on ImageNet instead of training the entire optical-flow stream from scratch, and it also improves efficiency because the costly optical-flow computation is no longer needed.
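A minimal sketch of the modified two-stream model, assuming torchvision pre-trained backbones. The late fusion by averaging class scores is an assumption for the example; the same pattern applies when AlexNet replaces VGG16 in the temporal stream, as in the experiments described next.

```python
import torch.nn as nn
from torchvision import models


class ModifiedTwoStream(nn.Module):
    """Two-stream model where the optical-flow stream is replaced by an
    ImageNet-pretrained CNN that takes a stack of F RGB frames."""

    def __init__(self, num_classes, num_frames):
        super().__init__()
        # Spatial stream: a single RGB frame through pre-trained ResNet101.
        self.spatial = models.resnet101(pretrained=True)
        self.spatial.fc = nn.Linear(self.spatial.fc.in_features, num_classes)

        # Temporal stream: F stacked RGB frames (3 * F channels) through VGG16.
        # Replacing the first conv here means its weights are NOT loaded from
        # the pre-trained model (the baseline option); the next sketch shows
        # the replication-based initialization instead.
        self.temporal = models.vgg16(pretrained=True)
        self.temporal.features[0] = nn.Conv2d(3 * num_frames, 64,
                                              kernel_size=3, padding=1)
        self.temporal.classifier[6] = nn.Linear(4096, num_classes)

    def forward(self, center_frame, frame_stack):
        # Late fusion by averaging the two streams' class scores (illustrative).
        return 0.5 * (self.spatial(center_frame) + self.temporal(frame_stack))
```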
The problem is how to properly use the ImageNet pre-trained model: the temporal-stream input has 3 * F channels (F being the number of frames), while the ImageNet pre-trained model expects a 3-channel input. We compared two options: not loading the pre-trained weights of the first layer at all, and initializing the first layer by replicating the pre-trained first-layer weights F times along the channel dimension. We use a pre-trained ResNet101 for the spatial stream and AlexNet (which worked better than VGG16 in our experiments) for the temporal stream. As the results show, replicating the first-layer weights F times helps the network converge considerably faster and reach a better optimum.
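A sketch of the replication-based initialization, assuming PyTorch/torchvision; the helper name and the value `num_frames=10` are just for illustration.

```python
import torch.nn as nn
from torchvision import models


def replicate_first_conv(conv, num_frames):
    """Build a 3*F-channel first conv layer initialized by tiling the
    3-channel ImageNet-pretrained kernel F times along the input-channel axis."""
    new_conv = nn.Conv2d(3 * num_frames, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    # (out_ch, 3, k, k) -> (out_ch, 3*F, k, k): every frame sees the same RGB filters.
    new_conv.weight.data.copy_(conv.weight.data.repeat(1, num_frames, 1, 1))
    if conv.bias is not None:
        new_conv.bias.data.copy_(conv.bias.data)
    return new_conv


# Example: temporal-stream AlexNet taking a stack of F = 10 RGB frames.
temporal = models.alexnet(pretrained=True)
temporal.features[0] = replicate_first_conv(temporal.features[0], num_frames=10)
```

With this initialization every frame in the stack is filtered by the same ImageNet-learned first-layer kernels at the start of training, which is consistent with the faster convergence we observed compared to a randomly initialized first layer.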
We also tried using the puppet mask to incorporate pose information into the two-stream architecture, weakening the context so that the network can focus on changes in pose. As the results show, however, weakening the background does not work very well in the two-stream architecture.
In conclusion, human pose information does help with action recognition, and we improved both the accuracy and efficiency of the two-stream architecture by replacing the optical-flow stream with an ImageNet-pretrained CNN.
Below is the authorship information for this project.
- Author: Shangwu Yao
- Email: [email protected]
Copyright (C) 2018, Shangwu Yao. All rights reserved.