
1807.02929


MM 2018

[Arxiv 1807.02929] Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector [PDF] [notes]

Jia-Xing Zhong, Nannan Li, Weijie Kong, Tao Zhang, Thomas H. Li, Ge Li

read 2019/07/12

Objective

Perform temporal action localization in a weakly supervised setting (only video-level labels are available). Train several classifiers that each focus on different parts of the clips, progressively removing the snippets that were most confidently classified as belonging to the ground-truth class.

The first classifier has access to all snippets, the second only to those that were not the most salient for the first, the third to those that were not salient for the first two, and so on.
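A minimal sketch of this progressive erasing loop (my paraphrase, not the authors' code). The per-step classifier training is abstracted into `snippet_scores_fn`, a hypothetical callable that returns one score per remaining snippet for the video-level label:

```python
import numpy as np

def progressive_erasing(snippet_scores_fn, num_snippets, num_steps=3, erase_per_step=2):
    """Return which snippets each step erases and which remain at the end."""
    remaining = list(range(num_snippets))     # snippet indices not yet erased
    erased_per_step = []

    for _ in range(num_steps):
        scores = snippet_scores_fn(remaining)            # one score per remaining snippet
        order = np.argsort(scores)[::-1]                 # most confidently classified first
        salient = [remaining[i] for i in order[:erase_per_step]]
        erased_per_step.append(salient)
        # The next classifier is trained without the snippets the previous ones relied on.
        remaining = [i for i in remaining if i not in salient]

    return erased_per_step, remaining
```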

Synthesis

Model

Erasing

  • erasing is based on classification outputs for all the snippets in the video
  • a soft mask is computed to weight the classification scores
    • intra-snippet category score weighting is obtained by computing the softmax of the classification scores for a given snippet
    • inter-snippet weighting is obtained by min-max normalizing a given activation: it is rescaled linearly between 0 and 1 using the minimum and maximum activations for that category across the different snippets. This value is then rescaled using a discounting threshold $\tau \in [0, 1]$ (the function increases linearly between 0 and $\tau$ and saturates at 1 beyond $\tau$). If $\tau$ is small, many snippets are kept; if $\tau$ is large, many are discarded.
  • the intra-snippet and inter-snippet scores are multiplied, and the snippet is discarded with probability equal to the obtained score (see the sketch after this list)
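A minimal sketch of this soft erasing mask, as I read it (not the authors' code); the per-class min-max normalization and the threshold `tau` follow the description above, while the exact saturation form and default values are assumptions:

```python
import numpy as np

def erasing_mask(scores, gt_class, tau=0.6, rng=None):
    """Compute per-snippet erasing probabilities for one video.
    scores: (num_snippets, num_classes) raw classification scores.
    gt_class: index of the video-level (ground-truth) category.
    tau: discounting threshold in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng

    # Intra-snippet weighting: softmax over categories for each snippet.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    intra = exp / exp.sum(axis=1, keepdims=True)

    # Inter-snippet weighting: per-category min-max rescaling across snippets,
    # then a saturating rescale (linear on [0, tau], clamped to 1 above tau).
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    inter = (scores - lo) / np.maximum(hi - lo, 1e-8)
    inter = np.clip(inter / tau, 0.0, 1.0)

    erase_prob = intra * inter                  # combined soft mask
    p = erase_prob[:, gt_class]
    erased = rng.random(p.shape) < p            # drop snippet i with probability p[i]
    return erased, erase_prob
```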

Collecting

  • snippets are collected according to their high erasing scores, applying the classifiers in the same order as during training
  • post-processing is applied to smooth the results and get temporally contiguous segments. The knowledge that neighboring snippets are likely to have the same label is leveraged using a fully connected conditional random field, which takes into account the information of all snippets to update the score of a given snippet. The strength of the connections is weighted by the temporal distance between the snippets, using a Gaussian kernel (see the smoothing sketch after this list)
  • Compute refined probabilities using these constraints
  • select the clips with final score > 0.5 as final temporal detection results
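A simplified stand-in for this post-processing step (a plain Gaussian-weighted, mean-field-style smoothing followed by thresholding, not the paper's exact CRF inference; the 0.5/0.5 mixing weight and `sigma` are assumptions):

```python
import numpy as np

def smooth_and_segment(probs, positions, sigma=5.0, threshold=0.5, iters=5):
    """probs: (num_snippets,) per-snippet action probabilities for one class.
    positions: (num_snippets,) temporal indices of the snippets."""
    # Pairwise affinities from a Gaussian kernel on temporal distance.
    d = positions[:, None] - positions[None, :]
    w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    w /= w.sum(axis=1, keepdims=True)

    q = probs.astype(float).copy()
    for _ in range(iters):
        # Pull each snippet toward the scores of temporally close snippets.
        q = 0.5 * probs + 0.5 * (w @ q)

    detections = q > threshold     # contiguous runs above 0.5 form the final segments
    return q, detections
```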

Experiments

  • the FC-CRF post-processing appears to be crucial and improves scores significantly

Implementation details

  • The model is first trained for video-level classification