\documentclass{article}
% if you need to pass options to natbib, use, e.g.:
\PassOptionsToPackage{numbers, compress}{natbib}
% before loading nips_2016
%
% to avoid loading the natbib package, add option nonatbib:
% \usepackage[nonatbib]{nips_2016}
%\usepackage{nips_2016}
% to compile a camera-ready version, add the [final] option, e.g.:
\usepackage[final]{nips_2016}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{amsmath} % equations
\usepackage{color}
\usepackage{graphicx}
\usepackage{subcaption}
\title{Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks}
% The \author macro works with any number of authors. There are two
% commands used to separate the names and addresses of multiple
% authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to
% break the lines. Using \AND forces a line break at that point. So,
% if LaTeX puts 3 of 4 authors names on the first line, and the last
% on the second line, try using \AND instead of \And before the third
% author name.
\author{
Alberto Montes \\
Universitat Politècnica de Catalunya \\
Barcelona, Spain \\
\texttt{[email protected]} \\
\And
Amaia Salvador \\
Image Processing Group \\
Universitat Politècnica de Catalunya \\
Barcelona, Spain \\
\texttt{[email protected]} \\
\And
Santiago Pascual \\
TALP Research Center \\
Universitat Politècnica de Catalunya \\
Barcelona, Spain \\
\texttt{[email protected]} \\
\And
Xavier Giro-i-Nieto \\
Image Processing Group \\
Universitat Politècnica de Catalunya \\
Barcelona, Spain \\
\texttt{[email protected]} \\
}
\begin{document}
% \nipsfinalcopy is no longer used
\maketitle
\begin{abstract}
% AMAIA: The first sentence of the abstract may not be necessary, since it is too introductory. I have commented it out and modified the abstract
This work proposes a simple pipeline to classify and temporally localize activities in untrimmed videos. Our system uses features from a 3D Convolutional Neural Network (C3D) as input to train a recurrent neural network (RNN) that learns to classify video clips of 16 frames. After clip prediction, we post-process the output of the RNN to assign a single activity label to each video and to determine the temporal boundaries of the activity within the video. We show that our system achieves competitive results in both tasks with a simple architecture. We evaluate our method on the ActivityNet Challenge 2016, achieving a 0.5874 mAP and a 0.2237 mAP in the classification and detection tasks, respectively. Our code and models are publicly available at: \url{https://github.com/imatge-upc/activitynet-2016-cvprw}
%AMAIA: the mAP values (or whichever metric applies) need to be added.
%ALBERTO: DONE
\end{abstract}
\section{Introduction}
Recognizing activities in videos has become a hot topic over recent years due to the continuous growth of video capturing devices and online repositories.
This large amount of data requires automatic indexing in order to be accessed after capture.
The recent advances in video coding, storage and computational resources have boosted research in the field towards new and more efficient solutions for organizing and retrieving video content.
Impressive progress has been reported in the recent literature for video classification~\cite{tran2014learning,tran2015deep,wang2015towards,yao2015describing}, which requires assigning a label to the input video. While this task is already challenging, it has typically been explored with videos that are trimmed beforehand.
However, a video classification system should be able to recognize activities in untrimmed videos, and find the temporal segments in which they appear.
This second challenge has recently been posed in the ActivityNet Challenge 2016~\cite{caba2015activitynet}, in which participants are asked to provide both a single activity label for each video and the temporal segment in which the activity happens.
% AMAIA: With this, do you mean that the temporal detection task is new to ActivityNet? I have modified it a bit... It is not very clear... If so, I would mention the challenge directly and cite it.
To address both of these challenges at the same time, we propose a simple pipeline composed of a 3D-CNN, which exploits spatial and short-term temporal correlations, followed by a recurrent neural network, which exploits long-term temporal correlations.
% AMAIA: Let's decide whether we want to say 'CNN' or 'ConvNet'. For now I have left everything as CNN. Better not to mix them
\section{Related work}
% AMAIA: I would skip the related work on datasets to save space. I have commented it out.
%The activity classification and temporal localization tasks have been proposed with datasets such as Sports1M~\cite{KarpathyCVPR14}, THUMOS~\cite{THUMOS15} or ActivityNet~\cite{caba2015activitynet} published in the recents years. Many frameworks have been proposed to solve this tasks.
Several works in the literature have used 2D-CNNs to exploit the spatial correlations between frames of a video, combining their outputs with different strategies~\cite{gkioxari2015contextual,yeung2015end,ballas2015delving}. Others have used optical flow as an additional input to the 2D-CNN~\cite{wang2015towards}, which provides information about temporal correlations. % AMAIA: All of this needs to be revised...
Later on, 3D-CNNs (known as C3D) were proposed in~\cite{tran2014learning}; they are able to exploit short-term temporal correlations between frames and have been shown to work remarkably well for video classification~\cite{tran2014learning,tran2015deep}. C3D has also been used for temporal detection in~\cite{scnn_shou_wang_chang_cvpr16}, where a multi-stage C3D architecture is used to classify video segment proposals.
For temporal activity detection, recent works have proposed the usage of Long Short-Term Memory units (LSTM)~\cite{hochreiter1997long}.
LSTMs are a type of RNN that can better exploit long- and short-term temporal correlations in sequences, which makes them suitable for video applications.
LSTMs have been used alongside CNNs for video classification~\cite{yao2015describing} and activity localization in videos~\cite{yeung2015every}.
In this paper, we combine the capabilities of 3D-CNNs and RNNs into a single framework. We design a simple network that takes a sequence of video features from the C3D model~\cite{tran2014learning} as input to an RNN and classifies each clip into an activity category.
% AMAIA: A sentence is missing here saying what we do differently and how we relate to the state of the art...
\section{Proposed Architecture}
% AMAIA: I have removed the \paragraph headings, since they take up too much space and we do not have that many things to explain...
% AMAIA: I know it is only 4 pages...but I think it would be good to include the architecture figure, no?? Perhaps, to make the most of the space, you can make a double (or triple) figure, with one thing per column: e.g. 1. Architecture (figure 1.1 of your thesis) 2. visual examples (showing a frame of the video, the prediction and the ground truth). The figure I also like is 4.10 or 4.11 of the thesis (maybe not the whole thing), where you explain different predictions (sometimes the detection is right but the global class is not, or the other way around...). I think these figures will help you explain the results and conclusions. https://imatge.upc.edu/web/sites/default/files/pub/xMontes.pdf
% AMAIA: Regarding the figures, if you see they take up too much space you can drop one... it depends on what you want to explain in the experiments section
We use the C3D model proposed in~\cite{tran2014learning} to extract features for all videos in the database. We split the videos into 16-frame clips and resize them to 171$\times$128 to fit the input of the C3D model. Features from the fully connected layer \texttt{fc6} are extracted for each video clip.
% AMAIA: The details of the network are not needed, since we did not design it. I am commenting them out
%This network conducts 3D convolutions/pooling which operates in spatial and temporal dimensions simultaneously, and therefore can capture both appearance and motion.
%All 3D pooling layers use a 2$\times$2 max pooling with stride 2 in spatial dimensions, the same kernel size as in the temporal dimension except the first pooling layer (\texttt{pool1}) which has kernel size 1 and stride 1. All 3D convolutional filters have kernel size 3 and stride 1 in all dimensions.
%The network architecture used is the same as in the original paper except that it has been decided to work on the next stage with the values obtained at the \texttt{fc6} layer as feature vector for each 16-frame clip. The specifications of the network layers as well as the number of kernels by layer is the following one: \texttt{conv1(64) - pool1 - conv2a(128) - pool2 - conv3a(256) - conv3b(256) - pool3 - conv4a(512) - conv4b(512) - pool4 - conv5a(512) - conv5b(512) - pool(5) - fc6(4096)}.
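The clip extraction step amounts to splitting each decoded video into consecutive blocks of 16 frames; the following is a minimal NumPy sketch (function and variable names are ours, and the exact decoding and resizing code may differ):
\begin{verbatim}
import numpy as np

def make_clips(frames, clip_len=16):
    # frames: (T, 128, 171, 3) array of frames already resized to 171x128.
    # Returns non-overlapping clips of shape (T//clip_len, clip_len, 128, 171, 3),
    # ready to be fed to C3D; trailing frames that do not fill a clip are dropped.
    n_clips = len(frames) // clip_len
    return frames[:n_clips * clip_len].reshape((n_clips, clip_len) + frames.shape[1:])
\end{verbatim}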
\subsection{Architecture}
\label{sec:architecture}
We design a network that processes a sequence of C3D \texttt{fc6} features from a video and returns a sequence of class probabilities for each 16-frame clip.
We use LSTM layers, trained with dropout with probability $p = 0.5$, followed by a fully connected layer with a softmax activation. Figure~\ref{fig:global_pipeline} shows the proposed architecture.
Different configurations of the number of LSTM layers $N$ and the number of cells $c$ have been tested and are compared in Section~\ref{sec:resultats}.
Our proposed system has the following architecture: \texttt{input(4096) - dropout(0.5) - N $\times$ lstm(c) - dropout(0.5) - softmax(K+1)}, where $K$ is the number of activity classes in the dataset.
\begin{figure}[ht]
\centering
\includegraphics[width=.5\linewidth]{img/schematic_pipeline}
\caption{Global architecture of the proposed pipeline.}
\label{fig:global_pipeline}
\end{figure}
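As an illustration, the whole model fits in a few lines of Keras-style code; the sketch below is illustrative only (layer names, defaults and hyper-parameters are our choices for this example, not the exact training configuration):
\begin{verbatim}
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, TimeDistributed

def build_model(timesteps=20, feat_dim=4096, n_layers=1,
                cells=512, n_classes=200):
    # Sequence of C3D fc6 features in, per-clip class probabilities out
    # (n_classes activity classes plus one background class).
    model = Sequential()
    model.add(Dropout(0.5, input_shape=(timesteps, feat_dim)))
    for _ in range(n_layers):
        model.add(LSTM(cells, return_sequences=True))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(n_classes + 1, activation='softmax')))
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    return model
\end{verbatim}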
%\paragraph{Training.}
%To train the recurrent network with the extracted features, was found that most of the videos had a clip length larger than a valid timestep at training, so a stateful approach was used at training where the memory was not reset between batches and at the same batch position was placed the continuation of the features of the previous seen video.
% AMAIA: This stateful stuff and so on is not needed, right?
\subsection{Post-Processing}
Given a video, the prediction of our model is a sequence of class probabilities for each 16-frame video clip. This output is post-processed to predict the activity class and to temporally localize it.
First, to obtain the activity prediction for the whole video, we compute the average of the class probabilities over all video clips in the video.
The class with the maximum average probability is taken as the predicted class.
%\begin{equation}
% p(x) = \frac{1}{T} \sum_i^{T} p_i(x)
%\end{equation}
%where $T$ is number of video clips of 16 frames in the video.
% AMAIA: No need to include the averaging equation...
To obtain the temporal localization of the predicted activity class, we first apply a mean filter over a window of $2k+1$ samples to the predicted sequence to smooth the values over time (see Equation~\ref{eq:smooth}).
Then, the probability of \textit{activity} (vs.\ \textit{no activity}) is computed for each 16-frame clip: the \textit{activity} probability is the sum of the probabilities of all activity classes, and the \textit{no activity} probability is the one assigned to the background class.
Finally, only those clips with an \textit{activity} probability above a threshold $\gamma$ are kept and labeled with the previously predicted class.
Notice that, for each video, all predicted temporal segments are assigned the same activity class.
%a threshold $\gamma$ is set and all clips with probability higher than $\gamma$ are considered as part of the detected temporal segment for the previously assigned activity.
\begin{equation}
\tilde{p}_i(x) = \frac{1}{2k+1} \sum_{j=i-k}^{i+k} p_j(x)
\label{eq:smooth}
\end{equation}
% \begin{equation}
% \label{eq:activity_probability}
% \tilde{p}^a_i = \sum_{x=1}^{N}\tilde{p}_i(x), \text{where } x = \begin{cases}
% 0, & \text{background class} \\
% i, & 1 \leq i \leq N \text{ activity classes}
% \end{cases}
% \end{equation}
% AMAIA: Is this equation right?? I mean, there is no need to write where x = blabla..., it is simply the sum of the probabilities from 1 to N, right? In fact I don't think we even need it... I have explained it in the text. Maybe it could be made clearer...
\section{Experiments}
\subsection{Dataset}
For all our experiments we use the dataset provided in the ActivityNet Challenge 2016~\cite{caba2015activitynet}. This dataset contains 640 hours of video and 64 million frames. It is composed of untrimmed videos and provides temporal annotations for the ground-truth class labels. The dataset is split into 50\% for training, 25\% for validation and 25\% for testing.
\subsection{Training}
We train the network described in Section~\ref{sec:architecture} with the negative log likelihood loss, assigning a lower weight to background samples to deal with dataset imbalance (see Equation~\ref{eq:loss}).
\begin{equation}
L(p,q) = - \sum_x \alpha(x) p(x) \log (q(x)), \text{ where } \alpha(x) =
\begin{cases}
\rho, & x = \text{background class}\\
1, & \text{otherwise}
\end{cases}
\label{eq:loss}
\end{equation}
where $q$ is the predicted probability distribution and $p$ the ground truth probability distribution. In our experiments, we set $\rho = 0.3$.
The network was trained for 100 epochs, with a batch size of 256, where each sample in the minibatch is a sequence of 20 16-frame video clips. We use RMSprop~\cite{dauphin2015rmsprop} with a learning rate set to $10^{-5}$.
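For reference, the weighted loss amounts to just a few array operations; the following is a minimal NumPy sketch, assuming one-hot targets and taking the convention (ours, for illustration) that the background class sits at index 0:
\begin{verbatim}
import numpy as np

def weighted_nll(p_true, q_pred, rho=0.3, background_idx=0, eps=1e-8):
    # p_true: (T, K+1) one-hot ground-truth distributions per clip.
    # q_pred: (T, K+1) predicted class probabilities per clip.
    # Background clips are down-weighted by rho, as in the loss above.
    is_background = p_true.argmax(axis=1) == background_idx
    alpha = np.where(is_background, rho, 1.0)            # per-clip weight
    nll = -(p_true * np.log(q_pred + eps)).sum(axis=1)   # cross-entropy per clip
    return float((alpha * nll).mean())
\end{verbatim}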
% AMAIA: Here you should also add the number of iterations (or epochs), learning rate, optimizer, batch size. If you have it, the training time on the GPU you trained on.
% ALBERTO: DONE
\subsection{Results}
\label{sec:resultats}
We evaluate our models using the metrics proposed in the ActivityNet Challenge. For video classification, we use mean average precision (mAP) and Hit@3. For temporal localization, a prediction is marked as correct only when it has the correct category and an IoU with the ground-truth instance larger than $0.5$; mAP is used to evaluate the performance over the entire dataset.
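For clarity, the temporal IoU used to match a detection against a ground-truth segment can be computed as follows (an illustrative sketch with segments given as start/end times, not the official evaluation code):
\begin{verbatim}
def temporal_iou(pred, gt):
    # pred, gt: (start, end) temporal segments, e.g. in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
\end{verbatim}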
\begin{table}[h]
\parbox{.45\linewidth}{
\centering
\begin{tabular}{l|cc}
\textbf{Architecture} & \textbf{mAP} & \textbf{Hit@3} \\
\hline
3 x 1024-LSTM & 0.5635 & 0.7437 \\
2 x 512-LSTM & 0.5492 & 0.7364 \\
1 x 512-LSTM & \bf0.5938 & \bf0.7576 \\
\end{tabular}
\vspace{.5cm}
\caption{Results for classification task comparing different architectures.}
\label{table:classification_by_architecture}
}
\hfill
\parbox{.45\linewidth}{
\centering
\begin{tabular}{l|ccc}
\textbf{$\gamma$} & \textbf{$k=0$} & \textbf{$k=5$} & \textbf{$k=10$} \\
\hline
0.2 & 0.20732 & \bf0.22513 & 0.22136 \\
0.3 & 0.19854 & 0.22077 & 0.22100 \\
0.5 & 0.19035 & 0.21937 & 0.21302 \\
\end{tabular}
\vspace{.2cm}
\caption{mAP with an IoU threshold of $0.5$ for different values of $k$ and $\gamma$ in the post-processing.}
\label{table:detection_postprocessing_comparison}
}
\end{table}
Table~\ref{table:classification_by_architecture} shows the performance of different network architectures. We tested configurations with different numbers of LSTM layers and cells. These results indicate that all the networks had enough capacity to learn the data, but some over-fitting was observed with the deeper architectures; the best results were obtained with a single layer of 512 LSTM cells.
Fixing this architecture, we performed experiments for the temporal detection task using different values of $k$ and $\gamma$ in the post-processing stage.
Table~\ref{table:detection_postprocessing_comparison} shows the results for the temporal activity localization task, where the benefit of the mean smoothing filter on the localization performance can be seen.
Figures~\ref{fig:example_results_detect} and~\ref{fig:example_results_classification} show some examples of classification and temporal localization prediction for some instances of the dataset.
% AMAIA: I am not sure the examples in the figure are the best ones...don't you have one where the detection is right but the class is not? or the other way around? Didn't you include one in the defense presentation?
% ALBERTO: I have now included an example where what you describe happens
% AMAIA: Seen
\begin{figure}[ht]
\centering
\includegraphics[width=1\linewidth]{img/activity_temporal_localization_8.png}
\includegraphics[width=1\linewidth]{img/activity_temporal_localization_55.png}
\includegraphics[width=1\linewidth]{img/activity_temporal_localization_20.png}
%\includegraphics[width=1\linewidth]{img/activity_temporal_localization_35.png}
\caption{Examples of temporal activity localization predictions.}
\label{fig:example_results_detect}
\end{figure}
\begin{figure}[h]
\centering
\begin{subfigure}[b]{.5\textwidth}
\includegraphics[width=0.9\linewidth]{img/results_visualization_classification_1}
\texttt{Video ID: ArzhjEk4j\_Y \\
Ground Truth: Building sandcastles \\
\\
Prediction: \\
0.7896 Building sandcastles \\
0.0073 Doing motocross \\
0.0049 Beach soccer \\}
\end{subfigure}%
\begin{subfigure}[b]{.5\textwidth}
\includegraphics[width=0.9\linewidth]{img/results_visualization_classification_2}
\texttt{Video ID: AimG8xzchfI \\
Ground Truth: Curling \\
\\
Prediction: \\
0.3843 Shoveling snow \\
0.1181 Ice fishing \\
0.0633 Waterskiing \\}
\end{subfigure}
\caption{Examples of activity classification.}
\label{fig:example_results_classification}
\end{figure}
% AMAIA: I think one section explaining all the experiments is enough. I have named it 'Results'
% AMAIA: If you want to talk about the bias towards sport categories here, you should add the figure you have that shows it.
% ALBERTO: I think a second figure might not fit...maybe put it in an appendix???
\section{Conclusion}
In this paper we propose a simple pipeline for both classification and temporal localization of activities in videos. Our system achieves competitive results on both tasks. The sequence-to-sequence nature of the proposed network offers the flexibility to extend it to more challenging video processing tasks, e.g.\ videos in which more than a single activity is present. Future work will address end-to-end training of the model (3D-CNN + RNN) to learn feature representations better suited to the dataset.
% ALBERTO: what else should I explain in the conclusions???
\section*{}
{\small
\bibliographystyle{plainnat}
\bibliography{references}
}
\end{document}