Deep Video Captioner using Stochastic Scenes
This repository contains parts of the main source code of my M.Sc. thesis (defended in January 2018), titled Deep Video Captioning using Deep Recurrent Neural Networks. The thesis document is available on request.
This work and its source code are heavily influenced by the work of the authors of CVPR 2017 submission #601.
The very high-level idea of this work is to try to detect scene-changes across frames in a video, and incorporate that information in the generation of the final video embedding. This is achieved using a stochastic step function defined in our model that detects sudden background changes.
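To make the idea concrete, here is a toy sketch of a stochastic boundary detector. It is not the thesis model: the scoring rule (a logistic function of the distance between consecutive frame features), the parameters `sharpness` and `threshold`, and the mean pooling are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random


def boundary_probability(prev_frame, frame, sharpness=4.0, threshold=1.0):
    """Map the distance between two frame feature vectors to a probability.

    Hypothetical scoring rule: a logistic function of the Euclidean
    distance, centred on `threshold`.
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev_frame, frame)))
    return 1.0 / (1.0 + math.exp(-sharpness * (distance - threshold)))


def segment_video(frames, rng=None):
    """Split a list of frame feature vectors into scenes.

    At every step a binary boundary variable is sampled from a Bernoulli
    distribution; when it fires, the current scene is closed.
    """
    if rng is None:
        rng = random.Random(0)
    scenes, current = [], [frames[0]]
    for prev_frame, frame in zip(frames, frames[1:]):
        if rng.random() < boundary_probability(prev_frame, frame):
            scenes.append(current)
            current = []
        current.append(frame)
    scenes.append(current)
    return scenes


def embed_video(frames, rng=None):
    """Average the frames within each scene, then average the scene vectors."""
    def mean(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    return mean([mean(scene) for scene in segment_video(frames, rng)])
```

In this sketch a large jump between consecutive frames makes a boundary very likely, so each scene contributes one vector to the final embedding regardless of how many frames it spans.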
The model performs on par with, or better than, the state-of-the-art video captioning models released before 2018, including the original model it builds on.
Feel free to contact me at [email protected].
- Theano 0.9.0
- Keras 1.1.0, configured to use Theano as the backend
Note: Be sure to have "image_dim_ordering": "th" and "backend": "theano" in your keras.json file.
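A quick way to verify the two settings is to read the file and compare. This is a minimal sketch, assuming keras.json lives in the usual ~/.keras/ location; the helper name `check_keras_config` is hypothetical and not part of this repository.

```python
import json
import os

# The two settings this project relies on.
REQUIRED = {"image_dim_ordering": "th", "backend": "theano"}


def check_keras_config(path=os.path.expanduser("~/.keras/keras.json")):
    """Return the settings in keras.json that differ from REQUIRED.

    The result maps each offending key to the value found in the file
    (None if the key is missing). An empty dict means the config is fine.
    """
    with open(path) as handle:
        config = json.load(handle)
    return {key: config.get(key)
            for key, value in REQUIRED.items()
            if config.get(key) != value}
```

Calling `check_keras_config()` before training should return an empty dict; anything else lists the keys to fix.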
This code supports the Montreal Video Annotation Dataset (M-VAD) and the MPII Movie Description dataset (MPII-MD).
Before doing anything else, follow the instructions for your dataset of choice, since their steps differ.
For M-VAD, request access and download the dataset from the MILA website. Then create a folder datasets/M-VAD in the root of the project, and prepare three subfolders inside it:
- datasets/M-VAD/videos. Put all the videos here, organized by movie as in the repository from MILA (for instance, you should have datasets/M-VAD/videos/21_JUMP_STREET/video/21_JUMP_STREET_DVS20.avi).
- datasets/M-VAD/annotations. Create three subfolders here: train, test and val, and put in each of them the .srt files for the training, test and validation splits respectively.
- datasets/M-VAD/features. Leave this folder empty.
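The M-VAD folder layout can also be created with a short script. This is a minimal sketch, assuming it is run from the project root; the function name `create_mvad_layout` is hypothetical and not part of this repository. The videos and .srt files still have to be copied in by hand.

```python
import os


def create_mvad_layout(root="."):
    """Create the empty M-VAD folder tree under `root` and return its paths."""
    base = os.path.join(root, "datasets", "M-VAD")
    folders = [os.path.join(base, "videos"), os.path.join(base, "features")]
    folders += [os.path.join(base, "annotations", split)
                for split in ("train", "test", "val")]
    for folder in folders:
        # Skip folders that already exist so the script can be re-run safely.
        if not os.path.isdir(folder):
            os.makedirs(folder)
    return folders
```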
Then, compute C3D and ResNet features by typing in a Python console:
from datasets import MVAD
dataset = MVAD()
dataset.compute_c3d_descriptors()
dataset.compute_resnet_descriptors()
For MPII-MD, request access and download the dataset from the MPI website. Then create a folder datasets/MPII-MD in the root of the project, and prepare three subfolders inside it:
- datasets/MPII-MD/jpgAllFrames. Unpack here the package with the JPEG frames as provided by MPI. For instance, you should have datasets/MPII-MD/jpgAllFrames/0001_American_Beauty/0001_American_Beauty_00.00.51.926-00.00.54.129/0001.jpg.
- datasets/MPII-MD/annotations. Put annotations-someone.csv, dataSplit.txt and uniqueTestIds.txt here.
- datasets/MPII-MD/features. Leave this folder empty.
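Before computing features, it can help to confirm that the MPII-MD layout is complete. This is a minimal sketch, assuming it is run from the project root; the function name `missing_mpii_md_paths` is hypothetical and not part of this repository.

```python
import os

# Annotation files the MPII-MD setup expects.
EXPECTED_ANNOTATIONS = ("annotations-someone.csv", "dataSplit.txt",
                        "uniqueTestIds.txt")


def missing_mpii_md_paths(root="."):
    """Return the expected MPII-MD folders and files that do not exist yet."""
    base = os.path.join(root, "datasets", "MPII-MD")
    expected = [os.path.join(base, "jpgAllFrames"),
                os.path.join(base, "features")]
    expected += [os.path.join(base, "annotations", name)
                 for name in EXPECTED_ANNOTATIONS]
    return [path for path in expected if not os.path.exists(path)]
```

An empty list means the folders and annotation files are in place; it does not check the contents of jpgAllFrames.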
Then, compute C3D and ResNet features by typing in a Python console:
from datasets import MPII_MD
dataset = MPII_MD()
dataset.compute_c3d_descriptors()
dataset.compute_resnet_descriptors()
Model evaluation is done using Pycoco, for which the source code of pycocoevalcap and pycocotools is included in this project. I'm sure there's a better way of incorporating them, but they are good enough for a one-man, purely academic project.
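For reference, the pycocoevalcap scorers take two dicts keyed by sample id: ground-truth captions and generated captions, each as a list of strings. The sketch below shows one way to build them and call the BLEU scorer. The helper names are hypothetical, the import path assumes pycocoevalcap is importable as a top-level package, and the lowercasing is a simplification of the tokenization normally applied before scoring.

```python
def to_coco_format(references, predictions):
    """Build the two dicts that the pycocoevalcap scorers take.

    references:  {video_id: [caption, ...]}  ground-truth captions
    predictions: {video_id: caption}         one generated caption per video
    """
    gts = {vid: [caption.strip().lower() for caption in captions]
           for vid, captions in references.items()}
    # Each prediction is wrapped in a one-element list, as the scorers expect.
    res = {vid: [predictions[vid].strip().lower()] for vid in gts}
    return gts, res


def bleu_scores(references, predictions):
    """Return the corpus-level BLEU-1 to BLEU-4 scores as a list."""
    # Imported lazily so to_coco_format works without the package installed.
    from pycocoevalcap.bleu.bleu import Bleu

    gts, res = to_coco_format(references, predictions)
    score, _ = Bleu(4).compute_score(gts, res)
    return score
```

Other scorers bundled in pycocoevalcap, such as Meteor, Rouge and Cider, expose the same `compute_score(gts, res)` method, so they accept the same two dicts.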