Code for my final degree thesis Fast Video Object Segmentation by Pixel-Wise Feature Comparison (PiWiVOS).
This final degree thesis tackles the task of One-Shot Video Object Segmentation, where multiple objects have to be separated from the background using the ground truth masks for them in the very first frame only. Objects' large pose and scale variations throughout the sequence, alongside occlusions happening among them, make this task extremely challenging. Fast Video Object Segmentation by Pixel-Wise Feature Comparison—which is trained and tested on the well-known DAVIS dataset—goes a step further, and besides achieving comparable results with state-of-the-art methods, it works one order of magnitude faster than them, or even two in some cases.
This version of the project (updated from the original one for better reproducibility) has been built using:
- Python 3.7
- PyTorch 1.8.1 + Torchvision 0.9.1
- NumPy 1.19.5
- Pillow 8.3.0
- Scikit-image 0.16.2
There are two main scripts: `train.py`, which serves to train a model, and `test.py`, which is used to evaluate a checkpoint and optionally export the predicted masks.
The complete usage can be seen by typing:

```shell
$ train.py -h
```
This script has many arguments which control specific parts of our method. Section 4.2 of the Thesis introduces all these parameters, and Chapter 5 presents a complete study of their optimal values, which are used as the defaults.
Apart from method-specific parameters, the most important arguments are:
- `--job_name JOB_NAME`: Used to identify the job and to create a log directory for it at `logs/JOB_NAME`, in which TensorBoard logs and checkpoints will be stored.
- `--path PATH`: Path to the DAVIS dataset. Defaults to `data/DAVIS`.
- `--model_name ['piwivos', 'piwivosf']`: Name of the model to use. PiWiVOS uses a ResNet-50 backbone, while PiWiVOS-F uses a ResNet-34 and has a lower output resolution. See Chapter 5 of the Thesis for more information. Defaults to `'piwivos'`.
The script trains the model from a pre-trained ResNet using the official DAVIS 2017 `train` set, and validates using the `val` one.
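For instance, a training run might be launched as follows (the job name `piwivos_run1` is just an illustrative placeholder; all flags shown are documented by `train.py -h`):

```shell
$ train.py --job_name piwivos_run1 --path data/DAVIS --model_name piwivos
```

Logs and checkpoints for this run would then accumulate under `logs/piwivos_run1`.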
The complete usage can be seen by typing:

```shell
$ test.py -h
```
The main arguments are:
- `--path PATH`: Path to the DAVIS dataset. Defaults to `data/DAVIS`.
- `--checkpoint_path CHECKPOINT_PATH`: Path to the checkpoint file (`.pth`) to evaluate. Defaults to `checkpoints/piwivos/piwivos.pth`, following this repository's structure.
- `--model_name ['piwivos', 'piwivosf']`: Name of the model to use. Must match the loaded checkpoint. Defaults to `'piwivos'`.
- `--image_set ['val', 'test-dev', 'test-challenge']`: Set of images on which to evaluate the model. Defaults to `'val'`.
- `--export`: When set, the script exports the predicted masks to disk. They are stored in a `results` subdirectory next to the evaluated checkpoint.
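For instance, evaluating the bundled PiWiVOS checkpoint on the validation set and exporting its masks might look like this (flag values are illustrative; all flags shown are documented by `test.py -h`):

```shell
$ test.py --checkpoint_path checkpoints/piwivos/piwivos.pth --model_name piwivos --image_set val --export
```

The exported masks would then be written to a `results` subdirectory next to `checkpoints/piwivos/piwivos.pth`.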
PiWiVOS is trained and evaluated using the DAVIS 2017 semi-supervised 480p dataset, which can be downloaded from this link.
Nonetheless, our code can also be used with other DAVIS data. First, our dataloader supports the DAVIS 2016 semi-supervised 480p dataset, a subset of the DAVIS 2017 version that contains only single-object sequences, making for an easier task. If the user wants to perform this single-object task on the (larger) DAVIS 2017 dataset, the dataloader also has an option to merge the individual object masks into a "single-object mask".
See the DAVIS dataloader.
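The merging option can be pictured as follows. This is a minimal NumPy sketch (not the repository's actual dataloader code), assuming DAVIS-style annotations where each pixel holds an integer object ID and 0 is background:

```python
import numpy as np

def merge_objects(mask: np.ndarray) -> np.ndarray:
    """Collapse a multi-object DAVIS mask (pixel value = object ID,
    0 = background) into a single foreground/background mask."""
    return (mask > 0).astype(mask.dtype)

# Toy 2x3 annotation with two objects (IDs 1 and 2).
multi = np.array([[0, 1, 1],
                  [2, 2, 0]], dtype=np.uint8)
single = merge_objects(multi)  # both object IDs collapse to foreground (1)
```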
Results reported by this repository's checkpoints are slightly better than the ones in the Thesis strictly due to seeding and possible library updates.
| Model Name | J Mean | F Mean | G Mean (J&F) |
|---|---|---|---|
| PiWiVOS | 67.95% | 74.93% | 71.42% |
| PiWiVOS-F | 56.17% | 54.46% | 55.32% |
Table: Results on the `val` set. See the Thesis for the original results on the `val` and `test-dev` sets.
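For reference, J is the region similarity (intersection over union) between a predicted mask and the ground truth, and the reported J Mean averages it over objects and frames. A minimal sketch of the per-frame computation for binary masks (not the official DAVIS evaluation code):

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J = |pred AND gt| / |pred OR gt| for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return np.logical_and(pred, gt).sum() / union

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 1, 1],
               [0, 0, 0]])
print(jaccard(pred, gt))  # 0.5  (2 overlapping pixels / 4 pixels in the union)
```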
You can cite our work using:
```bibtex
@phdthesis{palliser_sans_2019,
  title={Fast Video Object Segmentation by Pixel-Wise Feature Comparison},
  url={http://hdl.handle.net/2117/169370},
  abstractNote={This final degree thesis tackles the task of One-Shot Video Object Segmentation, where multiple objects have to be separated from the background only having the ground truth masks for them in the very first frame. Their large pose and scale variations throughout the sequence, and the occlusions happening between them make this task very difficult to solve. Fast Video Object Segmentation by Pixel-Wise Feature Comparison goes a step further, and besides achieving comparable results with state-of-the-art methods, it works one order of magnitude faster than them, or even two in some cases.},
  school={UPC, Centre de Formació Interdisciplinària Superior, Departament de Teoria del Senyal i Comunicacions},
  author={Palliser Sans, Rafel},
  year={2019},
  month={May},
}
```