NeurIPS 2022, Spotlight Presentation, [arXiv] [BibTeX]
We propose STCAT, a new one-stage spatio-temporal video grounding method that achieves state-of-the-art performance on the VidSTG and HC-STVG benchmarks. This repository provides the PyTorch implementation for model training and evaluation. For more details, please refer to our paper.
The datasets are placed in the `data` folder with the following structure.
data
|_ vidstg
| |_ videos
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ vstg_annos
| | |_ train.json
| | |_ ...
| |_ sent_annos
| | |_ train_annotations.json
| | |_ ...
| |_ data_cache
| | |_ ...
|_ hc-stvg
| |_ v1_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ hcstvg_v1
| | | |_ train.json
| | | |_ test.json
| |_ data_cache
| | |_ ...
You can prepare this structure with the following steps (a quick layout check is sketched after them):
VidSTG
- Download the videos for VidSTG from VidOR and put them into `data/vidstg/videos`. The original video download URL given by the VidOR dataset provider is broken; you can download the VidSTG videos from this link instead.
- Download the text and temporal annotations from the VidSTG Repo and put them into `data/vidstg/sent_annos`.
- Download the bounding-box annotations from here and put them into `data/vidstg/vstg_annos`.
- For loading efficiency, we provide a dataset cache for VidSTG here. You can download it and put it into `data/vidstg/data_cache`.
HC-STVG
- Download version 1 of the HC-STVG videos and annotations from HC-STVG, then put them into `data/hc-stvg/v1_video` and `data/hc-stvg/annos/hcstvg_v1`, respectively.
- For loading efficiency, we provide a dataset cache for HC-STVG here. You can download it and put it into `data/hc-stvg/data_cache`.
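With both datasets in place, you can sanity-check the layout before training. The snippet below is a minimal sketch (a hypothetical `check_data_layout.py`, not part of this repo) that only verifies the paths from the tree above exist:

```python
# check_data_layout.py -- hypothetical helper, not part of this repo.
# Verifies that the dataset folders/files described in the README exist.
from pathlib import Path

EXPECTED = [
    "data/vidstg/videos",
    "data/vidstg/vstg_annos/train.json",
    "data/vidstg/sent_annos/train_annotations.json",
    "data/vidstg/data_cache",
    "data/hc-stvg/v1_video",
    "data/hc-stvg/annos/hcstvg_v1/train.json",
    "data/hc-stvg/annos/hcstvg_v1/test.json",
    "data/hc-stvg/data_cache",
]

missing = [p for p in EXPECTED if not Path(p).exists()]
if missing:
    print("Missing paths:")
    for p in missing:
        print(f"  {p}")
else:
    print("All expected dataset paths are present.")
```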
The code is tested with PyTorch 1.10.0. Other versions may be compatible as well. You can install the requirements with the following commands:
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
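After installation, a quick check like the following (a hypothetical snippet, not part of this repo) confirms the environment roughly matches the tested versions:

```python
# Quick environment sanity check (hypothetical snippet, not part of this repo).
import torch
import torchvision

print("PyTorch:", torch.__version__)            # tested with 1.10.0
print("torchvision:", torchvision.__version__)  # tested with 0.11.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)    # tested with 11.3
    print("GPU count:", torch.cuda.device_count())
```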
Then, download FFmpeg 4.1.9 and add it to the `PATH` environment variable so the videos can be loaded.
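To confirm FFmpeg is actually discoverable on `PATH`, a check along these lines can help (hypothetical snippet, not part of this repo):

```python
# Check that FFmpeg is on PATH and print its version line.
import shutil
import subprocess

ffmpeg = shutil.which("ffmpeg")
if ffmpeg is None:
    raise SystemExit("ffmpeg was not found on PATH")
print("Found ffmpeg at:", ffmpeg)
out = subprocess.run([ffmpeg, "-version"], capture_output=True, text=True)
print(out.stdout.splitlines()[0])  # e.g. "ffmpeg version 4.1.9 ..."
```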
Our model uses the ResNet-101 pretrained by MDETR as the vision backbone. Please download the pretrained weights from here and put them at `data/pretrained/pretrained_resnet101_checkpoint.pth`.
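A quick way to verify the download is to load the checkpoint on CPU and inspect its top-level keys (a hedged sketch; the exact key layout depends on the checkpoint file itself):

```python
# Sanity-check the pretrained backbone checkpoint (hypothetical snippet,
# not part of this repo).
import torch

ckpt_path = "data/pretrained/pretrained_resnet101_checkpoint.pth"
ckpt = torch.load(ckpt_path, map_location="cpu")
if isinstance(ckpt, dict):
    print("Top-level keys:", list(ckpt.keys()))
else:
    print("Loaded object of type:", type(ckpt))
```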
Note: Use one video per GPU during training and evaluation; more than one video per GPU is untested and may cause bugs.
For training on an 8-GPU node, you can use the following scripts:
# run for VidSTG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
--use-seed \
OUTPUT_DIR data/vidstg/checkpoints/output \
TENSORBOARD_DIR data/vidstg/checkpoints/output/tensorboard \
INPUT.RESOLUTION 448
# run for HC-STVG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/train_net.py \
--config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
--use-seed \
OUTPUT_DIR data/hc-stvg/checkpoints/output \
TENSORBOARD_DIR data/hc-stvg/checkpoints/output/tensorboard \
INPUT.RESOLUTION 448
For more training options (e.g., other hyper-parameters), please modify the configuration files `experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml` and `experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml`.
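If you just want to see what a config file contains before editing it, reading it with plain PyYAML is enough (a minimal sketch, assuming PyYAML is installed; it does not go through the repo's own config loader):

```python
# Dump the top-level sections of a training config (illustration only).
import yaml

with open("experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    # Show nested sections as their key names to keep the output short.
    print(key, "->", list(value.keys()) if isinstance(value, dict) else value)
```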
To evaluate the trained STCAT models, please run the following scripts:
# run for VidSTG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
--use-seed \
MODEL.WEIGHT data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth \
OUTPUT_DIR data/vidstg/checkpoints/output \
INPUT.RESOLUTION 448
# run for HC-STVG
python3 -m torch.distributed.launch \
--nproc_per_node=8 \
scripts/test_net.py \
--config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
--use-seed \
MODEL.WEIGHT data/hc-stvg/checkpoints/stcat_res448/hcstvg_res448.pth \
OUTPUT_DIR data/hc-stvg/checkpoints/output \
INPUT.RESOLUTION 448
We provide our trained checkpoints with the ResNet-101 backbone for reproducibility of the results.
| Dataset | Resolution | URL | Declarative (m_vIoU/vIoU@0.3/vIoU@0.5) | Interrogative (m_vIoU/vIoU@0.3/vIoU@0.5) | Size |
|---|---|---|---|---|---|
| VidSTG | 416 | Model | 32.94/46.07/32.32 | 27.87/38.89/26.07 | 3.1GB |
| VidSTG | 448 | Model | 33.14/46.20/32.58 | 28.22/39.24/26.63 | 3.1GB |
| Dataset | Resolution | URL | m_vIoU/vIoU@0.3/vIoU@0.5 | Size |
|---|---|---|---|---|
| HC-STVG | 416 | Model | 34.93/56.64/31.03 | 3.1GB |
| HC-STVG | 448 | Model | 35.09/57.67/30.09 | 3.1GB |
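For reference, m_vIoU is the mean vIoU over test samples and vIoU@R is the fraction of samples whose vIoU exceeds R. A minimal, self-contained sketch of vIoU for a single sample is shown below (illustration only; the repo's actual metric code is borrowed from TubeDETR, as noted below):

```python
# Illustrative vIoU computation for one sample (not the repo's implementation).
# vIoU = (1 / |S_u|) * sum over frames t in S_i of IoU(pred_box_t, gt_box_t),
# where S_i / S_u are the intersection / union of the predicted and
# ground-truth temporal segments.

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def viou(pred_boxes, gt_boxes):
    """pred_boxes / gt_boxes: dicts mapping frame index -> box."""
    inter_frames = set(pred_boxes) & set(gt_boxes)
    union_frames = set(pred_boxes) | set(gt_boxes)
    if not union_frames:
        return 0.0
    return sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames) / len(union_frames)
```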
This repo is partly based on the open-source releases of MDETR, DAB-DETR and MaskRCNN-Benchmark. The evaluation metric implementation is borrowed from TubeDETR for a fair comparison.
STCAT is released under the MIT license.
Consider giving this repository a star and citing it in your publications if it helps your research.
@article{jin2022embracing,
title={Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding},
author={Jin, Yang and Li, Yongzhi and Yuan, Zehuan and Mu, Yadong},
journal={arXiv preprint arXiv:2209.13306},
year={2022}
}