Authors: Samrudhdhi Rangrej, Kevin Liang, Tal Hassner, James Clark

Accepted to: WACV'23 (Paper)
Figure: An overview of our GliTr. GliTr consists of a frame-level spatial transformer and temporal transformers that predict the ongoing action and the next glimpse location from the glimpses observed so far (forward pass shown).
- numpy==1.19.2
- torch==1.8.1
- torchvision==0.9.1
- wandb==0.12.9
- timm==0.4.9
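The pinned versions above can be installed with pip; a minimal sketch, assuming a fresh Python environment (a CUDA-enabled torch 1.8.1 build may require a different install command for your system):

```bash
# Create and activate a fresh virtual environment (the name is arbitrary)
python -m venv glitr-env
source glitr-env/bin/activate

# Install the pinned dependencies listed above
pip install numpy==1.19.2 torch==1.8.1 torchvision==0.9.1 wandb==0.12.9 timm==0.4.9
```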
Prepare both datasets following the instructions for the Something-Something V2 dataset provided in the TSM repository.
Note: Create and set the following paths in SSv2_Teacher.sh, Jester_Teacher.sh, SSv2_GliTr.sh and Jester_GliTr.sh.
PRETRAINED_DIR="/absolute/path/to/directory/with/pretrained/weights/"
OUTPUT_DIR="/absolute/path/to/output/directory/"
DATA_DIR="/absolute/path/to/data/directory/"
LOG_DIR="/absolute/path/to/log/directory/"
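Assuming these variables are also exported in your shell under the same names, the directories that the scripts write to (and read pretrained weights from) can be created up front; a minimal sketch (DATA_DIR should already exist and contain the prepared datasets):

```bash
# Create the directories used for pretrained weights, outputs, and logs
mkdir -p "$PRETRAINED_DIR" "$OUTPUT_DIR" "$LOG_DIR"
```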
Download and store the following pretrained models in PRETRAINED_DIR:

- ViT-S/16 teacher weights from the ibot repository. Rename the file to `ibot_vits_16_checkpoint_teacher.pth`.
- VideoMAE ViT-B (epoch 2400) finetuning weights for the Something-Something V2 dataset from the VideoMAE repository. Rename the file to `videomae_ssv2_ep2400_vitB.pth`.
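Once the two checkpoints are downloaded, moving and renaming them into PRETRAINED_DIR could look like the sketch below; the source filenames are placeholders for whatever the downloaded files are actually named:

```bash
# Source filenames are hypothetical; adjust them to match your downloads
mv /path/to/downloaded/ibot_teacher_checkpoint.pth            "$PRETRAINED_DIR/ibot_vits_16_checkpoint_teacher.pth"
mv /path/to/downloaded/videomae_ssv2_ep2400_vitB_finetune.pth "$PRETRAINED_DIR/videomae_ssv2_ep2400_vitB.pth"
```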
- Run SSv2_Teacher.sh
- Run Jester_Teacher.sh (set `JESTER_PRETRAINED="/absolute/path/to/learnt/ssv2/teacher/weights/"`)
- Run SSv2_GliTr.sh (set `TEACHER_CHECKPOINT="/absolute/path/to/teacher/weights/"`)
- Run Jester_GliTr.sh (set `TEACHER_CHECKPOINT="/absolute/path/to/teacher/weights/"`); an end-to-end sketch of these four steps follows this list
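Putting the four steps together, a complete run might look like the sketch below; it assumes the scripts are launched with bash from the repository root and that the paths described above have already been edited into each script:

```bash
# 1. Train the teacher on Something-Something V2
bash SSv2_Teacher.sh

# 2. Train the teacher on Jester, initialized from the learnt SSv2 teacher
#    (JESTER_PRETRAINED must point to the SSv2 teacher weights)
bash Jester_Teacher.sh

# 3. Train GliTr on Something-Something V2
#    (TEACHER_CHECKPOINT must point to the corresponding teacher weights)
bash SSv2_GliTr.sh

# 4. Train GliTr on Jester
#    (TEACHER_CHECKPOINT must point to the corresponding teacher weights)
bash Jester_GliTr.sh
```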
Glimpses selected by GliTr on the Something-Something V2 dataset. The complete frames are shown for reference only; GliTr never observes full frames, only glimpses.
Our code is based on: deit, TSM, timm, AR-Net, catalyst, VideoMAE, STAM-Sequential-Transformers-Attention-Model
Please see LICENSE.md for more details.
If you find any part of our paper or this codebase useful, please consider citing our paper:
@inproceedings{rangrej2023glitr,
title={GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction},
author={Rangrej, Samrudhdhi B and Liang, Kevin J and Hassner, Tal and Clark, James J},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={3413--3423},
year={2023}
}