Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation (Published in IEEE Transactions on Multimedia 2023)
Figure 1. Overall architecture of CrossVLT
This repo is the implementation of "Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation" and is organized as follows:
- ./train.py is implemented to train the model.
- ./test.py is implemented to evaluate the model.
- ./refer contains the data pre-processing manual and code.
- ./data/dataset_refer_bert.py is where the dataset class is defined.
- ./lib contains the code implementing the vision encoder and the segmentation decoder.
- ./bert contains code migrated from Hugging Face, which implements the BERT model; we modified some of it to implement our stage-divided language encoder.
- ./CrossVLT.py implements the main network, which consists of the stage-divided vision and language encoders and a simple segmentation decoder.
- ./utils.py defines functions that track training statistics and the setup functions for DistributedDataParallel.
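For orientation, below is a minimal, illustrative PyTorch sketch of the stage-divided, cross-aware fusion flow described above: at every stage, each encoder attends to the other modality's features before moving to the next stage, and the fused vision features are passed to a segmentation decoder. Every class name, dimension, and module here is a placeholder chosen for exposition, not the actual code in ./lib, ./bert, or ./CrossVLT.py.

```python
# Illustrative sketch only: toy stand-ins for the stage-divided encoders, not
# the repo's actual classes. The real model uses a Swin Transformer vision
# encoder and a BERT-based language encoder.
import torch
import torch.nn as nn


class ToyStage(nn.Module):
    """One encoder stage: intra-modality self-attention followed by
    cross-attention that injects features from the other modality."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads)
        self.cross_attn = nn.MultiheadAttention(dim, heads)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, other):
        # Tensors are (seq_len, batch, dim), MultiheadAttention's default layout.
        x = x + self.self_attn(x, x, x)[0]           # intra-modality encoding
        x = x + self.cross_attn(x, other, other)[0]  # fuse the other modality's features
        return self.norm(x)


class ToyCrossVLT(nn.Module):
    """Stage-divided vision and language encoders: at every stage each encoder
    attends to the other modality's features, and the fused vision features go
    to a stand-in segmentation decoder."""

    def __init__(self, dim=96, num_stages=4):
        super().__init__()
        self.vis_stages = nn.ModuleList(ToyStage(dim) for _ in range(num_stages))
        self.lang_stages = nn.ModuleList(ToyStage(dim) for _ in range(num_stages))
        self.decoder = nn.Linear(dim, 1)  # stand-in for the real segmentation decoder

    def forward(self, vis_tokens, lang_tokens):
        for v_stage, l_stage in zip(self.vis_stages, self.lang_stages):
            vis_next = v_stage(vis_tokens, lang_tokens)
            lang_tokens = l_stage(lang_tokens, vis_tokens)
            vis_tokens = vis_next
        return self.decoder(vis_tokens)  # per-patch mask logits


if __name__ == "__main__":
    vis = torch.randn(30 * 30, 2, 96)      # flattened image patch features
    lang = torch.randn(20, 2, 96)          # word features
    print(ToyCrossVLT()(vis, lang).shape)  # torch.Size([900, 2, 1])
```

The toy modules above only show how each stage exchanges features between the two streams; the actual stage structure, dimensions, and decoder live in ./lib, ./bert, and ./CrossVLT.py.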
This repo requires PyTorch 1.7.1 and Python 3.8. Install PyTorch 1.7.1 with a CUDA version that works on your cluster; we used CUDA 11.0:
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
Then, install the packages in requirements.txt via pip:
pip3 install -r requirements.txt
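A quick, optional sanity check (not part of the repo) to confirm that the installed versions match the ones above:

```python
# Optional sanity check (not part of the repo): confirm that the installed
# versions roughly match the ones this README was written against.
import torch
import torchvision

print(torch.__version__)        # expected: 1.7.1 (possibly with a +cu110 suffix)
print(torchvision.__version__)  # expected: 0.8.2
print(torch.version.cuda, torch.cuda.is_available())  # e.g. 11.0 True
```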
Create the ./pretrained directory:
mkdir ./pretrained
Download the ImageNet-pretrained weights of the Swin Transformer and the official BERT weights (PyTorch, uncased version) into the ./pretrained folder.
Create the ./checkpoints directory:
mkdir ./checkpoints
Download the CrossVLT model weights into the ./checkpoints folder.
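Optionally, a downloaded checkpoint can be loaded on CPU to confirm it is intact before training or evaluation; the structure of the loaded object depends on how the checkpoint was saved, so the snippet below only inspects whatever is inside:

```python
# Optional check (not part of the repo): verify that a downloaded checkpoint
# loads cleanly. The top-level structure depends on how it was saved.
import torch

ckpt = torch.load("./checkpoints/best_refcoco.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```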
Follow README.md in the ./refer directory to set up subdirectories and download annotations.
Download 2014 Train images [83K/13GB] from COCO, and extract the downloaded train_2014.zip file to ./refer/data/images/mscoco/images.
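After extraction, the image folder can be sanity-checked with a short script; this assumes the archive unpacks into a train2014/ subfolder, which is the usual layout of the COCO zip:

```python
# Optional check (not part of the repo): the extracted COCO folder should hold
# roughly 83K training images. Assumes the zip unpacks into a train2014/ subfolder.
import os

img_dir = "./refer/data/images/mscoco/images/train2014"
n = sum(1 for f in os.listdir(img_dir) if f.lower().endswith(".jpg"))
print(f"{n} images found in {img_dir}")
```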
We use DistributedDataParallel from PyTorch (a hedged sketch of the typical setup it requires appears after the argument descriptions below). CrossVLT was trained on two 24 GB RTX 3090 cards.
To run on multiple GPUs (2 GPUs in this example) on a single node:
mkdir ./models
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 12345 train.py --dataset refcoco --swin_type base --lr 0.00003 --epochs 40 --img_size 480 2>&1 | tee ./models/refcoco
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 12345 train.py --dataset refcoco+ --swin_type base --lr 0.00003 --epochs 40 --img_size 480 2>&1 | tee ./models/refcoco+
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 12345 train.py --dataset refcocog --splitBy umd --swin_type base --lr 0.00003 --epochs 40 --img_size 480 2>&1 | tee ./models/refcocog
To store the training logs, we need to manually create the ./models directory via mkdir before running train.py.
- --dataset is the dataset name. One can choose from refcoco, refcoco+, and refcocog.
- --splitBy needs to be specified if and only if the dataset is G-Ref (which is also called RefCOCOg).
- --swin_type specifies the version of the Swin Transformer. One can choose from tiny, small, base, and large. The default is base.
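For reference, this is a hedged sketch of the DistributedDataParallel setup that torch.distributed.launch expects from a training script (the launcher passes a --local_rank argument to each process and sets the rendezvous environment variables); it is not the actual code in utils.py or train.py:

```python
# Hedged sketch (not the actual utils.py / train.py code) of DDP setup under
# torch.distributed.launch, which passes --local_rank to each process and sets
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE in the environment.
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args, _ = parser.parse_known_args()

torch.cuda.set_device(args.local_rank)                  # one GPU per process
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(8, 8).cuda()                    # stand-in for the CrossVLT network
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
```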
To evaluate, run one of:
CUDA_VISIBLE_DEVICES=0 python3 test.py --swin_type base --dataset refcoco --split val --resume ./checkpoints/best_refcoco.pth --workers 1 --img_size 480
CUDA_VISIBLE_DEVICES=0 python3 test.py --swin_type base --dataset refcoco+ --split val --resume ./checkpoints/best_refcoco+.pth --workers 1 --img_size 480
CUDA_VISIBLE_DEVICES=0 python3 test.py --swin_type base --dataset refcocog --splitBy umd --split val --resume ./checkpoints/best_refcocog.pth --workers 1 --img_size 480
- --split is the subset to evaluate. One can choose from val, testA, and testB for RefCOCO/RefCOCO+, and from val and test for G-Ref (RefCOCOg).
- --resume is the path to the weights of a trained CrossVLT.
The complete evaluation results of the proposed model are summarized as follows:
Dataset | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] | Mean IoU | Overall IoU |
---|---|---|---|---|---|---|---|
RefCOCO val | 85.82 | 82.80 | 77.49 | 67.21 | 35.97 | 75.48 | 73.44 |
RefCOCO test A | 88.92 | 86.25 | 81.62 | 69.79 | 36.68 | 77.54 | 76.16 |
RefCOCO test B | 81.35 | 77.74 | 72.27 | 62.63 | 37.64 | 72.69 | 70.15 |
RefCOCO+ val | 76.41 | 73.09 | 68.49 | 59.09 | 31.39 | 67.27 | 63.60 |
RefCOCO+ test A | 82.43 | 79.57 | 74.80 | 64.65 | 32.89 | 72.00 | 69.10 |
RefCOCO+ test B | 67.19 | 63.31 | 58.89 | 49.93 | 29.15 | 60.09 | 55.23 |
G-Ref val | 74.75 | 70.45 | 64.58 | 54.21 | 28.57 | 66.21 | 62.68 |
G-Ref test | 71.54 | 66.38 | 59.00 | 48.21 | 23.10 | 62.09 | 63.75 |
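As a reminder of what the columns mean: [email protected] is the fraction of test samples whose predicted mask reaches an IoU of at least X with the ground truth, Mean IoU averages the per-sample IoU, and Overall IoU divides the cumulative intersection by the cumulative union over the whole split. A minimal sketch of these standard metrics (not the code in test.py):

```python
# Minimal sketch of the standard referring-segmentation metrics (not the code
# in test.py): Precision@X, mean IoU, and overall (cumulative) IoU.
import numpy as np

def evaluate_masks(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """preds, gts: lists of boolean masks, one pair per test sample."""
    ious, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 0.0)
        inter_sum += inter
        union_sum += union
    ious = np.asarray(ious)
    precision = {f"P@{t}": float((ious >= t).mean()) for t in thresholds}
    return precision, float(ious.mean()), float(inter_sum / union_sum)

# Toy usage with random masks:
rng = np.random.default_rng(0)
preds = [rng.random((480, 480)) > 0.5 for _ in range(4)]
gts = [rng.random((480, 480)) > 0.5 for _ in range(4)]
print(evaluate_masks(preds, gts))
```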
Comparison with ablated models (without both components (basic model) / without cross-aware early fusion / without feature-based alignment):
Comparison with previous state-of-the-art models (late-fusion model (VLT) / vision-only early-fusion model (LAVT)):