Train M2 image captioning model

Setup

mkdir datasets && cd datasets

# Download COCO caption annotations
gdown --fuzzy "https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing"
unzip annotations.zip
rm annotations.zip

# Download object features
wget https://www.dropbox.com/s/0h67c6ezwnderbd/oscar.hdf5
wget https://www.dropbox.com/s/hjh7shr5zvaz3gj/vinvl.hdf5

# Link cross-modal context
ln -s ../../ctx/outputs/image_features/vis_ctx.hdf5
ln -s ../../ctx/outputs/retrieved_captions/txt_ctx.hdf5
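Before starting a multi-day run, it can help to confirm that the feature files and context links are actually in place. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and it only checks the four HDF5 file names that appear in the commands above.

```python
from pathlib import Path

# File names taken from the setup commands above.
EXPECTED_FILES = ("oscar.hdf5", "vinvl.hdf5", "vis_ctx.hdf5", "txt_ctx.hdf5")


def find_missing(datasets_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or are dangling symlinks.

    Hypothetical helper for illustration; it is not shipped with the codebase.
    """
    root = Path(datasets_dir)
    # Path.exists() follows symlinks, so a link whose target does not exist
    # (for example, context features that were never generated) is reported too.
    return [name for name in expected if not (root / name).exists()]
```

Calling `find_missing("datasets")` from the directory that contains `datasets` should return an empty list once setup is complete. A non-empty result for `vis_ctx.hdf5` or `txt_ctx.hdf5` usually means the link target under `../../ctx/outputs/` has not been produced yet.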

Training

Training runs on a single A40 GPU and takes approximately 4 days.

  • Train M2 + weaker Visual Genome object features + our cross-modal context on GPU #0 (or any available GPU on your machine).

    python train.py --obj_file oscar.hdf5 --devices 0

  • Train M2 + stronger VinVL object features + our cross-modal context on GPU #1 (or any available GPU on your machine).

    python train.py --obj_file vinvl.hdf5 --devices 1
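If two GPUs are free, the two configurations above can be started side by side. The sketch below is one way to do that under stated assumptions: it is run from the directory containing `train.py`, it uses only the two flags shown above (`--obj_file` and `--devices`), and the helper names `build_command` and `launch_all` are hypothetical. Whether the two runs write to separate output locations is not stated here, so check that before running them concurrently.

```python
import subprocess
import sys

# (object feature file, GPU index) pairs taken from the two commands above.
RUNS = [("oscar.hdf5", 0), ("vinvl.hdf5", 1)]


def build_command(obj_file, device):
    """Assemble a train.py invocation using only the two documented flags."""
    return [sys.executable, "train.py", "--obj_file", obj_file, "--devices", str(device)]


def launch_all(runs=RUNS):
    """Start each run as its own process and wait for all of them to finish.

    Returns the list of exit codes, in the same order as `runs`.
    """
    procs = [subprocess.Popen(build_command(obj_file, device)) for obj_file, device in runs]
    return [proc.wait() for proc in procs]
```

Calling `launch_all()` blocks until both runs exit; a non-zero entry in the returned list points to the run that failed.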

Results

Using the weaker Visual Genome object features (oscar.hdf5).

Method          XModal Ctx  B-1   B-4   M     R     C      S
M^2 (paper)     N           80.8  39.1  29.1  58.4  131.2  22.6
M^2 (codebase)  N           80.2  38.4  29.1  58.4  128.7  22.9
Ours            Y           81.5  39.7  30.0  59.5  135.9  23.7

Metrics: B-1 and B-4 are BLEU-1 and BLEU-4, M is METEOR, R is ROUGE, C is CIDEr, and S is SPICE.

Using the stronger VinVL object features (vinvl.hdf5).

Method  XModal Ctx  Object Features  B-1   B-4   M     R     C      S
M^2     N           VG               80.2  38.4  29.1  58.4  128.7  22.9
M^2     N           VinVL            82.7  40.5  29.9  59.9  135.9  23.5
Ours    Y           VG               81.5  39.7  30.0  59.5  135.9  23.7
Ours    Y           VinVL            83.4  41.4  30.4  60.4  139.9  24.0
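To read the effect of the cross-modal context off the table, subtract the M^2 row from the "Ours" row for each feature set. The short sketch below does that arithmetic with the scores copied from the table above; the function name `gains` is hypothetical, and the VG baseline is the codebase reproduction of M^2 rather than the numbers reported in the paper.

```python
METRICS = ("B-1", "B-4", "M", "R", "C", "S")

# Scores copied from the table above: (M^2 baseline, ours) per object feature set.
SCORES = {
    "VG": ((80.2, 38.4, 29.1, 58.4, 128.7, 22.9), (81.5, 39.7, 30.0, 59.5, 135.9, 23.7)),
    "VinVL": ((82.7, 40.5, 29.9, 59.9, 135.9, 23.5), (83.4, 41.4, 30.4, 60.4, 139.9, 24.0)),
}


def gains(features):
    """Return the per-metric improvement of 'Ours' over M^2 for one feature set."""
    baseline, ours = SCORES[features]
    # Round to one decimal to match the precision of the reported scores.
    return {m: round(o - b, 1) for m, b, o in zip(METRICS, baseline, ours)}
```

For example, `gains("VG")["C"]` is 7.2 and `gains("VinVL")["C"]` is 4.0, so the context adds CIDEr points with either set of object features.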

Citations

Please cite our work if you find this repo useful.

@inproceedings{kuo2022pretrained,
    title={Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning},
    author={Chia-Wen Kuo and Zsolt Kira},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2022}
}

This codebase is built upon the official implementation of M2. Consider citing their work as well if you find this repo useful.

@inproceedings{cornia2020m2,
    title={{Meshed-Memory Transformer for Image Captioning}},
    author={Cornia, Marcella and Stefanini, Matteo and Baraldi, Lorenzo and Cucchiara, Rita},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2020}
}