
Train M² image captioning model

Setup

mkdir datasets && cd datasets

# Download COCO caption annotations
gdown --fuzzy "https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing"
unzip annotations.zip
rm annotations.zip

# Download object features
wget https://www.dropbox.com/s/0h67c6ezwnderbd/oscar.hdf5
wget https://www.dropbox.com/s/hjh7shr5zvaz3gj/vinvl.hdf5

# Link cross-modal context
ln -s ../../ctx/outputs/image_features/vis_ctx.hdf5
ln -s ../../ctx/outputs/retrieved_captions/txt_ctx.hdf5
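
Before launching a multi-day run, it can help to confirm that the feature files and context links from the commands above are in place. The following is a minimal sketch, assuming it is run from the directory that contains `datasets/`. The helper name `missing_dataset_files` is hypothetical and not part of this codebase, and the unzipped annotations are not checked because their file names are not listed here.

```python
from pathlib import Path

# File names taken from the setup commands above.
EXPECTED_FILES = ["oscar.hdf5", "vinvl.hdf5", "vis_ctx.hdf5", "txt_ctx.hdf5"]


def missing_dataset_files(root="datasets"):
    """Return the expected files that are absent or are dangling symlinks.

    Hypothetical helper for illustration; not part of this repo.
    """
    root = Path(root)
    # Path.exists() follows symlinks, so a broken link counts as missing.
    return [name for name in EXPECTED_FILES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing or broken:", ", ".join(missing))
    else:
        print("All expected dataset files found.")
```

A broken `vis_ctx.hdf5` or `txt_ctx.hdf5` link usually means the cross-modal context outputs under `../../ctx/outputs/` have not been generated yet.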

Training

Training runs on a single A40 GPU and takes approximately 4 days.

Train M² + weaker Visual Genome object features + our cross-modal context on GPU #0 (or any available GPU on your machine).
```
python train.py --obj_file oscar.hdf5 --devices 0
```
Train M² + stronger VinVL object features + our cross-modal context on GPU #1 (or any available GPU on your machine).
```
python train.py --obj_file vinvl.hdf5 --devices 1
```

Results

Using the weaker Visual Genome object features (oscar.hdf5).

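| Method | XModal Ctx | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|
| M² (paper) | N | 80.8 | 39.1 | 29.1 | 58.4 | 131.2 | 22.6 |
| M² (codebase) | N | 80.2 | 38.4 | 29.1 | 58.4 | 128.7 | 22.9 |
| Ours | Y | 81.5 | 39.7 | 30.0 | 59.5 | 135.9 | 23.7 |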

Using the stronger VinVL object features (vinvl.hdf5).

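| Method | XModal Ctx | Object Features | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|
| M² | N | VG | 80.2 | 38.4 | 29.1 | 58.4 | 128.7 | 22.9 |
| M² | N | VinVL | 82.7 | 40.5 | 29.9 | 59.9 | 135.9 | 23.5 |
| Ours | Y | VG | 81.5 | 39.7 | 30.0 | 59.5 | 135.9 | 23.7 |
| Ours | Y | VinVL | 83.4 | 41.4 | 30.4 | 60.4 | 139.9 | 24.0 |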
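
To read the gain from the cross-modal context directly, the sketch below subtracts the baseline from our scores for each object feature type. It uses only the C (CIDEr) and S (SPICE) columns copied from the tables above, and takes the M² codebase reproduction as the VG baseline. The function name `context_gain` is an illustrative assumption.

```python
# (CIDEr, SPICE) scores copied from the results tables above.
# The M^2 / VG entry is the "codebase" reproduction row.
scores = {
    ("M^2", "VG"): (128.7, 22.9),
    ("Ours", "VG"): (135.9, 23.7),
    ("M^2", "VinVL"): (135.9, 23.5),
    ("Ours", "VinVL"): (139.9, 24.0),
}


def context_gain(features):
    """Return (CIDEr gain, SPICE gain) of Ours over M^2 for one feature type."""
    base_c, base_s = scores[("M^2", features)]
    ours_c, ours_s = scores[("Ours", features)]
    # Round to one decimal to match the precision reported in the tables.
    return round(ours_c - base_c, 1), round(ours_s - base_s, 1)


for features in ("VG", "VinVL"):
    cider, spice = context_gain(features)
    print(f"{features}: CIDEr +{cider}, SPICE +{spice}")
```

With these numbers the context adds 7.2 CIDEr on VG features and 4.0 CIDEr on VinVL features.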

Citations

Please cite our work if you find this repo useful.

@inproceedings{kuo2022pretrained,
    title={Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning},
    author={Chia-Wen Kuo and Zsolt Kira},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2022}
}

This codebase is built upon the official implementation of M². Consider citing their work if you find this repo useful.

@inproceedings{cornia2020m2,
    title={{Meshed-Memory Transformer for Image Captioning}},
    author={Cornia, Marcella and Stefanini, Matteo and Baraldi, Lorenzo and Cucchiara, Rita},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2020}
}
