This repository contains the reference code for the paper "Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches (MICCAI 2022)"
If you find this repo useful, please cite our paper.
- Python 3
- PyTorch 1.3+ (along with torchvision)
- cider (already added as a submodule)
- coco-caption (already added as a submodule; remember to follow the initialization steps in coco-caption/README.md)
- yacs
- lmdbdict
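To quickly verify that your environment meets the PyTorch requirement, a minimal check like the following can help (a generic sketch, not specific to this repo):

```python
# Minimal environment check: confirm PyTorch 1.3+ and that torchvision imports.
import torch
import torchvision

major, minor = (int(v) for v in torch.__version__.split('.')[:2])
assert (major, minor) >= (1, 3), f"PyTorch 1.3+ required, found {torch.__version__}"
print(f"torch {torch.__version__}, torchvision {torchvision.__version__}")
```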
If you have difficulty running the training scripts in tools/, you can try installing this repo as a Python package:
$ python -m pip install -e .
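After installing, the repo's package should be importable from anywhere. A quick sanity check (assuming the package directory is `captioning`, as referenced in the model section below; adjust if your checkout differs):

```python
# Sanity check after `pip install -e .` (the package name `captioning` is
# inferred from the repo layout described below).
import captioning
import captioning.models
print(captioning.__file__)
```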
- DAISI Dataset
Since we are not allowed to release the dataset, please request access from the DAISI dataset creators (The AI-Medic: An Artificial Intelligent Mentor for Trauma Surgery). Note that we use the cleaned DAISI dataset from the following work: Surgical Instruction Generation with Transformers.
- EndoVis18 Dataset
Please download the images from endovissub2018-roboticscenesegmentation and the caption annotations from CIDACaptioning.
Please follow ImageCaptioning/data/README to preprocess the data.
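If the preprocessing step produces LMDB feature files (lmdbdict is listed in the requirements), a quick sanity check could look like the sketch below; the path is a placeholder, not an actual filename from this repo:

```python
# Sketch: inspect a preprocessed LMDB with lmdbdict. The path below is a
# placeholder; use whatever file the preprocessing step actually writes.
from lmdbdict import lmdbdict

d = lmdbdict('data/daisi_att.lmdb', mode='r')
keys = list(d.keys())
print(f'{len(keys)} entries; first key: {keys[0]!r}')
```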
Our code is built on top of ImageCaptioning. We add our models (Swin_TranCAP, SwinMLP_TranCAP, Video_Swin_TranCAP, and Video_SwinMLP_TranCAP) to their captioning/models/ and add the related dataloader files.
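For readers unfamiliar with the window-based MLP idea behind these models, here is a minimal, self-contained sketch (not the authors' exact implementation; names and sizes are illustrative): the feature map is partitioned into non-overlapping windows, and the tokens inside each window are mixed by a shared spatial MLP.

```python
# Illustrative sketch of window-based MLP token mixing (not the repo's code).
import torch
import torch.nn as nn

class WindowMLPMixing(nn.Module):
    def __init__(self, window_size):
        super().__init__()
        self.ws = window_size
        # One linear layer that mixes the ws*ws tokens inside each window,
        # applied identically to every window and every channel.
        self.spatial_mlp = nn.Linear(window_size * window_size,
                                     window_size * window_size)

    def forward(self, x):  # x: (B, H, W, C), with H and W divisible by ws
        B, H, W, C = x.shape
        ws = self.ws
        # Partition into non-overlapping ws x ws windows: (num_windows*B, ws*ws, C)
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Mix tokens within each window: apply the MLP over the token axis.
        x = self.spatial_mlp(x.transpose(1, 2)).transpose(1, 2)
        # Reverse the window partition back to (B, H, W, C).
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return x

feats = torch.randn(2, 8, 8, 96)                     # (batch, height, width, channels)
print(WindowMLPMixing(window_size=4)(feats).shape)   # torch.Size([2, 8, 8, 96])
```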
Our training config files can be found in the configs folder.
Please run
$ python tools/train_vision_transformer.py --cfg configs/daisi/transformer/SwinMLP_TranCAP_L.yml --id daisi_SwinMLP_TranCAP
Similarly, you can run other models by using our provided config files.
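Since yacs is a listed requirement, a provided config can presumably be loaded and inspected before launching training, as in this sketch (I'm assuming the YAML files are plain yacs configs; the keys inside each file depend on the config):

```python
# Sketch: load and print a provided training config with yacs.
from yacs.config import CfgNode as CN

with open('configs/daisi/transformer/SwinMLP_TranCAP_L.yml') as f:
    cfg = CN.load_cfg(f)
print(cfg)
```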
We thank the following repos for providing helpful components/functions used in our work: neuraltalk2, ImageCaptioning.