Implementation for the paper "ClipCap: CLIP Prefix for Image Captioning".

Code references:
- ClipCap: CLIP Prefix for Image Captioning (paper)
- Original ClipCap GitHub: CLIP_prefix_caption
Clone the repository, create the environment, and install dependencies:
conda env create -f environment.yml
conda activate clip_prefix_caption
pip install -e "git+https://github.com/replicate/[email protected]#egg=cog&subdirectory=python/"
pip install transformers --upgrade
Download train_captions to data/coco/annotations.
Download the training and validation images and unzip them (we use the Karpathy et al. split).
Extract CLIP features (output is data/coco/oscar_split_ViT-B_32_train.pkl):
python parse_coco.py --clip_model_type ViT-B/32
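For reference, this step roughly amounts to encoding each training image with CLIP and pickling the embeddings together with their captions. The sketch below is a simplified illustration of that idea; the sample data and the exact pickle layout are placeholders, not the actual parse_coco.py:

```python
# Simplified sketch of CLIP feature extraction (illustrative; see parse_coco.py for the real script).
import pickle
import clip  # from https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

samples = [("data/coco/train2014/example.jpg", "a caption")]  # placeholder (image, caption) pairs
all_embeddings, all_captions = [], []
for image_path, caption in samples:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image).cpu()  # one CLIP ViT-B/32 feature per image
    all_embeddings.append(embedding)
    all_captions.append({"caption": caption, "clip_embedding": len(all_embeddings) - 1})

with open("data/coco/oscar_split_ViT-B_32_train.pkl", "wb") as f:
    pickle.dump({"clip_embedding": torch.cat(all_embeddings, dim=0), "captions": all_captions}, f)
```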
Train with fine-tuning of GPT-2:
python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/
If you want to train the model with OPT, see the section "Switch your language model from GPT-2 to OPT" below.
Train only transformer mapping network:
python train.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40
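In both training modes the core idea is the same: a mapping network projects the CLIP image embedding into prefix_length token embeddings, which are prepended to the caption embeddings fed into GPT-2. The sketch below illustrates that idea with a small MLP mapper; the class name and dimensions are illustrative, not the repo's exact code:

```python
# Minimal sketch of the ClipCap prefix idea (illustrative; not the repo's model code).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class MLPMapper(nn.Module):
    """Maps a CLIP embedding to `prefix_length` GPT-2 token embeddings."""
    def __init__(self, clip_dim=512, prefix_length=10, gpt_dim=768):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding):
        out = self.mlp(clip_embedding)
        return out.view(-1, self.prefix_length, self.gpt_dim)

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = MLPMapper()
clip_embedding = torch.randn(1, 512)           # placeholder CLIP feature
prefix = mapper(clip_embedding)                # (1, prefix_length, 768)
caption_ids = torch.tensor([[50256]])          # placeholder caption tokens
caption_embeds = gpt2.transformer.wte(caption_ids)
inputs_embeds = torch.cat((prefix, caption_embeds), dim=1)
outputs = gpt2(inputs_embeds=inputs_embeds)    # logits over the prefixed sequence
```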
Switch your language model from GPT-2 to OPT
We have enabled training your ClipCap model with OPT. We also look forward to making this code work well with the BLIP model.
Training code is available in train_OPT.py, and inference code will be updated in predict_OPT.py, which basically runs the Predictor from predict.py. Please note that you have to manually make sure your desired language model is 'facebook/opt-125m' (the variable named OPT_MODEL) in both predict.py and train.py.
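As a rough sketch of what the swap involves, the snippet below loads OPT through Hugging Face transformers and prepends a mapped prefix to the caption embeddings, just as with GPT-2. It is illustrative only; check train_OPT.py and predict_OPT.py for the actual implementation.

```python
# Minimal sketch of the GPT-2 -> OPT swap (illustrative; not the repo's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

OPT_MODEL = "facebook/opt-125m"  # the variable the README asks you to set in both scripts

tokenizer = AutoTokenizer.from_pretrained(OPT_MODEL)
language_model = AutoModelForCausalLM.from_pretrained(OPT_MODEL)

# The mapped CLIP prefix is prepended to the caption embeddings exactly as with GPT-2;
# only the hidden size (768 for opt-125m) and the embedding lookup differ.
caption_ids = tokenizer("a photo of", return_tensors="pt").input_ids
caption_embeds = language_model.get_input_embeddings()(caption_ids)
prefix = torch.randn(1, 10, caption_embeds.shape[-1])  # placeholder mapped prefix
outputs = language_model(inputs_embeds=torch.cat((prefix, caption_embeds), dim=1))
```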
python train_OPT.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir /data/daisy/clipcap_output/coco_train/ --only_prefix --device
python predict_nice.py
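For context, inference essentially decodes a caption autoregressively from the mapped prefix embeddings. A minimal greedy-decoding sketch (illustrative, not the repo's Predictor) looks like this:

```python
# Minimal sketch of greedy caption generation from a prefix (illustrative;
# the Predictor in predict.py implements the actual decoding).
import torch

@torch.no_grad()
def generate_greedy(language_model, tokenizer, prefix_embeds, max_tokens=30):
    """Autoregressively extend the prefix embeddings one token at a time."""
    generated = prefix_embeds                      # (1, prefix_length, hidden_dim)
    token_ids = []
    for _ in range(max_tokens):
        logits = language_model(inputs_embeds=generated).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_token.item())
        next_embed = language_model.get_input_embeddings()(next_token)
        generated = torch.cat((generated, next_embed), dim=1)
    return tokenizer.decode(token_ids)
```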
Approximate training resources:
- OPT-1.3b: 2 GPUs, 16 GB per GPU, 1h13m per epoch
- OPT-2.7b: 3 GPUs, 18 GB per GPU, 11h per epoch
Latest update: 2023-04-04
If you use this code for your research, please cite:
@article{mokady2021clipcap,
title={ClipCap: CLIP Prefix for Image Captioning},
author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
journal={arXiv preprint arXiv:2111.09734},
year={2021}
}
This repository is heavily based on the CLIP and Hugging Face repositories. For training we used data from the COCO dataset and Conceptual Captions.
For any inquiry please contact us at our email addresses: [email protected] or [email protected].