first #66 (Open)

wants to merge 25 commits into base: main

Changes from all commits
121 changes: 22 additions & 99 deletions README.md
@@ -1,105 +1,30 @@
# CLIP prefix captioning.

<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg"></a>
Inference Notebook: <a href="https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=20></a>





## Implementation for the paper ["ClipCap: CLIP Prefix for Image Captioning"](https://arxiv.org/abs/2111.09734)




## Description
Image captioning is a complicated task that usually relies on a pretrained detection network, which requires additional supervision in the form of object annotations. We present a new approach that does not require additional information (i.e., it needs only images and captions) and thus can be applied to any data. In addition, our model trains much faster than similar methods while achieving results comparable to the state of the art, even on the Conceptual Captions dataset, which contains over 3M images.

In our work, we use the [CLIP](https://github.com/openai/CLIP) model, which was already trained on an extremely large number of images and is therefore capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences, we fine-tune a pretrained language model, which has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions, by employing a simple mapping network over the raw encoding and then fine-tuning the language model to generate a valid caption. In addition, we present another variant, which uses a transformer architecture for the mapping network and avoids fine-tuning GPT-2. Even so, this lighter model achieves results comparable to the state of the art on the nocaps dataset.
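
To make the prefix idea concrete, here is a minimal, illustrative sketch of an MLP mapping network (simplified names and dimensions, not the repo's exact classes): it turns a single CLIP embedding into a sequence of prefix embeddings that GPT-2 consumes like ordinary token embeddings.

```
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    """Map a CLIP image embedding to `prefix_length` GPT-2-sized embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        hidden = (clip_dim + gpt_dim * prefix_length) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding):
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_length, self.gpt_dim)

# The prefix is concatenated with the caption's token embeddings and fed to GPT-2,
# which is trained to predict the caption conditioned on the prefix.
```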

## COCO Examples

<table>
<tr>
<td><img src="Images/COCO_val2014_000000562207.jpg" ></td>
<td><img src="Images/COCO_val2014_000000165547.jpg" ></td>
<td><img src="Images/COCO_val2014_000000579664.jpg" ></td>
</tr>
<tr>
<td>A couple of people standing next to an elephant. </td>
<td>A wooden table sitting in front of a window.</td>
<td>A bunch of bananas sitting on top of a table.</td>
</tr>
</table>

<table>
<tr>
<td><img src="Images/COCO_val2014_000000060623.jpg" ></td>
<td><img src="Images/COCO_val2014_000000386164.jpg" ></td>
<td><img src="Images/COCO_val2014_000000354533.jpg" ></td>
</tr>
<tr>
<td>A woman holding a plate with a piece of cake in front of her face. </td>
<td>A wooden table topped with lots of wooden utensils.</td>
<td>A red motorcycle parked on top of a dirt field.</td>
</tr>
</table>


## Conceptual Captions Examples

<table>
<tr>
<td><img src="Images/CONCEPTUAL_01.jpg" ></td>
<td><img src="Images/CONCEPTUAL_02.jpg" ></td>
<td><img src="Images/CONCEPTUAL_03.jpg" ></td>
</tr>
<tr>
<td>3D render of a man holding a globe.</td>
<td>Students enjoing the cherry blossoms</td>
<td>Green leaf of lettuce on a white plate.</td>
</tr>
</table>

<table>
<tr>
<td><img src="Images/CONCEPTUAL_04.jpg" ></td>
<td><img src="Images/CONCEPTUAL_05.jpg" ></td>
<td><img src="Images/CONCEPTUAL_06.jpg" ></td>
</tr>
<tr>
<td>The hotel and casino on the waterfront. </td>
<td>The triangle is a symbol of the soul.</td>
<td>Cartoon boy in the bath.</td>
</tr>
</table>


## Inference Notebooks
To help visualize the results, we provide a Colab notebook, found in `notebooks/clip_prefix_captioning_inference.ipynb`.
The notebook downloads the pretrained models and runs inference on sample images or on images of your choosing. It is recommended to run it in [Google Colab](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing).
An inference notebook for the **transformer mapping network (without fine-tuning GPT-2)** is available [here](https://colab.research.google.com/drive/180L3rMFmGujudwO1EJNF-lHIpAsAZ5xq?usp=sharing) for the COCO model (also in `notebooks/transformer_inference.ipynb`).



Both [COCO](https://drive.google.com/file/d/1IdaBtMSvtyzF0ByVaBHtvM0JYSXRExRX/view?usp=sharing) and [Conceptual Captions](https://drive.google.com/file/d/14pXWwB4Zm82rsDdvbGguLfx9F8aM7ovT/view?usp=sharing) pretrained models are available for the MLP mapping network. For the transformer mapping network (without fine-tuning GPT-2) we provide a [COCO](https://drive.google.com/file/d/1GYPToCqFREwi285wPLhuVExlz7DDUDfJ/view?usp=sharing) pretrained model.
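
For reference, the CLIP embedding that these models consume can be computed as follows (a sketch using the `clip` package and a hypothetical `example.jpg`; the notebook additionally loads the pretrained mapping network and decodes the caption with GPT-2):

```
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    prefix = clip_model.encode_image(image).float()  # (1, 512) image embedding
# `prefix` is then mapped to prefix embeddings and prepended to GPT-2 to generate the caption.
```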



## Inference GUI
1. Run it [in the browser](https://replicate.ai/rmokady/clip_prefix_caption) using the replicate.ai UI.
2. Integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See the demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/CLIP_prefix_captioning) (beam search is currently not supported).

## References
- [ClipCap: CLIP Prefix for Image Captioning](https://arxiv.org/abs/2111.09734)
- [Original ClipCap GitHub](https://github.com/rmokady/CLIP_prefix_caption.git): CLIP_prefix_caption

Code references:
- [transformers (OPT) GitHub](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py)
- [BLIP2](https://github.com/salesforce/BLIP.git)


## Training prerequisites

[comment]: <> (Dependencies can be found at the [Inference notebook]&#40;https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing&#41; )
Clone, create environment and install dependencies:
```
git clone https://github.com/rmokady/CLIP_prefix_caption && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption
pip install -e "git+https://github.com/replicate/[email protected]#egg=cog&subdirectory=python/"
pip install transformers --upgrade
```

## COCO training
@@ -117,35 +42,33 @@
Train with fine-tuning of GPT2:
```
python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/
```

__If you want to train the model with OPT, please go directly to "Switch your language model from GPT-2 to OPT" below.__
Train only transformer mapping network:
```
python train.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layres 8 --prefix_length 40 --prefix_length_clip 40
```
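
In this mode the GPT-2 weights stay frozen and only the mapping network is optimized; roughly, the setup looks like the sketch below (a stand-in linear mapper and a learning rate that may differ from the repo's defaults):

```
import torch
from transformers import GPT2LMHeadModel

gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
mapping_network = torch.nn.Linear(512, 40 * gpt2_model.config.n_embd)  # stand-in for the transformer mapper

# Freeze the language model; only the mapping network receives gradients.
for param in gpt2_model.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(mapping_network.parameters(), lr=2e-5)
```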

**If you wish to use ResNet-based CLIP:**
```
python parse_coco.py --clip_model_type RN50x4
python train.py --only_prefix --data ./data/coco/oscar_split_RN50x4_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layres 8 --prefix_length 40 --prefix_length_clip 40 --is_rn
```

## Switch your language model from GPT-2 to OPT
We enable training your ClipCap model with OPT. We look forward to making this code work well with the [BLIP model](https://github.com/salesforce/BLIP.git).
Training code is available in `train_OPT.py`; inference code will be updated in `predict_OPT.py`, which essentially runs the Predictor function from `predict.py`.
Please note that you have to manually make sure your desired language model is 'facebook/opt-125m' (the variable named `OPT_MODEL`) in both `predict.py` and `train.py`.

```
python train_OPT.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir /data/daisy/clipcap_output/coco_train/ --only_prefix --device
python predict_nice.py
```
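
For context, switching the backbone mostly amounts to loading an OPT checkpoint and its tokenizer from `transformers` instead of GPT-2. A minimal sketch (the variable name `OPT_MODEL` mirrors the one mentioned above; the exact wiring in `train_OPT.py` may differ):

```
from transformers import AutoTokenizer, OPTForCausalLM

OPT_MODEL = 'facebook/opt-125m'  # must match in both train_OPT.py and predict_OPT.py

tokenizer = AutoTokenizer.from_pretrained(OPT_MODEL)
language_model = OPTForCausalLM.from_pretrained(OPT_MODEL)

# The mapping network must project CLIP features to the language model's embedding width
# (language_model.config.hidden_size, e.g. 768 for opt-125m), just as it targets 768 for GPT-2.
```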

### Model parallelization
- OPT-1.3b: 2 GPUs, 16 GB per GPU, 1h13m per epoch
- OPT-2.7b: 3 GPUs, 18 GB per GPU, 11h per epoch

## Conceptual training

Download the .TSV train/val files from [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/download) and place them under the <data_root> directory.

Download the images and extract the CLIP features with the following command (outputs are `<data_root>/conceptual_clip_ViT-B_32_train.pkl` and `<data_root>/conceptual_clip_ViT-B_32_val.pkl`):
```
python parse_conceptual.py --clip_model_type ViT-B/32 --data_root <data_root> --num_threads 16
```
Note that downloading the images might take a few days.

Train with fine-tuning of GPT2:
```
python train.py --data <data_root>/conceptual_clip_ViT-B_32_train.pkl --out_dir ./conceptual_train/
```
As with the COCO training, you can train a transformer mapping network and/or parse the images using a ResNet-based CLIP.

*Latest update: 2023-04-04*

## Citation
If you use this code for your research, please cite:
42 changes: 42 additions & 0 deletions evaluate.py
@@ -0,0 +1,42 @@
import json

from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.eval import COCOEvalCap
from pycocotools.coco import COCO


# path to your results file (assumed format: a JSON dict mapping image_id -> [ground-truth caption, generated caption])
file_path = 'YOUR_FILE_PATH.json'  # placeholder: set this to your results .json file
# split your file into GroundTruth & Prediction files in COCO annotation format
gt_file_name = './clipcap_opt27_gt.json'
gr_file_name = './clipcap_opt27_gr.json'

gt = {}
gr = {}
with open(file_path, 'r') as f:
    json_data = json.load(f)

gt["annotations"] = []
gt["images"] = []
gr["annotations"] = []
gr["images"] = []
for key, value in json_data.items():
    # one annotation entry per image: value[0] is the ground-truth caption, value[1] the generated caption
    gt["annotations"].append({"image_id": key, "caption": value[0], "id": key})
    gr["annotations"].append({"image_id": key, "caption": value[1], "id": key})
    image_entry = {"id": key}
    gt["images"].append(image_entry)
    gr["images"].append(image_entry)

with open(gt_file_name, 'w') as f_gt, open(gr_file_name, 'w') as f_gr:
    json.dump(gt, f_gt)
    json.dump(gr, f_gr)


# evaluate BLEU-1..4, METEOR, ROUGE-L, CIDEr and SPICE scores
coco_gt = COCO(gt_file_name)
coco_pred = COCO(gr_file_name)
coco_eval = COCOEvalCap(coco_gt, coco_pred)

coco_eval.evaluate()
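
# After evaluate() finishes, the individual metrics are available in coco_eval.eval
# (pycocoevalcap key names: 'Bleu_1'..'Bleu_4', 'METEOR', 'ROUGE_L', 'CIDEr', 'SPICE').
for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.4f}')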
1 change: 1 addition & 0 deletions for_inference/nice_gt.json

Large diffs are not rendered by default.
