first #66 (Open)

wants to merge 25 commits into base: main

Changes from all commits
121 changes: 22 additions & 99 deletions README.md
@@ -1,105 +1,30 @@
# CLIP prefix captioning.

<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg"></a>
Inference Notebook: <a href="https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=20></a>





## Implementation for the paper ["ClipCap: CLIP Prefix for Image Captioning"](https://arxiv.org/abs/2111.09734)




## Description
Image captioning is a complicated task that usually relies on a pretrained detection network, which requires additional supervision in the form of object annotations. We present a new approach that does not require additional information (i.e., it needs only images and captions) and thus can be applied to any data. In addition, our model trains much faster than similar methods while achieving results comparable to the state of the art, even on the Conceptual Captions dataset, which contains over 3M images.

In our work, we use the [CLIP](https://github.com/openai/CLIP) model, which was already trained on an extremely large number of images and is therefore capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences, we fine-tune a pretrained language model, which has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions, by employing a simple mapping network over the raw encoding and then fine-tuning the language model to generate a valid caption. In addition, we present another variant, which uses a transformer architecture for the mapping network and avoids fine-tuning GPT-2. Even so, this lighter model achieves results comparable to the state of the art on the nocaps dataset.
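
To make the prefix idea concrete, here is a minimal, illustrative sketch of an MLP mapping network (simplified names and dimensions, not the repo's exact classes): it turns a single CLIP embedding into a sequence of prefix embeddings that GPT-2 consumes like ordinary token embeddings.

```
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    """Map a CLIP image embedding to `prefix_length` GPT-2-sized embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        hidden = (clip_dim + gpt_dim * prefix_length) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding):
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_length, self.gpt_dim)

# The prefix is concatenated with the caption's token embeddings and fed to GPT-2,
# which is trained to predict the caption conditioned on the prefix.
```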

## COCO Examples

<table>
<tr>
<td><img src="Images/COCO_val2014_000000562207.jpg" ></td>
<td><img src="Images/COCO_val2014_000000165547.jpg" ></td>
<td><img src="Images/COCO_val2014_000000579664.jpg" ></td>
</tr>
<tr>
<td>A couple of people standing next to an elephant. </td>
<td>A wooden table sitting in front of a window.</td>
<td>A bunch of bananas sitting on top of a table.</td>
</tr>
</table>

<table>
<tr>
<td><img src="Images/COCO_val2014_000000060623.jpg" ></td>
<td><img src="Images/COCO_val2014_000000386164.jpg" ></td>
<td><img src="Images/COCO_val2014_000000354533.jpg" ></td>
</tr>
<tr>
<td>A woman holding a plate with a piece of cake in front of her face. </td>
<td>A wooden table topped with lots of wooden utensils.</td>
<td>A red motorcycle parked on top of a dirt field.</td>
</tr>
</table>


## Conceptual Captions Examples

<table>
<tr>
<td><img src="Images/CONCEPTUAL_01.jpg" ></td>
<td><img src="Images/CONCEPTUAL_02.jpg" ></td>
<td><img src="Images/CONCEPTUAL_03.jpg" ></td>
</tr>
<tr>
<td>3D render of a man holding a globe.</td>
<td>Students enjoing the cherry blossoms</td>
<td>Green leaf of lettuce on a white plate.</td>
</tr>
</table>

<table>
<tr>
<td><img src="Images/CONCEPTUAL_04.jpg" ></td>
<td><img src="Images/CONCEPTUAL_05.jpg" ></td>
<td><img src="Images/CONCEPTUAL_06.jpg" ></td>
</tr>
<tr>
<td>The hotel and casino on the waterfront. </td>
<td>The triangle is a symbol of the soul.</td>
<td>Cartoon boy in the bath.</td>
</tr>
</table>


## Inference Notebooks
To help visualize the results, we provide a Colab notebook, found in `notebooks/clip_prefix_captioning_inference.ipynb`.
The notebook downloads the pretrained models and runs inference on sample images or on images of your choosing. It is recommended to run it in [Google Colab](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing).
An inference notebook for the **transformer mapping network (without fine-tuning GPT-2)** is available [here](https://colab.research.google.com/drive/180L3rMFmGujudwO1EJNF-lHIpAsAZ5xq?usp=sharing) for the COCO model (also in `notebooks/transformer_inference.ipynb`).



Both [COCO](https://drive.google.com/file/d/1IdaBtMSvtyzF0ByVaBHtvM0JYSXRExRX/view?usp=sharing) and [Conceptual Captions](https://drive.google.com/file/d/14pXWwB4Zm82rsDdvbGguLfx9F8aM7ovT/view?usp=sharing) pretrained models are available for the MLP mapping network. For the transformer mapping network (without fine-tuning GPT-2) we provide a [COCO](https://drive.google.com/file/d/1GYPToCqFREwi285wPLhuVExlz7DDUDfJ/view?usp=sharing) pretrained model.
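
For reference, the CLIP embedding that these models consume can be computed as follows (a sketch using the `clip` package and a hypothetical `example.jpg`; the notebook additionally loads the pretrained mapping network and decodes the caption with GPT-2):

```
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    prefix = clip_model.encode_image(image).float()  # (1, 512) image embedding
# `prefix` is then mapped to prefix embeddings and prepended to GPT-2 to generate the caption.
```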



## Inference GUI
1. Run it [in the browser](https://replicate.ai/rmokady/clip_prefix_caption) using the replicate.ai UI.
2. Integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See the demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/CLIP_prefix_captioning) (beam search is currently not supported).

## References
- [ClipCap: CLIP Prefix for Image Captioning](https://arxiv.org/abs/2111.09734)
- [Original ClipCap GitHub](https://github.com/rmokady/CLIP_prefix_caption.git): CLIP_prefix_caption

Code references:
- [transformers (OPT) GitHub](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py)
- [BLIP2](https://github.com/salesforce/BLIP.git)


## Training prerequisites

[comment]: <> (Dependencies can be found at the [Inference notebook]&#40;https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing&#41; )
Clone, create environment and install dependencies:
```
git clone https://github.com/rmokady/CLIP_prefix_caption && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption
pip install -e "git+https://github.com/replicate/[email protected]#egg=cog&subdirectory=python/"
pip install transformers --upgrade
```

## COCO training
@@ -117,35 +42,33 @@
Train with fine-tuning of GPT2:
```
python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/
```

__If you want to train the model with OPT, please go directly to "Switch your language model from GPT-2 to OPT" below.__
Train only transformer mapping network:
```
python train.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layres 8 --prefix_length 40 --prefix_length_clip 40
```
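
In this mode the GPT-2 weights stay frozen and only the mapping network is optimized; roughly, the setup looks like the sketch below (a stand-in linear mapper and a learning rate that may differ from the repo's defaults):

```
import torch
from transformers import GPT2LMHeadModel

gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
mapping_network = torch.nn.Linear(512, 40 * gpt2_model.config.n_embd)  # stand-in for the transformer mapper

# Freeze the language model; only the mapping network receives gradients.
for param in gpt2_model.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(mapping_network.parameters(), lr=2e-5)
```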

**If you wish to use ResNet-based CLIP:**
```
python parse_coco.py --clip_model_type RN50x4
python train.py --only_prefix --data ./data/coco/oscar_split_RN50x4_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layres 8 --prefix_length 40 --prefix_length_clip 40 --is_rn
```

## Switch your language model from GPT-2 to OPT
We enable training your ClipCap model with OPT. We look forward to making this code work well with the [BLIP model](https://github.com/salesforce/BLIP.git).
Training code is available in `train_OPT.py`; inference code will be updated in `predict_OPT.py`, which essentially runs the Predictor function from `predict.py`.
Please note that you have to manually make sure your desired language model is 'facebook/opt-125m' (the variable named `OPT_MODEL`) in both `predict.py` and `train.py`.

```
python train_OPT.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir /data/daisy/clipcap_output/coco_train/ --only_prefix --device
python predict_nice.py
```
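
For context, switching the backbone mostly amounts to loading an OPT checkpoint and its tokenizer from `transformers` instead of GPT-2. A minimal sketch (the variable name `OPT_MODEL` mirrors the one mentioned above; the exact wiring in `train_OPT.py` may differ):

```
from transformers import AutoTokenizer, OPTForCausalLM

OPT_MODEL = 'facebook/opt-125m'  # must match in both train_OPT.py and predict_OPT.py

tokenizer = AutoTokenizer.from_pretrained(OPT_MODEL)
language_model = OPTForCausalLM.from_pretrained(OPT_MODEL)

# The mapping network must project CLIP features to the language model's embedding width
# (language_model.config.hidden_size, e.g. 768 for opt-125m), just as it targets 768 for GPT-2.
```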

### Model parallelization
- OPT-1.3b: 2 GPUs, 16 GB per GPU, 1h13m per epoch
- OPT-2.7b: 3 GPUs, 18 GB per GPU, 11h per epoch

## Conceptual training

Download the .TSV train/val files from [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/download) and place them under the <data_root> directory.

Download the images and extract the CLIP features with the following command (outputs are `<data_root>/conceptual_clip_ViT-B_32_train.pkl` and `<data_root>/conceptual_clip_ViT-B_32_val.pkl`):
```
python parse_conceptual.py --clip_model_type ViT-B/32 --data_root <data_root> --num_threads 16
```
Note that downloading the images might take a few days.

Train with fine-tuning of GPT2:
```
python train.py --data <data_root>/conceptual_clip_ViT-B_32_train.pkl --out_dir ./conceptual_train/
```
As with the COCO training, you can train a transformer mapping network and/or parse the images using a ResNet-based CLIP.

*Latest update: 2023-04-04*

## Citation
If you use this code for your research, please cite:
42 changes: 42 additions & 0 deletions evaluate.py
@@ -0,0 +1,42 @@
import json

from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.eval import COCOEvalCap
from pycocotools.coco import COCO


# path to your results file (assumed format: a JSON dict mapping image_id -> [ground-truth caption, generated caption])
file_path = 'YOUR_FILE_PATH.json'  # placeholder: set this to your results .json file
# split your file into GroundTruth & Prediction files in COCO annotation format
gt_file_name = './clipcap_opt27_gt.json'
gr_file_name = './clipcap_opt27_gr.json'

gt = {}
gr = {}
with open(file_path, 'r') as f:
    json_data = json.load(f)

gt["annotations"] = []
gt["images"] = []
gr["annotations"] = []
gr["images"] = []
for key, value in json_data.items():
    # one annotation entry per image: value[0] is the ground-truth caption, value[1] the generated caption
    gt["annotations"].append({"image_id": key, "caption": value[0], "id": key})
    gr["annotations"].append({"image_id": key, "caption": value[1], "id": key})
    image_entry = {"id": key}
    gt["images"].append(image_entry)
    gr["images"].append(image_entry)

with open(gt_file_name, 'w') as f_gt, open(gr_file_name, 'w') as f_gr:
    json.dump(gt, f_gt)
    json.dump(gr, f_gr)


# evaluate BLEU-1..4, METEOR, ROUGE-L, CIDEr and SPICE scores
coco_gt = COCO(gt_file_name)
coco_pred = COCO(gr_file_name)
coco_eval = COCOEvalCap(coco_gt, coco_pred)

coco_eval.evaluate()
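
# After evaluate() finishes, the individual metrics are available in coco_eval.eval
# (pycocoevalcap key names: 'Bleu_1'..'Bleu_4', 'METEOR', 'ROUGE_L', 'CIDEr', 'SPICE').
for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.4f}')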
1 change: 1 addition & 0 deletions for_inference/nice_gt.json

Large diffs are not rendered by default.
