This repository contains the implementation of an image captioning model using CLIP (Contrastive Language-Image Pretraining) with a ResNet-50x4 backbone. The model was developed as part of a thesis project and achieved the following performance metrics:
| Model  | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR |
|--------|--------|--------|--------|--------|--------|
| RN50x4 | 0.82   | 0.79   | 0.75   | 0.73   | 0.50   |
CLIP is a powerful vision-language pretraining model that learns joint representations of images and text. It has demonstrated state-of-the-art performance on various vision and language tasks.
The captioning model uses CLIP's ResNet-50x4 image encoder as its backbone for feature extraction and is fine-tuned to generate descriptive captions for images, drawing on the joint understanding of visual and textual data.
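As a concrete illustration of the feature-extraction step, the sketch below loads the CLIP RN50x4 encoder via the `clip` package installed in the steps further down and embeds a single image. The image path is a placeholder, and the actual notebook may wire this up differently.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the RN50x4 backbone together with its matching image preprocessing.
model, preprocess = clip.load("RN50x4", device=device)

# "example.jpg" is a placeholder path.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # RN50x4 produces a 640-dimensional feature vector per image.
    image_features = model.encode_image(image)

print(image_features.shape)  # torch.Size([1, 640])
```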
The dataset used for training consists of 3800 images captured in public spaces, and each image is associated with four captions. This diverse dataset aims to enhance the model's ability to provide detailed and informative captions for various scenarios encountered in public environments.
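The annotation format is not described beyond this, so purely as an illustration, and assuming a simple JSON layout (the real files in this repository may be organized differently), pairing each image with its four captions could look like the snippet below. All file names, field names, and caption texts are hypothetical.

```python
import json

# Hypothetical layout: one record per image, each holding four captions.
records = [
    {
        "image": "images/0001.jpg",
        "captions": [
            "A pedestrian crossing with a traffic light ahead.",
            "People waiting at a zebra crossing.",
            "A crosswalk in front of a row of shops.",
            "A street crossing next to a busy sidewalk.",
        ],
    },
    # ... one record for each of the 3800 images
]

with open("captions.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```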
To run the image captioning model, follow these steps:
- Clone this repository:

  `git clone image-captioning-MM-CLIP-RN50x4-for-visually-impaired-people`
- Install transformers:

  `!pip install transformers`
- Install CLIP:

  `!pip install git+https://github.com/openai/CLIP.git`
- Open the Colab notebook for inference: navigate to the notebook (.ipynb) and follow the steps outlined there, including:
  - a. Image Embedding (a minimal sketch of this step appears after these instructions)
  - b. Train
- The trained model is saved in .pt format. You can find the model weights in the output directory; the file may be named something like `image_captioning_model.pt`.
- Set up Flask deployment:

  `!pip install flask`

  - a. Create a new folder named `deploy` in the project directory.
  - b. Move the trained model file (`image_captioning_model.pt`) to the `deploy` folder.
  - c. Include your Flask deployment script (`main.py`) in the `deploy` folder; a minimal sketch of such a script is shown below.
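For the Image Embedding step in the Colab notebook, the sketch below illustrates one way to precompute CLIP RN50x4 features for the dataset images and store them in a .pt file. The folder name `images/`, the output file name `image_embeddings.pt`, and the dictionary layout are assumptions for illustration, not the notebook's exact code.

```python
import torch
import clip
from PIL import Image
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x4", device=device)

embeddings = {}
# "images/" is a placeholder folder holding the dataset images.
for path in sorted(Path("images").glob("*.jpg")):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        # One 640-dimensional CLIP feature vector per image.
        embeddings[path.name] = model.encode_image(image).squeeze(0).cpu()

# Persist the precomputed features so the training step can reuse them.
torch.save(embeddings, "image_embeddings.pt")
```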
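For the deployment step, the sketch below shows what a minimal `deploy/main.py` could look like. It loads the trained weights from `image_captioning_model.pt` and exposes a single endpoint that accepts an uploaded image and returns a caption. The imports `build_captioning_model` and `generate_caption` stand in for whatever model class and decoding routine the notebook defines; they are placeholders, not functions shipped with this repository.

```python
import io

import torch
import clip
from flask import Flask, request, jsonify
from PIL import Image

# Hypothetical module: replace with the model definition used in the notebook.
from model import build_captioning_model, generate_caption

app = Flask(__name__)
device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP RN50x4 encoder used for image features.
clip_model, preprocess = clip.load("RN50x4", device=device)

# Load the trained captioning weights produced by the training step.
captioner = build_captioning_model()
captioner.load_state_dict(torch.load("image_captioning_model.pt", map_location=device))
captioner.to(device).eval()


@app.route("/caption", methods=["POST"])
def caption():
    # The image is expected as multipart form data under the "image" field.
    uploaded = request.files["image"]
    image = Image.open(io.BytesIO(uploaded.read())).convert("RGB")
    pixels = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        features = clip_model.encode_image(pixels)
        text = generate_caption(captioner, features)
    return jsonify({"caption": text})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

With the server running locally, a request can then be sent with, for example, `curl -X POST -F "image=@example.jpg" http://localhost:5000/caption`.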