This project trains a CNN-RNN model to automatically generate captions for a given image. The main task is to implement an effective RNN decoder for a CNN encoder.

See requirements.txt for the necessary packages (e.g. `pip install -r requirements.txt`).

Important: PyTorch version 0.4.0 is required.
The Microsoft Common Objects in COntext (MS COCO) dataset is used to train the neural network. The final model is then tested on novel images!
The project is structured as a series of Jupyter notebooks that are designed to be completed in sequential order:
- 0_Dataset.ipynb
- 1_Preliminaries.ipynb
- 2_Training.ipynb
- 3_Inference.ipynb

In addition, model.py defines the network architecture.
The architecture consists of two parts:
- a CNN encoder, which converts an input image into an embedded feature vector, and
- an RNN decoder, a sequential network built from LSTM units, which translates the feature vector into a sequence of tokens (a sketch of both modules follows this list).
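For concreteness, here is a minimal sketch of what such an encoder/decoder pair can look like in PyTorch. It illustrates the structure described above, not the exact contents of model.py: the class names, the choice of ResNet-50 as the backbone, and the hyperparameter names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-size feature vector with a pretrained CNN.
    (Illustrative sketch; ResNet-50 is an assumed backbone choice.)"""

    def __init__(self, embed_size):
        super(EncoderCNN, self).__init__()
        resnet = models.resnet50(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad = False          # freeze the convolutional base
        modules = list(resnet.children())[:-1]   # drop the final classification layer
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.resnet(images)
        features = features.view(features.size(0), -1)
        return self.embed(features)              # (batch, embed_size)


class DecoderRNN(nn.Module):
    """Decode a feature vector into a token sequence with an LSTM."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: the image feature acts as the first "word",
        # followed by the embedded ground-truth caption (minus its last token).
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                  # (batch, seq_len, vocab_size)
```

During training, the decoder's per-step vocabulary scores are compared against the ground-truth caption with a cross-entropy loss, which is the standard objective for this kind of captioning model.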
Shown below are some captions generated by the network for test images from the COCO dataset. At inference time the decoder has no ground-truth caption to condition on, so captions are generated greedily, one token at a time (sketched next).
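A minimal sketch of that greedy decoding loop, written as a `sample` method for the hypothetical `DecoderRNN` above; the `<end>` token index and the maximum caption length are assumptions, not values taken from this project:

```python
def sample(self, inputs, states=None, max_len=20, end_idx=1):
    """Greedily decode a caption from an image feature.

    `inputs` has shape (1, 1, embed_size): the encoder output,
    unsqueezed to form a length-1 sequence for the LSTM.
    `end_idx` is the assumed index of the <end> token in the vocabulary.
    """
    caption = []
    for _ in range(max_len):
        hiddens, states = self.lstm(inputs, states)   # one decoding step
        scores = self.fc(hiddens.squeeze(1))          # (1, vocab_size)
        _, predicted = scores.max(1)                  # most likely next token
        caption.append(predicted.item())
        if predicted.item() == end_idx:               # stop at <end>
            break
        inputs = self.embed(predicted).unsqueeze(1)   # feed prediction back in
    return caption
```

The returned list of token indices is then mapped back to words through the vocabulary to produce the captions shown here.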