Image Captioning: Generate natural sentences describing an image.
- CNN: GoogleNet, use the last hidden layer to represent an image and input to the LSTM to generate sentence. Image features only shown at the first time step of LSTM, otherwise, the model will be more prone to overfit.
- LSTM: Used to generate word based on image features and previously generated words:
Sum of the negative log likelihood of the correct word at each step:
- Use pre-trained GoogleNet to initialize CNN and fix its weights during training.
- Change parameters of embedding layer and LSTM during training.
Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.