Neural Image Caption Generation with Visual Attention [presentation] [link]
Anton Karazeev, 493 group
Good afternoon ladies and gentlemen.
Let me introduce myself. I am Anton, a third-year student at MIPT, studying applied mathematics and physics.
Today I would like to give a general overview of a new technique called Visual Attention.
I am going to develop three main points. First, I would like to introduce the Image Caption Generation task. Secondly, I will tell you about Attention and about the two types of Attention mechanisms. Thirdly, I am going to present the results achieved with this new technique.
After my talk there will be time for a discussion and any questions. That is all for the introduction.
Now let’s move to the first part of my talk, which is about the article I chose and the general task of Image Caption Generation. The most prominent researcher on this team is, of course, Yoshua Bengio, who is co-director of a CIFAR program and is known for his work on artificial intelligence and deep learning. The main task of the neural network is to generate a caption for a given image (this task is also called scene understanding). Until recently this task had been only partly solved. In this article the researchers presented a method that improves on previous results and introduced Attention. Attention allows salient features to dynamically come to the forefront as needed. Roughly speaking, it lets us highlight what the neural network is looking at.
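To make this concrete, here is a minimal, self-contained sketch (my own toy illustration, not code from the paper) of an attention-based captioning loop: a CNN encodes the image into a grid of feature vectors, and at every step the decoder attends over that grid, builds a context vector, and emits the next word. All the sizes, weights and the `<start>` token id below are invented stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy stand-ins for a trained model (random weights, invented sizes):
# 196 locations of a 14x14 CNN feature map, 512-dim features,
# a 256-dim decoder state and a 1000-word vocabulary.
L, D, H, V = 196, 512, 256, 1000
features = rng.normal(size=(L, D))            # annotation vectors a_1..a_L
W_att = rng.normal(size=(H + D,)) * 0.01      # scores a location from (h, a_i)
W_dec = rng.normal(size=(H + D + H, H)) * 0.01
W_out = rng.normal(size=(H, V)) * 0.01
E = rng.normal(size=(V, H)) * 0.01            # word embeddings

h, word, caption = np.zeros(H), 0, []         # word 0 plays the <start> token
for _ in range(20):                           # generate at most 20 words
    scores = np.array([np.concatenate([h, a]) @ W_att for a in features])
    alpha = softmax(scores)                   # where to look: weights over locations
    context = alpha @ features                # attended context vector z_t
    h = np.tanh(np.concatenate([E[word], context, h]) @ W_dec)  # toy RNN step
    word = int(np.argmax(h @ W_out))          # greedy word choice
    caption.append(word)
print(caption)                                # word ids of the generated "caption"
```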
So now we come to the two types of Attention: “Soft” and “Hard”. “Soft” Attention is a fully differentiable, deterministic mechanism, while “Hard” Attention is a stochastic process. Both can be trained with gradient-based methods, but nowadays the trend is to focus on the “Soft” Attention mechanism, because its gradient can be computed directly by standard backpropagation, whereas “Hard” Attention requires a sampling-based estimate of the gradient.
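A rough sketch of the difference, with made-up shapes: “Soft” attention takes a deterministic weighted average of all feature vectors, so the gradient with respect to the weights is exact, while “Hard” attention samples a single location from the same distribution, which is why the paper trains it with a sampling-based (REINFORCE-style) estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

features = rng.normal(size=(196, 512))   # e.g. a 14x14 grid of CNN features
scores = rng.normal(size=196)            # attention scores e_i from a small network
alpha = softmax(scores)                  # attention weights, sum to 1

# "Soft" attention: a deterministic expectation over all locations.
# Fully differentiable, so gradients flow through alpha directly.
z_soft = alpha @ features                # context vector, shape (512,)

# "Hard" attention: sample a single location s ~ Multinoulli(alpha).
# Not differentiable, so training relies on a sampling-based
# (REINFORCE-like) gradient estimator instead.
s = rng.choice(len(alpha), p=alpha)
z_hard = features[s]                     # context vector, shape (512,)
```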
Finally, I want to tell you about the results of this work. I will start with the metrics that were used to evaluate the captions generated by the neural network. There are two metrics, BLEU and METEOR. Both come from machine translation, where they measure the correspondence between a machine’s output and that of a human. This table shows that the new Attention technique outperforms all previous algorithms on every benchmark dataset.
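As an illustration of what BLEU actually measures, here is a toy computation with NLTK’s sentence_bleu: the score rewards n-gram overlap between a candidate caption and one or more references. The sentences are invented, and the BLEU-2 weights and smoothing are my own choices, not the paper’s evaluation setup.

```python
# A toy BLEU computation with NLTK (sentences invented for illustration).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "dog", "is", "running", "on", "the", "grass"]]
candidate = ["a", "dog", "runs", "on", "grass"]

smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate,
                      weights=(0.5, 0.5),        # BLEU-2: unigrams and bigrams
                      smoothing_function=smooth)
print(f"BLEU-2: {score:.3f}")
```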
I am approaching the end of my talk, so I will briefly summarize the main points. I gave an explanation of Neural Image Captioning and showed the difference between “Soft” and “Hard” Attention.
In conclusion, I want to mention a quote attributed to Abraham Lincoln: “The best way to predict the future is to create it.”
Thank you for your attention. If you have any questions, I would be happy to answer them.