Review — Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

With Attention, Show, Attend and Tell Outperforms Show and Tell

Show, Attend and Tell (Figure from https://zhuanlan.zhihu.com/p/32333802)

Outline

Show, Attend and Tell Network Architecture

1. CNN Encoder
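The encoder takes activations from a lower convolutional layer rather than the final fully connected layer, so each image yields L annotation vectors a_i of dimension D — 196 vectors of 512 dimensions for the 14×14×512 feature map used in the paper. A minimal NumPy sketch of this reshaping (the feature map here is random stand-in data, not a real CNN output):

```python
import numpy as np

# Stand-in for a conv-layer feature map from a CNN backbone
# (the paper uses a 14x14x512 map, e.g. from VGGnet).
feature_map = np.random.rand(14, 14, 512)

# Flatten spatial positions: L = 14*14 = 196 annotation vectors,
# each of dimension D = 512. The decoder attends over these.
L, D = 14 * 14, 512
annotations = feature_map.reshape(L, D)

print(annotations.shape)  # (196, 512)
```

Each row of `annotations` corresponds to one spatial location of the image, which is what lets the decoder focus on image regions.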

2. Attention Decoder

2.1. Attention Decoder

Left: CNN Encoder, Right: Attention Decoder
Relationships between annotation vectors ai and weights αit (Figure from https://www.youtube.com/watch?v=y1S3Ri7myMg)
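At each step t, an attention model f_att scores every annotation vector a_i against the previous decoder hidden state h_{t-1}, and a softmax turns the scores e_ti into weights α_ti that sum to 1 over the L locations. A hedged NumPy sketch, with f_att taken as a single-hidden-layer MLP as in the paper (the weight matrices here are random illustrative stand-ins, not learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, H = 196, 512, 256           # locations, annotation dim, hidden dim

a = rng.normal(size=(L, D))        # annotation vectors a_i
h_prev = rng.normal(size=H)        # decoder hidden state h_{t-1}

# f_att: one-hidden-layer MLP scoring each location (stand-in weights)
W_a = rng.normal(size=(D, H))
W_h = rng.normal(size=(H, H))
w = rng.normal(size=H)

e = np.tanh(a @ W_a + h_prev @ W_h) @ w   # scores e_ti, shape (L,)

# Softmax -> attention weights alpha_ti, non-negative, summing to 1
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

print(alpha.shape)  # (196,)
```

The resulting α_t is a distribution over image locations; how it is turned into a context vector is where soft and hard attention differ.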

2.2. Stochastic “Hard” Attention & Deterministic “Soft” Attention

Soft and Hard Attention (Figure from https://www.youtube.com/watch?v=y1S3Ri7myMg)
Soft Attention (Figure from https://www.youtube.com/watch?v=y1S3Ri7myMg)
Hard Attention (Figure from https://www.youtube.com/watch?v=y1S3Ri7myMg)
Examples of soft (top) and hard (bottom) attention
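Soft attention forms the context vector as the expectation z_t = Σ_i α_ti a_i, which is differentiable and can be trained with standard backpropagation; hard attention instead samples a single location s_t from the distribution α_t and uses z_t = a_{s_t}, which is non-differentiable and is trained in the paper with a REINFORCE-style gradient estimator. A NumPy sketch of the two context computations, with the attention weights drawn at random purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
L, D = 196, 512
a = rng.normal(size=(L, D))            # annotation vectors a_i
alpha = rng.dirichlet(np.ones(L))      # illustrative attention weights

# Soft attention: context is the expectation over all locations.
z_soft = alpha @ a                      # shape (D,)

# Hard attention: sample one location s_t ~ Multinoulli(alpha)
# and take its annotation vector directly.
s = rng.choice(L, p=alpha)
z_hard = a[s]                           # shape (D,)

print(z_soft.shape, z_hard.shape)
```

Because the sampling step in hard attention blocks gradients, plain backprop only works for the soft variant; the hard variant needs a stochastic gradient estimator.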

3. Experimental Results

BLEU-1, 2, 3, 4 / METEOR scores compared with other methods
Examples of attending to the correct object
Examples of mistakes where we can use attention to gain intuition into what the model saw
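The paper reports BLEU-1 through BLEU-4 and METEOR. As a reminder of what BLEU-n measures, here is a pure-Python sketch of clipped n-gram precision, the core of BLEU (omitting the brevity penalty and the geometric mean over n used by the full metric; the two sentences are made-up examples):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: fraction of candidate n-grams
    that also appear in the reference (counts clipped)."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return matched / max(len(cand), 1)

cand = "a woman is throwing a frisbee in a park".split()
ref = "a woman throwing a frisbee in a park".split()
print(round(ngram_precision(cand, ref, 1), 2))  # 0.89
```

Higher-order BLEU-n (n = 2, 3, 4) applies the same idea to bigrams, trigrams, and 4-grams, so it rewards longer matching phrases, not just word overlap.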

Reference
[2015 ICML] [Show, Attend and Tell]
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention



PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG