Review — Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

Using CNN+RNN for Captioning, Generate Sentence from Image

Examples of the generated and two top-ranked retrieved sentences given the query image from the IAPR TC-12 dataset
  • The model consists of two sub-networks, a deep recurrent neural network for sentences and a deep convolutional network for images, which interact with each other in a multimodal layer.


  1. Simple Recurrent Neural Network (RNN)
  2. Multimodal Recurrent Neural Network (m-RNN)
  3. Training the m-RNN
  4. Experimental Results

1. Simple Recurrent Neural Network (RNN)

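  • With input word w(t) and previous recurrent state r(t-1), the recurrent state r(t) and the output y(t) at time t are calculated as follows: x(t) = [w(t), r(t-1)], r(t) = f1(U · x(t)), y(t) = g1(V · r(t)), where f1 is the element-wise sigmoid and g1 is the softmax.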
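As a rough sketch of one time step of such a simple RNN, assuming the usual formulation in which the input concatenates the word vector w(t) with the previous recurrent state r(t-1), the computation might look like this in plain Python. The helper names, matrix shapes and toy weights are illustrative assumptions, not values from the paper:

```python
import math

def matvec(M, v):
    """Multiply a matrix M (a list of rows) by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(v):
    """Element-wise logistic sigmoid (f1)."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(v):
    """Softmax (g1), shifted by the maximum for numerical stability."""
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def simple_rnn_step(w_t, r_prev, U, V):
    """One time step of a simple RNN; returns (r_t, y_t)."""
    x_t = w_t + r_prev                 # x(t) = [w(t), r(t-1)] (list concatenation)
    r_t = sigmoid(matvec(U, x_t))      # r(t) = f1(U · x(t))
    y_t = softmax(matvec(V, r_t))      # y(t) = g1(V · r(t))
    return r_t, y_t

# Toy sizes (hypothetical): 3-word vocabulary, 2 recurrent units.
U = [[0.1, -0.2, 0.3, 0.5, -0.5],
     [0.0, 0.4, -0.1, 0.2, 0.2]]
V = [[0.3, -0.3], [0.1, 0.2], [-0.2, 0.4]]
r_t, y_t = simple_rnn_step([1.0, 0.0, 0.0], [0.0, 0.0], U, V)
```

Here y_t is a probability distribution over the three toy words, and r_t is fed back as r_prev at the next step.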

2. Multimodal Recurrent Neural Network (m-RNN)

The multimodal Recurrent Neural Network (m-RNN) architecture

2.1. Word Embedding

  • The two word embedding layers embed the one-hot input into a dense word representation, which encodes both the syntactic and semantic meaning of the words.

2.2. Recurrent Layer

  • After the two word embedding layers, there is a recurrent layer with 256 dimensions. It first maps r(t-1) into the same vector space as w(t) and then adds the two together: r(t) = f2(U_r · r(t-1) + w(t)), where f2 is the ReLU.
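A minimal sketch of this recurrent update, assuming a ReLU non-linearity and a square matrix U_r that maps the previous state into the word-embedding space (the function names and the 2-dimensional toy values are hypothetical; the real layer has 256 dimensions):

```python
def matvec(M, v):
    """Multiply a matrix M (a list of rows) by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def mrnn_recurrent_step(w_t, r_prev, U_r):
    """r(t) = ReLU(U_r · r(t-1) + w(t)); w_t and r_prev share one dimension."""
    mapped = matvec(U_r, r_prev)       # map r(t-1) into the space of w(t)
    return relu([m + w for m, w in zip(mapped, w_t)])

# Toy example with 2 dimensions instead of 256.
U_r = [[1.0, 0.0],
       [0.0, -1.0]]
r_t = mrnn_recurrent_step([0.5, 0.5], [1.0, 2.0], U_r)
# r_t == [1.5, 0.0]: the second component is negative before the ReLU.
```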

2.3. Multimodal Layer

  • After the recurrent layer, there is a 512-dimensional multimodal layer that connects the language model part and the vision part of the m-RNN model.
  • This layer has three inputs: the word-embedding layer II, the recurrent layer and the image representation.
  • The image representation can be the activation of the 7th layer of AlexNet or the 15th layer of VGGNet.
  • The activations of the three layers are mapped to the same multimodal feature space and added together to obtain the activation of the multimodal layer: m(t) = g2(V_w · w(t) + V_r · r(t) + V_I · I), where g2 is the element-wise scaled hyperbolic tangent g2(x) = 1.7159 · tanh(2x/3).
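The fusion step can be sketched as three linear maps into a shared space followed by an element-wise non-linearity. The sketch below assumes a scaled hyperbolic tangent, 1.7159 · tanh(2x/3), as that non-linearity; the function names, the tiny dimensions and the weights are illustrative assumptions only:

```python
import math

def matvec(M, v):
    """Multiply a matrix M (a list of rows) by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def scaled_tanh(v):
    """Element-wise scaled hyperbolic tangent (assumed form of g2)."""
    return [1.7159 * math.tanh(2.0 * a / 3.0) for a in v]

def multimodal_layer(w_t, r_t, image_feat, V_w, V_r, V_I):
    """m(t) = g2(V_w · w(t) + V_r · r(t) + V_I · I)."""
    from_word = matvec(V_w, w_t)           # word-embedding layer II
    from_rnn = matvec(V_r, r_t)            # recurrent layer
    from_image = matvec(V_I, image_feat)   # CNN image representation
    summed = [a + b + c for a, b, c in zip(from_word, from_rnn, from_image)]
    return scaled_tanh(summed)

# Toy example: 2-d multimodal space, 2-d word/recurrent vectors, 3-d image feature.
V_w = [[1.0, 0.0], [0.0, 1.0]]
V_r = [[0.5, 0.0], [0.0, 0.5]]
V_I = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
m_t = multimodal_layer([1.0, 0.0], [1.0, 0.0], [-1.0, 2.0, 3.0], V_w, V_r, V_I)
```

Because the three inputs are summed rather than concatenated, each projection matrix has as many rows as the multimodal layer has units, whatever the size of its own input.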

2.4. Output Layer

  • Both the simple RNN and m-RNN models have a softmax layer that generates the probability distribution of the next word.
  • The dimension of this layer is the vocabulary size M, which is different for different datasets.
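To make the role of the vocabulary size M concrete, here is a small sketch of a softmax output over a toy vocabulary. The vocabulary, the pre-softmax scores and the function names are hypothetical; in the model the scores would come from a linear map of the preceding layer's activation:

```python
import math

def softmax(scores):
    """Turn M real-valued scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(scores, vocab):
    """Pair each vocabulary word with its softmax probability."""
    if len(scores) != len(vocab):
        raise ValueError("the output layer needs one score per vocabulary word")
    return dict(zip(vocab, softmax(scores)))

vocab = ["a", "dog", "runs", "END"]    # hypothetical vocabulary, M = 4
scores = [0.2, 2.0, 0.5, -1.0]         # hypothetical pre-softmax activations
dist = next_word_distribution(scores, vocab)
best = max(dist, key=dist.get)         # greedy choice of the next word
```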

3. Training the m-RNN

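  • A log-likelihood cost function is used to train the m-RNN. It is related to the perplexity of the sentences in the training set given their corresponding images. Perplexity is a standard measure for evaluating language models. The perplexity of one word sequence (i.e. a sentence) w1:L given image I is calculated as follows: log2 PPL(w1:L | I) = -(1/L) · Σ_{n=1..L} log2 P(wn | w1:n-1, I).
  • The cost function of the model is the average negative log-likelihood of the words given their context words and corresponding images in the training sentences, plus a regularization term: C = (1/N) · Σ_{i=1..Ns} Li · log2 PPL(w(i)1:Li | I(i)) + λθ · ||θ||², where Ns is the number of training sentences, N is the total number of words, Li is the length of the i-th sentence and θ denotes the model parameters.
  • The training objective is to minimize this cost function, which is equivalent to maximizing the probability of generating the training sentences given their images.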
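The relation between perplexity and the cost can be sketched numerically. The code below assumes the per-word probabilities P(wn | w1:n-1, I) have already been produced by the model, and takes the cost to be the word-weighted average of log2 perplexity plus an L2 penalty; the function names and toy probabilities are illustrative assumptions:

```python
import math

def log2_perplexity(word_probs):
    """log2 PPL of one sentence: -(1/L) * sum of log2 P(w_n | context, image)."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def cost(sentences, theta, lam):
    """Word-weighted average of log2 perplexity plus lam * ||theta||^2."""
    n_words = sum(len(s) for s in sentences)
    data_term = sum(len(s) * log2_perplexity(s) for s in sentences) / n_words
    reg_term = lam * sum(t * t for t in theta)
    return data_term + reg_term

# Two toy sentences given as lists of per-word model probabilities.
short = [0.5, 0.5]                   # log2 PPL = 1, i.e. PPL = 2
longer = [0.25, 0.25, 0.25, 0.25]    # log2 PPL = 2, i.e. PPL = 4
c = cost([short, longer], theta=[1.0, 2.0], lam=0.1)
```

Raising any per-word probability lowers the data term, which is why minimizing the cost amounts to making the training sentences more probable.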

4. Experimental Results

4.1. Tasks & Datasets

  • 3 tasks are evaluated: sentence generation; image retrieval (retrieving the most relevant images for a given sentence); sentence retrieval (retrieving the most relevant sentences for a given image).
  • 4 benchmark datasets with sentence level annotations: IAPR TC-12, Flickr8K, Flickr30K and MS COCO.

4.2. IAPR TC-12

Results of the sentence generation task on the IAPR TC-12 dataset. “B” is short for BLEU
  • Although the RNN baseline achieves a low perplexity, its BLEU score is low, indicating that it fails to generate sentences that are consistent with the content of the images.
R@K and median rank (Med r) for IAPR TC-12 dataset
  • R@K is the recall rate of a correctly retrieved ground truth among the top K candidates; higher R@K usually means better performance. Med r is another metric: the median rank of the first retrieved ground-truth sentence or image. Lower Med r usually means better performance.
  • For the retrieval tasks, there are no publicly available results on this dataset.
  • The results show that 20.9% of the top-ranked retrieved sentences and 13.2% of the top-ranked retrieved images are ground truth.
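Both retrieval metrics can be computed from one number per query: the 1-based rank at which the first ground-truth item appears. A short sketch, with hypothetical function names and made-up ranks:

```python
import statistics

def recall_at_k(first_hit_ranks, k):
    """Fraction of queries whose first ground-truth item is ranked within the top k."""
    return sum(1 for r in first_hit_ranks if r <= k) / len(first_hit_ranks)

def median_rank(first_hit_ranks):
    """Median of the first ground-truth ranks (Med r); lower is better."""
    return statistics.median(first_hit_ranks)

# Made-up ranks of the first ground-truth result for five queries.
ranks = [1, 3, 12, 2, 40]
r_at_1 = recall_at_k(ranks, 1)     # 0.2
r_at_5 = recall_at_k(ranks, 5)     # 0.6
r_at_10 = recall_at_k(ranks, 10)   # 0.6
med_r = median_rank(ranks)         # 3
```

Read this way, the 20.9% figure for sentence retrieval is an R@1 value: the top-ranked sentence is a ground-truth sentence for 20.9% of the query images.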

4.3. Flickr8K

Results of R@K and median rank (Med r) for Flickr8K dataset
  • Even without the help of an object detector such as R-CNN, which some of the compared methods rely on, the proposed method performs better than them on almost all the evaluation metrics.
  • The PPL, B-1, B-2, B-3 and B-4 of the sentences generated by the m-RNN-AlexNet model on this dataset are 24.39, 0.565, 0.386, 0.256 and 0.170 respectively. (These numbers are not in the table above.)

4.4. Flickr30K & MS COCO

Results of R@K and median rank (Med r) for Flickr30K dataset and MS COCO dataset
Results of generated sentences on the Flickr30K dataset and MS COCO dataset
Properties of the recurrent layers for the five very recent methods
  • LRCN has a stack of four 1000-dimensional LSTM layers.
Results of the MS COCO test set evaluated by MS COCO evaluation server
  • “-c5” represents results using 5 reference sentences and “-c40” represents results using 40 reference sentences.

4.5. Reranking

The original rank of the hypotheses and the rank after consensus reranking (CIDEr)
  • The reranking process is able to improve the rank of good captions.
  • (Please feel free to read the paper about the details of reranking. Also, there are still other results in the paper.)
