Review — Deep Visual-Semantic Alignments for Generating Image Descriptions

CNN for Images, Bidirectional RNN for Sentences, Generating Descriptions over Image Regions

Motivation/Concept Figure: The proposed model treats language as a rich label space and generates descriptions of image regions


1. R-CNN+RNN Model Framework for Regions

CNN+RNN Model Framework

1.1. Image Representation

1.2. Sentence Representation

1.3. Alignment Objective

1.4. Decoding Text Segment Alignments to Images

Markov Random Field (MRF)

2. Multimodal RNN Model Framework for Full Image

CNN+RNN Model Framework

3. Experimental Results

3.1. Regions

Image-Sentence ranking experiment results. R@K is Recall@K (high is good). Med r is the median rank (low is good).
Example alignments predicted by the proposed model
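
To make the two ranking metrics in the caption above concrete, here is a minimal sketch of how R@K and Med r can be computed. It assumes each query (an image or a sentence) has already been reduced to the 1-based rank of its correct match; the function names and the sample ranks are hypothetical and are not taken from the paper's evaluation code.

```python
from statistics import median

def recall_at_k(ranks, k):
    """Fraction of queries whose correct match is ranked in the top k.

    `ranks` holds 1-based ranks of the ground-truth item, one per query.
    """
    return sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    """Median position of the ground-truth item (lower is better)."""
    return median(ranks)

# Hypothetical ranks for five queries, for illustration only.
example_ranks = [1, 3, 7, 12, 2]

print(recall_at_k(example_ranks, 1))   # 0.2
print(recall_at_k(example_ranks, 5))   # 0.6
print(recall_at_k(example_ranks, 10))  # 0.8
print(median_rank(example_ranks))      # 3
```

A higher R@K means the correct item lands near the top more often, while a lower Med r means the typical correct item sits closer to rank 1.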

3.2. Full Image

Evaluation of full image predictions on 1,000 test images. B-n is BLEU score that uses up to n-grams.
Example sentences generated by the multimodal RNN for test images
Example region predictions
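
As a rough illustration of what "B-n" means in the caption above, the sketch below computes a sentence-level BLEU score from clipped n-gram precisions for n-gram orders 1 through n, with the standard brevity penalty. It is a simplification under stated assumptions: a single reference per candidate, whitespace tokenization, and no smoothing. The function names and example sentences are hypothetical, and this is not the evaluation script behind the reported numbers.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n(candidate, reference, max_n):
    """Single-reference, unsmoothed BLEU using n-gram orders 1..max_n."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives zero BLEU
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the precisions with uniform weights.
    return bp * math.exp(sum(log_precisions) / max_n)

# Hypothetical sentences, for illustration only.
cand = "a man is riding a horse".split()
ref = "a man is riding a brown horse".split()

for n in range(1, 5):
    print(f"B-{n}: {bleu_n(cand, ref, n):.3f}")
```

Because each B-n averages in higher-order n-gram precisions, scores typically fall as n grows, which is why B-4 is the strictest of the four columns.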
