Brief Review — Learning Visual N-Grams from Web Data

Visual N-Grams for Image Captioning & Image Retrieval

Sik-Ho Tsang
4 min read · Jul 30, 2022
Four high-scoring visual n-grams for three images in the test set, according to the visual n-gram model, which was trained solely on unsupervised web data

Learning Visual N-Grams from Web Data, Visual N-Grams, by University of Maryland and Facebook AI Research
2017 ICCV, Over 60 Citations (Sik-Ho Tsang @ Medium)
Image Captioning, Image Retrieval

  • This paper explores the training of image-recognition systems on large numbers of images and associated user comments, without using manually labeled images.
  • A Visual N-Grams model is proposed that can predict arbitrary phrases relevant to the content of an image.


Outline

  1. Visual N-Grams
  2. Results

1. Visual N-Grams

1.1. Dataset

  • Models are trained on the YFCC100M dataset, which contains 99.2 million images and associated multi-lingual user comments.
  • Only images with English user comments are selected, leaving a total of 30 million examples for training and testing.
  • Images are rescaled to 256×256 pixels (using bicubic interpolation), and cropped to the central 224×224.
  • A dictionary is built from all English n-grams (with n between 1 and 5) that occur more than 1,000 times in the 30 million English comments. This dictionary contains 142,806 n-grams: 22,869 unigrams, 56,830 bigrams, 32,560 trigrams, 17,351 four-grams, and 13,196 five-grams.
  • The n-gram that ends at the i-th word of comment w is denoted by w^i_{i−n+1}, and the i-th word in comment w by w^i_i.
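
The dictionary construction described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' pipeline: the function names `extract_ngrams` and `build_dictionary` are hypothetical, and tokenization by lowercasing and splitting on whitespace is an assumption, since the text does not say how comments are tokenized.

```python
from collections import Counter


def extract_ngrams(words, max_n=5):
    """Return every n-gram (as a tuple) with 1 <= n <= max_n.

    For each position i, the n-gram of length n that ends at the i-th word
    covers words[i-n+1 .. i].
    """
    ngrams = []
    for i in range(len(words)):
        for n in range(1, max_n + 1):
            start = i - n + 1
            if start < 0:
                break  # not enough preceding words for this n
            ngrams.append(tuple(words[start:i + 1]))
    return ngrams


def build_dictionary(comments, max_n=5, min_count=1000):
    """Keep the n-grams that occur more than min_count times over all comments."""
    counts = Counter()
    for comment in comments:
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        counts.update(extract_ngrams(comment.lower().split(), max_n))
    return {ngram for ngram, count in counts.items() if count > min_count}


# Toy corpus; the threshold is lowered from 1,000 so the result is non-empty.
comments = ["city park at night", "walking in the city park", "park city utah"]
dictionary = build_dictionary(comments, max_n=2, min_count=1)
print(sorted(dictionary))  # [('city',), ('city', 'park'), ('park',)]
```

With the settings from the post (`max_n=5`, `min_count=1000`) and the 30 million English comments, the same procedure would give the 142,806-entry dictionary, up to tokenization details.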

1.2. Naive N-Gram Loss

  • The naive n-gram loss is a standard multi-class logistic loss over all n-grams in the dictionary D.
  • The loss is summed over all n-grams that appear in the sentence w. N-grams that do not appear in the dictionary are ignored.
  • In this loss, E is the n-gram embedding matrix, and the observational likelihood p_obs(·) is a softmax distribution over all in-dictionary n-grams w, governed by the inner product between the image features φ(I; θ) and the n-gram embeddings.
  • Image features φ(I; θ) are produced by a convolutional network φ(·).
  • The naive n-gram loss cannot do language modeling because it does not model a conditional probability. Instead, an ad-hoc conditional distribution is constructed at prediction time from the scores produced by the model, using a “stupid” back-off model [6].
  • The naive n-gram loss has two main disadvantages:
  1. It ignores out-of-dictionary n-grams entirely during training; and
  2. The parameters E that correspond to infrequent in-dictionary words are difficult to pin down.
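
To make the naive loss concrete, here is a small pure-Python sketch under stated assumptions. The softmax is taken over plain inner products (the sign convention is an assumption), the embeddings and image features are made-up 2-D vectors, and `stupid_backoff_score` only follows the general idea of stupid back-off (fall back to a shorter context and scale by a constant). The function names and the value `lam=0.4` are illustrative, not taken from the paper.

```python
import math


def softmax_log_probs(image_features, embeddings):
    """Log-softmax over the inner products <e_w, phi(I)> of all dictionary n-grams."""
    scores = {
        ngram: sum(e * f for e, f in zip(vec, image_features))
        for ngram, vec in embeddings.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    log_norm = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores.values())
    )
    return {ngram: s - log_norm for ngram, s in scores.items()}


def naive_ngram_loss(words, image_features, embeddings, max_n=2):
    """Sum of -log p_obs over the comment's n-grams that are in the dictionary."""
    log_probs = softmax_log_probs(image_features, embeddings)
    loss = 0.0
    for i in range(len(words)):
        for n in range(1, min(max_n, i + 1) + 1):
            ngram = tuple(words[i - n + 1:i + 1])
            if ngram in log_probs:  # out-of-dictionary n-grams are skipped
                loss -= log_probs[ngram]
    return loss


def stupid_backoff_score(ngram, probs, lam):
    """Use the n-gram's own score if available, else lam times the shorter one."""
    if ngram in probs:
        return probs[ngram]
    if len(ngram) == 1:
        return 0.0  # unknown word: nothing left to back off to
    return lam * stupid_backoff_score(ngram[1:], probs, lam)


# Hypothetical 2-D embeddings and image features, purely for illustration.
embeddings = {
    ("city",): [1.0, 0.0],
    ("park",): [0.0, 1.0],
    ("city", "park"): [1.0, 1.0],
}
image_features = [0.5, 0.5]

loss = naive_ngram_loss(["city", "park", "bench"], image_features, embeddings)
probs = {k: math.exp(v) for k, v in softmax_log_probs(image_features, embeddings).items()}
backed_off = stupid_backoff_score(("old", "park"), probs, lam=0.4)
print(loss, backed_off)
```

Note that “bench” and “park bench” contribute nothing to the loss, which is exactly the first disadvantage listed above.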

1.3. Jelinek-Mercer (J-M) Loss

  • The loss is inspired by Jelinek-Mercer smoothing. It is defined through the likelihood of a word conditioned on the (n−1) words appearing before it.
  • In this definition, the conditioning on φ(I; θ) and E is dropped for brevity. The parameter λ is a smoothing constant that governs how much of the probability mass from (n−1)-grams is (recursively) transferred to both in-dictionary and out-of-dictionary n-grams.
  • Because of this transfer, the model can learn from low-frequency and out-of-vocabulary n-grams.
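
The recursive transfer of probability mass can be sketched with standard Jelinek-Mercer interpolation. This is a hedged illustration rather than the paper's exact formula: it assumes the smoothed probability is `lam` times the observational likelihood of the full n-gram plus `(1 - lam)` times the smoothed probability under the shorter context, treats out-of-dictionary n-grams as having observational likelihood 0, and assumes every word has a nonzero unigram likelihood. The numbers in `p_obs` are invented; in the model they would depend on the image.

```python
import math


def jm_probability(ngram, p_obs, lam):
    """Jelinek-Mercer smoothed probability of the last word given its context.

    ngram: tuple of words; the last entry is the word being predicted.
    p_obs: dict of observational likelihoods for in-dictionary n-grams.
    lam:   weight on the current order; (1 - lam) goes to the shorter context.
    """
    observed = p_obs.get(ngram, 0.0)  # out-of-dictionary n-grams contribute 0
    if len(ngram) == 1:
        return observed
    return lam * observed + (1.0 - lam) * jm_probability(ngram[1:], p_obs, lam)


def jm_loss(words, p_obs, lam, max_n=3):
    """Negative log-likelihood of a comment under the smoothed model.

    Assumes each word has a nonzero unigram likelihood; otherwise
    math.log raises ValueError.
    """
    loss = 0.0
    for i in range(len(words)):
        start = max(0, i - max_n + 1)
        loss -= math.log(jm_probability(tuple(words[start:i + 1]), p_obs, lam))
    return loss


# Hypothetical likelihoods and smoothing constant, for illustration only.
p_obs = {("city",): 0.2, ("park",): 0.1, ("city", "park"): 0.5}
lam = 0.75

print(jm_probability(("city", "park"), p_obs, lam))  # in-dictionary bigram
print(jm_probability(("old", "park"), p_obs, lam))   # out-of-dictionary bigram
print(jm_loss(["city", "park"], p_obs, lam))
```

The out-of-dictionary bigram “old park” still receives a nonzero probability through the unigram “park”, so it contributes a finite term to the loss instead of being ignored.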

2. Results

2.1. Image Tagging

Phrase-prediction performance on YFCC100M test set
  • The Visual N-Gram model obtains the highest accuracy. The figure at the top shows some examples.

2.2. Image Retrieval

Four highest-scoring images for n-gram queries “Market Street”, “street market”, “city park”, and “Park City” from a collection of 931,588 YFCC100M images.
  • The model has learned accurate visual representations for n-grams such as “Market Street” and “street market”, as well as for “city park” and “Park City”.
Four highest-scoring images for n-gram queries “Washington State”, “Washington DC”, “Washington Nationals”, and “Washington Capitals” from a collection of 931,588 YFCC100M test images.
  • The above figure shows that the Visual N-Gram model can distinguish visual concepts related to Washington: the state, the city, the baseball team, and the hockey team.
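
Retrieval with an n-gram query can be pictured as scoring every image against the query's embedding and keeping the top results. The sketch below assumes the score is the inner product between the n-gram embedding and the image features, in line with the softmax described in Section 1.2. `rank_images`, the image ids, and all vectors are hypothetical.

```python
def rank_images(query_embedding, image_features, top_k=4):
    """Return the ids of the top_k images that score highest for the query n-gram."""

    def score(image_id):
        features = image_features[image_id]
        return sum(q * f for q, f in zip(query_embedding, features))

    ranked = sorted(image_features, key=score, reverse=True)
    return ranked[:top_k]


# Made-up 2-D features for three images and one query embedding.
image_features = {
    "img_a": [0.9, 0.1],
    "img_b": [0.2, 0.8],
    "img_c": [0.6, 0.5],
}
query_embedding = [1.0, 0.0]

print(rank_images(query_embedding, image_features, top_k=2))  # ['img_a', 'img_c']
```

Because each n-gram has its own embedding, “Market Street” and “street market” are scored by different vectors, which is what lets them retrieve different images.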

2.3. Caption Retrieval

Caption retrieval performance on YFCC100M test set
  • The strong performance of the proposed visual n-gram models extends to caption retrieval.

The paper combines the n-gram concept from NLP with image features extracted by a CNN, for image captioning and image retrieval.


[2017 ICCV] [Visual N-Grams]
Learning Visual N-Grams from Web Data

Image Captioning

2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell] 2017 [Visual N-Grams]
