Brief Review — Learning Visual N-Grams from Web Data

Visual N-Grams for Image Captioning & Image Retrieval

Jul 30, 2022

**Four high-scoring visual n-grams for three images in our test set according to the visual n-gram model, which was trained solely on unsupervised web data**

Learning Visual N-Grams from Web Data,
Visual N-Grams, by University of Maryland and Facebook AI Research
2017 ICCV, Over 60 Citations (Sik-Ho Tsang @ Medium)
Image Captioning, Image Retrieval

This paper explores the training of image-recognition systems on large numbers of images and associated user comments, without using manually labeled images.
A Visual N-Grams model is proposed that can predict arbitrary phrases relevant to the content of an image.

Outline

1. Visual N-Grams
2. Results

1. Visual N-Grams

1.1. Dataset

Models are trained on the YFCC100M dataset, which contains 99.2 million images and associated multi-lingual user comments.
Only images with English user comments are selected, leaving a total of 30 million examples for training and testing.
Images are rescaled to 256×256 pixels (using bicubic interpolation), and cropped to the central 224×224.
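As a small worked example of the cropping arithmetic, the sketch below computes the central crop box for a 256×256 image. The helper name `center_crop_box` is hypothetical, and the bicubic resize itself is assumed to be done beforehand by an image library of your choice.

```python
from typing import Tuple


def center_crop_box(size: int = 256, crop: int = 224) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the central crop of a square image."""
    if crop > size:
        raise ValueError("crop must not exceed the image size")
    offset = (size - crop) // 2  # 16 pixels trimmed from each side for 256 -> 224
    return (offset, offset, offset + crop, offset + crop)


print(center_crop_box())  # (16, 16, 240, 240)
```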
The dictionary consists of all English n-grams (with n between 1 and 5) that occur more than 1,000 times in the 30 million English comments. It contains 142,806 n-grams: 22,869 unigrams, 56,830 bigrams, 32,560 trigrams, 17,351 four-grams, and 13,196 five-grams.
The n-gram that ends at the i-th word of comment w is denoted by $w_{i-n+1}^{i}$, and the i-th word of comment w by $w_i^i$.
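The dictionary construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and whitespace tokenization with lowercasing is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

NGram = Tuple[str, ...]


def extract_ngrams(words: List[str], max_n: int = 5) -> List[NGram]:
    """Return every n-gram (1 <= n <= max_n), grouped by the word it ends at."""
    ngrams = []
    for i in range(len(words)):  # i is the 0-based index of the last word
        for n in range(1, min(max_n, i + 1) + 1):
            ngrams.append(tuple(words[i - n + 1 : i + 1]))
    return ngrams


def build_dictionary(comments: Iterable[str], min_count: int = 1000,
                     max_n: int = 5) -> Dict[NGram, int]:
    """Keep the n-grams that occur more than `min_count` times over all comments."""
    counts: Counter = Counter()
    for comment in comments:
        # Assumption: lowercase and split on whitespace.
        counts.update(extract_ngrams(comment.lower().split(), max_n))
    return {ngram: c for ngram, c in counts.items() if c > min_count}


# Toy corpus with a tiny threshold so that something survives the filter.
toy_comments = ["Market Street at night", "street market in the city", "Market Street"]
toy_vocab = build_dictionary(toy_comments, min_count=1, max_n=2)
print(toy_vocab)
```

With the toy corpus, only "market", "street", and the bigram "market street" occur more than once, so only those three entries are kept.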

1.2. Naive N-Gram Loss

The naive n-gram loss is a standard multi-class logistic loss over all n-grams in the dictionary D.
The loss is summed over all n-grams that appear in the sentence w; that is, n-grams that do not appear in the dictionary are ignored:

where E is the n-gram embedding matrix, and the observational likelihood p_obs(·) is given by a softmax distribution over all in-dictionary n-grams w that is governed by the inner product between the image features φ(I; θ) and the n-gram embeddings:
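Written out under the definitions above, the loss and the observational likelihood take roughly the following form. This is a sketch consistent with the description, not a verbatim copy of the paper: the summation limits, the symbols $K$ (number of words in the comment), $n_{\max}$ (largest n-gram order, 5 here), $e_w$ (the row of E for n-gram w), and the sign of the inner product are assumptions.

$$
\ell(I, w; \theta, E) = -\sum_{n=1}^{n_{\max}} \sum_{i=n}^{K} \mathbb{1}\left[w_{i-n+1}^{i} \in D\right] \log p_{obs}\left(w_{i-n+1}^{i} \mid \phi(I;\theta); E\right)
$$

$$
p_{obs}\left(w \mid \phi(I;\theta); E\right) = \frac{\exp\left(e_w^{\top}\phi(I;\theta)\right)}{\sum_{w' \in D}\exp\left(e_{w'}^{\top}\phi(I;\theta)\right)}
$$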

Image features φ(I; θ) are produced by a convolutional network φ(·).
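To make the naive loss concrete, here is a minimal pure-Python sketch. It assumes the image features have already been computed by some network and that the softmax uses a plain inner product; the function names and the toy numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def p_obs(features: Sequence[float],
          embeddings: Dict[NGram, Sequence[float]]) -> Dict[NGram, float]:
    """Softmax over all in-dictionary n-grams of the inner product e_w . phi(I)."""
    scores = {ng: sum(e * f for e, f in zip(emb, features))
              for ng, emb in embeddings.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {ng: math.exp(s - top) for ng, s in scores.items()}
    total = sum(exps.values())
    return {ng: v / total for ng, v in exps.items()}


def naive_ngram_loss(words: List[str], features: Sequence[float],
                     embeddings: Dict[NGram, Sequence[float]], max_n: int = 5) -> float:
    """Sum of -log p_obs over the n-grams of the comment that are in the dictionary."""
    probs = p_obs(features, embeddings)
    loss = 0.0
    for i in range(len(words)):
        for n in range(1, min(max_n, i + 1) + 1):
            ngram = tuple(words[i - n + 1 : i + 1])
            if ngram in probs:  # out-of-dictionary n-grams contribute nothing
                loss -= math.log(probs[ngram])
    return loss


# Toy example: three in-dictionary n-grams with 2-d embeddings (made-up numbers).
toy_embeddings = {("market",): [1.0, 0.0], ("street",): [0.0, 1.0],
                  ("market", "street"): [1.0, 1.0]}
toy_features = [0.0, 0.0]  # zero features give a uniform softmax, 1/3 each
print(naive_ngram_loss(["market", "street"], toy_features, toy_embeddings))  # 3 * log(3)
```

A comment made up only of out-of-dictionary words yields a loss of zero, which is exactly the first weakness of this loss: such comments provide no training signal.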
The naive n-gram loss cannot do language modeling because it does not model a conditional probability. Instead, an ad-hoc conditional distribution is constructed at prediction time from the scores produced by the model, using a “stupid” back-off model [6]:
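The back-off idea can be sketched in code as follows. How the per-n-gram conditional score is derived from the model's outputs is not spelled out here, so the sketch simply takes a hypothetical lookup table `cond_score`; the discount `lam = 0.4` is an illustrative default, not a value taken from the paper.

```python
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def backoff_score(words: List[str], i: int, cond_score: Dict[NGram, float],
                  max_n: int = 5, lam: float = 0.4) -> float:
    """Score words[i] given the preceding words, backing off to shorter n-grams."""
    penalty = 1.0
    for n in range(min(max_n, i + 1), 0, -1):  # longest context first
        ngram = tuple(words[i - n + 1 : i + 1])
        if ngram in cond_score:
            return penalty * cond_score[ngram]
        penalty *= lam  # each back-off step multiplies the score by lam
    return 0.0  # not even the unigram is in the dictionary


toy_scores = {("market",): 0.1, ("street",): 0.2, ("market", "street"): 0.5}
toy_words = ["the", "market", "street"]
# The trigram is out of dictionary, so one back-off step is taken: 0.4 * 0.5.
print(backoff_score(toy_words, 2, toy_scores))
```

Because the discounted scores are not renormalized, the result is a score rather than a proper probability, which is where the "stupid" in the name comes from.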

The naive n-gram loss has two main disadvantages:

It ignores out-of-dictionary n-grams entirely during training.
The parameters E that correspond to infrequent in-dictionary words are difficult to pin down.

1.3. Jelinek-Mercer (J-M) Loss

The loss is inspired by Jelinek-Mercer smoothing:

where the likelihood of a word conditioned on the (n−1) words appearing before it is defined as:
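Under the notation introduced earlier, the loss and the interpolated likelihood take roughly the following form. This is a sketch consistent with the description of Jelinek-Mercer smoothing, not a verbatim copy of the paper: the symbol $K$ (number of words in the comment) and the choice of which term carries $\lambda$ and which carries $1-\lambda$ are assumptions.

$$
\ell(I, w; \theta, E) = -\sum_{i=1}^{K} \log p\left(w_i^i \mid w_{i-n+1}^{i-1}, \phi(I;\theta); E\right)
$$

$$
p\left(w_i^i \mid w_{i-n+1}^{i-1}\right) = \lambda\, p_{obs}\left(w_i^i \mid w_{i-n+1}^{i-1}\right) + (1-\lambda)\, p\left(w_i^i \mid w_{i-n+2}^{i-1}\right)
$$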

The conditioning on φ(I; θ) and E is omitted for brevity. The parameter λ is a smoothing constant that governs how much of the probability mass from (n−1)-grams is (recursively) transferred to both in-dictionary and out-of-dictionary n-grams.
With this loss, the model can also learn from low-frequency and out-of-vocabulary n-grams.
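The recursive interpolation can be illustrated with a short sketch. The table `cond_obs` is a hypothetical stand-in for the observed conditional likelihoods, `lam = 0.5` is an arbitrary illustrative value, the unigram base case is an assumption, and `eps` is only there to keep the logarithm finite.

```python
import math
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def jm_prob(words: List[str], i: int, n: int, cond_obs: Dict[NGram, float],
            lam: float = 0.5) -> float:
    """Interpolated likelihood of words[i] given up to n-1 words before it."""
    n = min(n, i + 1)  # cannot look further back than the start of the comment
    ngram = tuple(words[i - n + 1 : i + 1])
    observed = cond_obs.get(ngram, 0.0)  # out-of-dictionary: no observed term
    if n == 1:
        return observed  # assumed base case: the recursion stops at the unigram
    return lam * observed + (1.0 - lam) * jm_prob(words, i, n - 1, cond_obs, lam)


def jm_loss(words: List[str], cond_obs: Dict[NGram, float], max_n: int = 5,
            lam: float = 0.5, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the comment under the interpolated model."""
    return -sum(math.log(max(jm_prob(words, i, max_n, cond_obs, lam), eps))
                for i in range(len(words)))


toy_cond_obs = {("market",): 0.1, ("street",): 0.2, ("market", "street"): 0.5}
# In-dictionary bigram: 0.5 * 0.5 + 0.5 * 0.2
print(jm_prob(["market", "street"], 1, 2, toy_cond_obs))
# Out-of-dictionary trigram still receives mass from the bigram and unigram.
print(jm_prob(["the", "market", "street"], 2, 3, toy_cond_obs))
```

The second call shows the point of the smoothing: the trigram is not in the dictionary, yet its likelihood is non-zero because mass is passed up from the shorter n-grams, so it still produces a gradient during training.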

2. Results

2.1. Image Tagging

**Phrase-prediction performance on YFCC100M test set**

The Visual N-Gram model obtains the highest accuracy. The figure at the top shows some examples.

2.2. Image Retrieval

**Four highest-scoring images for n-gram queries “Market Street”, “street market”, “city park”, and “Park City” from a collection of 931,588 YFCC100M images.**

The model has learned accurate visual representations for n-grams such as “Market Street” and “street market”, as well as for “city park” and “Park City”.

**Four highest-scoring images for n-gram queries “Washington State”, “Washington DC”, “Washington Nationals”, and “Washington Capitals” from a collection of 931,588 YFCC100M test images.**

The above figure shows that the Visual N-Gram model is able to distinguish visual concepts related to Washington: namely, between the state, the city, the baseball team, and the hockey team.
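Retrieval of the highest-scoring images for an n-gram query can be sketched as below. It assumes images are ranked by the raw inner product between the query n-gram's embedding and each image's features; the function name, image ids, and numbers are hypothetical.

```python
from typing import Dict, List, Sequence


def retrieve(query_embedding: Sequence[float],
             image_features: Dict[str, Sequence[float]], k: int = 4) -> List[str]:
    """Return the ids of the k images that score highest for the query n-gram."""
    def score(features: Sequence[float]) -> float:
        return sum(e * f for e, f in zip(query_embedding, features))

    ranked = sorted(image_features, key=lambda img: score(image_features[img]),
                    reverse=True)
    return ranked[:k]


toy_images = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.5, 0.5]}
print(retrieve([1.0, 0.0], toy_images, k=2))  # ['img_a', 'img_c']
```

Because each n-gram has its own embedding, "Market Street" and "street market" are separate queries and can rank the same image collection differently.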