Brief Review — Learning Visual N-Grams from Web Data
Visual N-Grams for Image Captioning & Image Retrieval
4 min read · Jul 30, 2022
Learning Visual N-Grams from Web Data,
Visual N-Grams, by University of Maryland, and Facebook AI Research
2017 ICCV, Over 60 Citations (Sik-Ho Tsang @ Medium)
Image Captioning, Image Retrieval
- This paper explores the training of image-recognition systems on large numbers of images and associated user comments, without using manually labeled images.
- Visual N-Grams is proposed, which can predict arbitrary phrases that are relevant to the content of an image.
Outline
- Visual N-Grams
- Results
1. Visual N-Grams
1.1. Dataset
- Models are trained on the YFCC100M dataset, which contains 99.2 million images and associated multi-lingual user comments.
- Only images with English user comments are selected, leaving a total of 30 million examples for training and testing.
- Images are rescaled to 256×256 pixels (using bicubic interpolation) and cropped to the central 224×224; a preprocessing sketch is given after this list.
- A dictionary is built of all English n-grams (with n between 1 and 5) that occur more than 1,000 times in the 30 million English comments. This dictionary contains 142,806 n-grams: 22,869 unigrams, 56,830 bigrams, 32,560 trigrams, 17,351 four-grams, and 13,196 five-grams (a counting sketch also follows this list).
- The n-gram that ends at the i-th word of comment w is denoted by w^i_(i−n+1), and the i-th word of comment w by w^i_i.
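- As a rough illustration of the preprocessing above, a minimal sketch using torchvision (the paper does not specify a framework; the pipeline name and the tensor conversion step are assumptions):

```python
from torchvision import transforms

# Minimal sketch of the preprocessing described above:
# bicubic rescale to 256x256, then a central 224x224 crop.
preprocess = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # framework-specific tensor conversion (an assumption)
])
```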
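- And a minimal counting sketch for the n-gram dictionary (the function name and the tokenization are assumptions, not from the paper; counting 30 million comments in memory like this is only illustrative, a streaming counter would be needed in practice):

```python
from collections import Counter

def build_ngram_dictionary(tokenized_comments, max_n=5, min_count=1000):
    """Keep every 1- to 5-gram that occurs more than `min_count` times."""
    counts = Counter()
    for tokens in tokenized_comments:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # The paper reports 142,806 n-grams surviving the 1,000-occurrence threshold.
    return {ngram for ngram, c in counts.items() if c > min_count}
```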
1.2. Naive N-Gram Loss
- The naive n-gram loss is a standard multi-class logistic loss over all n-grams in the dictionary D.
- The loss is summed over all n-grams in the comment w that appear in the dictionary; n-grams that do not appear in the dictionary are ignored:
- where E is the n-gram embedding matrix, and the observational likelihood p_obs(·) is a softmax distribution over all in-dictionary n-grams, governed by the inner product between the image features φ(I; θ) and the n-gram embeddings:
- Image features φ(I; θ) are produced by a convolutional network φ(·); a rough code sketch of this loss is given after this list.
- The naive n-gram loss cannot do language modeling because it does not model a conditional probability. At prediction time, an ad-hoc conditional distribution is constructed from the scores produced by the model, using a “stupid” back-off model [6]:
- The naive n-gram loss has two main disadvantages:
- It ignores out-of-dictionary n-grams entirely during training; and
- The parameters E that correspond to infrequent in-dictionary words are difficult to pin down.
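- As referenced above, a minimal PyTorch-style sketch of the naive n-gram loss for a single image-comment pair (the function name, the dense embedding matrix E, and the precomputed dictionary indices are assumptions for illustration, not the paper's actual training code):

```python
import torch
import torch.nn.functional as F

def naive_ngram_loss(image_features, E, ngram_indices):
    """Multi-class logistic loss over all dictionary n-grams.

    image_features : (d,) tensor, phi(I; theta) from the convolutional network
    E              : (|D|, d) tensor, the n-gram embedding matrix
    ngram_indices  : dictionary indices of the in-dictionary n-grams occurring
                     in the comment (out-of-dictionary n-grams are simply skipped)
    """
    scores = E @ image_features                # inner products, shape (|D|,)
    log_p_obs = F.log_softmax(scores, dim=0)   # softmax over the whole dictionary
    targets = torch.as_tensor(ngram_indices)
    return -log_p_obs[targets].sum()           # summed over the comment's n-grams
```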
1.3. Jelinek-Mercer (J-M) Loss
- The loss is inspired by Jelinek-Mercer smoothing:
- where the likelihood of a word conditioned on the (n−1) words appearing before it is defined as:
- φ(I; θ) and E are omitted for brevity. The parameter λ is a smoothing constant that governs how much of the probability mass from (n−1)-grams is (recursively) transferred to both in-dictionary and out-of-dictionary n-grams.
- This allows the model to learn from low-frequency and out-of-vocabulary n-grams (see the sketch after this list).
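- For intuition, the textbook Jelinek-Mercer recursion and the resulting per-word loss can be written as below; this is the generic form of the smoothing, not necessarily the paper's exact parameterization of the image-conditioned likelihood:

```latex
% Generic Jelinek-Mercer smoothing: the n-gram estimate is recursively
% interpolated with the (n-1)-gram estimate via the smoothing constant lambda.
p_{\mathrm{JM}}\left(w_i \mid w_{i-n+1}^{i-1}\right)
  = (1-\lambda)\, p_{\mathrm{obs}}\left(w_i \mid w_{i-n+1}^{i-1}\right)
  + \lambda\, p_{\mathrm{JM}}\left(w_i \mid w_{i-n+2}^{i-1}\right)

% The J-M loss then sums the negative log-likelihood of every word of the
% comment, conditioned on its history and on the image features:
\ell(I, w) = -\sum_{i=1}^{|w|} \log p\left(w_i \mid w_{i-n+1}^{i-1}, \phi(I;\theta); E\right)
```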
2. Results
2.1. Image Tagging
- The Visual N-Grams model obtains the highest accuracy. The figure at the top shows some examples.
2.2. Image Retrieval
- The model has learned accurate visual representations for n-grams such as “Market Street” and “street market”, as well as for “city park” and “Park City”.
- The above figure shows that the Visual N-Grams model is able to distinguish visual concepts related to Washington: the state, the city, the baseball team, and the hockey team.
2.3. Caption Retrieval
- The strong performance of the proposed visual n-gram models extends to caption retrieval.
Combining NLP n-gram concepts with CV image features extracted by a CNN model, for image captioning and image retrieval.
Reference
[2017 ICCV] [Visual N-Grams]
Learning Visual N-Grams from Web Data
Image Captioning
2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell] 2017 [Visual N-Grams]