Brief Review — Learning Visual Representations with Caption Annotations

ICMLM, Pretraining Using Images & Captions for Image Classification

5 min read · Sep 20, 2022

**ICMLM: A proxy task to learn visual representations from scratch given image-caption pairs**

Learning Visual Representations with Caption Annotations
ICMLM, by NAVER LABS Europe
2020 ECCV, Over 50 Citations (Sik-Ho Tsang @ Medium)
Image Captioning, Image Classification, Weakly Supervised, Object Detection

Image-Conditioned Masked Language Modeling (ICMLM) is proposed to learn visual representations from image-caption pairs, which is a form of weakly supervised learning.
During pretraining, ICMLM predicts masked words in captions by relying on visual cues.

Outline

Image-Conditioned Masked Language Modeling (ICMLM)
Results

1. Image-Conditioned Masked Language Modeling (ICMLM)

1.1. Notations

The dataset D = {(I_i, c_i)}, with i from 1 to N, is composed of N image-caption pairs.
O = {o_k}, with k from 1 to K, is the set of concepts to be recognized in images. A binary label vector y denotes the presence of concepts in an image: y_k = 1 if concept o_k appears, otherwise y_k = 0.
Two parametric functions, Φ and Ψ, embed images and text respectively, i.e. they map an image I to visual features X and a caption c to token features W.

Of these two, only Φ, a CNN producing visual representations, is trained. Ψ is a pretrained language model that is kept frozen during training.

1.2. ICMLM Modules

**Modules used in ICMLM models** Trainable (and frozen) components are colored in blue (and black)

(1) A CNN to extract visual features X; (2) a language model to extract token features W; (3), (4) and (5) respectively correspond to the proposed tfm, att + fc and tp modules.
The TP* (TPPostag, TPCluster), ICMLMtfm and ICMLMatt-fc models combine these modules as (1)+(5), (1)+(2)+(3) and (1)+(2)+(4), respectively. These are the different approaches tried by the authors.
ICMLMatt-fc obtains the best results in almost all cases, so the other approaches are described only briefly.

1.3. Capturing Image-Level Semantics (TP*)

1.3.1. TPPostag

An off-the-shelf language parser [28] is used to determine the part-of-speech (POS) tags of tokens in captions. Three label sets of size K are gathered, containing (i) only nouns, (ii) nouns and adjectives, and (iii) nouns, adjectives and verbs. Each label set is used to train a separate TPPostag model.
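
As a rough illustration of how such label sets and binary label vectors could be built, here is a minimal sketch. It assumes the captions are already POS-tagged (the toy `tagged_captions` below are made up, not parser output) and that each label set keeps the K most frequent tokens. The frequency rule, the tag names and the helper names `build_label_set` and `binary_labels` are assumptions for illustration, not details stated above.

```python
from collections import Counter

# Hypothetical, hand-tagged toy captions; a real pipeline would get tags from a parser.
tagged_captions = [
    [("a", "DET"), ("black", "ADJ"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("runs", "VERB"), ("after", "ADP"),
     ("a", "DET"), ("ball", "NOUN")],
    [("a", "DET"), ("red", "ADJ"), ("ball", "NOUN"), ("on", "ADP"), ("grass", "NOUN")],
    [("a", "DET"), ("black", "ADJ"), ("cat", "NOUN"), ("runs", "VERB")],
]

def build_label_set(captions, allowed_tags, k):
    """Keep the k most frequent tokens whose tag is allowed (assumed selection rule)."""
    counts = Counter(tok for cap in captions for tok, tag in cap if tag in allowed_tags)
    # Sort by descending count, then alphabetically, so ties are broken deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]

def binary_labels(caption, label_set):
    """y_k = 1 if concept o_k appears in the caption, otherwise 0."""
    tokens = {tok for tok, _ in caption}
    return [1 if concept in tokens else 0 for concept in label_set]

K = 3
nouns = build_label_set(tagged_captions, {"NOUN"}, K)
nouns_adjs = build_label_set(tagged_captions, {"NOUN", "ADJ"}, K)
nouns_adjs_verbs = build_label_set(tagged_captions, {"NOUN", "ADJ", "VERB"}, K)

# Label vector of the first caption under the noun-only label set.
y_first = binary_labels(tagged_captions[0], nouns)
```

Each of the three label sets would then define the targets for one TPPostag model.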

1.3.2. TPCluster

A pretrained BERTbase model is used to extract sentence-level caption representations.
The sentence-level representations at the [CLS] token of all captions are clustered using the k-means algorithm, and hard cluster assignment is applied.
Φ is trained by learning to predict the cluster assignments of captions from their associated image:
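
The clustering step can be sketched as follows. This is a toy k-means in plain Python on made-up 2-D vectors standing in for the [CLS] features. The initialization from the first k points, the fixed iteration count and the function name `kmeans` are assumptions chosen to keep the example deterministic, not the setup used in the paper.

```python
def kmeans(points, k, iters=10):
    """Plain k-means; centroids start at the first k points (toy, deterministic choice)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Hard assignment: each point goes to its single nearest centroid.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centroids

# Made-up stand-ins for the sentence-level caption representations.
caption_feats = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
cluster_ids, centroids = kmeans(caption_feats, k=2)
```

The resulting `cluster_ids` play the role of class labels: the image paired with caption i is trained to predict `cluster_ids[i]`.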

1.3.3. Loss Function for TP*

The binary label vectors are normalized to sum up to one. Then models are trained by minimizing the categorical cross-entropy:

where:
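
A minimal sketch of this loss for a single image is given below. It assumes the model outputs one logit per concept and that the predicted distribution is a softmax over those K logits. The function names `softmax` and `tp_loss` are hypothetical, and the label vector is assumed to contain at least one positive entry.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tp_loss(logits, y):
    """Categorical cross-entropy against binary labels normalized to sum to one."""
    total = sum(y)                      # assumes at least one y_k equals 1
    target = [v / total for v in y]     # normalized label vector
    probs = softmax(logits)
    # Entries with zero target weight contribute nothing to the sum.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Two of three concepts present, uninformative logits.
loss = tp_loss([0.0, 0.0, 0.0], [1, 1, 0])
```

With uniform logits over three concepts, every concept gets probability 1/3, so the loss equals log 3 regardless of which concepts are present.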

1.4. Capturing Localized Semantics (ICMLM)

1.4.1. ICMLMtfm

X is spatially flattened and projected to the token embedding space, concatenated with W, and goes through a Transformer encoder module tfm.
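
The input construction described above can be sketched on toy data. The projection matrix `P`, the dimensions and the choice to place visual vectors before token vectors are all assumptions for illustration. The Transformer encoder itself is not implemented here.

```python
def flatten_spatial(X):
    """Turn an H x W grid of d_x-dim features into a list of H*W vectors."""
    return [vec for row in X for vec in row]

def project(vectors, P):
    """Multiply each d_x-dim vector by a d_x x d_w matrix P."""
    d_x, d_w = len(P), len(P[0])
    return [[sum(v[i] * P[i][j] for i in range(d_x)) for j in range(d_w)]
            for v in vectors]

# Toy visual features: H=2, W=2, d_x=2.
X = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 1.0], [2.0, 0.0]]]
# Hypothetical projection from d_x=2 to the token embedding size d_w=3.
P = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
# Toy token features for T=2 tokens, each of size d_w=3.
W_tok = [[0.5, 0.5, 0.5],
         [0.1, 0.2, 0.3]]

# Joint sequence of H*W visual vectors followed by T token vectors (order assumed).
sequence = project(flatten_spatial(X), P) + W_tok
```

The joint sequence has H*W + T entries of size d_w, which is what the tfm module would consume.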

1.4.2. ICMLMatt-fc

X and W are mapped to a common dz-dimensional space and then pairwise attention scores between visual and textual vectors are computed:

To suppress the attention scores of vague tokens such as “about” or “through”, the soft maximum of the textual attentions is computed for each visual feature:

Attention probabilities are obtained by applying softmax, and are used to pool X into a single visual feature x̂:

x̂ is fed into the fc module.
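
The attention steps of this subsection can be put together in a small sketch. It assumes plain linear projections `Px` and `Pw` into the common space, scaled dot-product scores, and log-sum-exp as the soft maximum. Any normalization layers are left out, and all names (`attend`, `x_hat`, `Px`, `Pw`) are hypothetical.

```python
import math

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def logsumexp(values):
    """Soft maximum: a smooth upper bound on max(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(values):
    z = logsumexp(values)
    return [math.exp(v - z) for v in values]

def attend(X, W_tok, Px, Pw):
    """Pool visual features X (n x d_x) using attention to token features W_tok (T x d_w)."""
    Xz = matmul(X, Px)       # visual vectors in the common d_z-dim space
    Wz = matmul(W_tok, Pw)   # token vectors in the common d_z-dim space
    d_z = len(Px[0])
    # Pairwise scores between every visual vector and every token vector.
    S = [[sum(a * b for a, b in zip(x, w)) / math.sqrt(d_z) for w in Wz] for x in Xz]
    # Soft maximum over tokens gives one score per visual feature.
    s = [logsumexp(row) for row in S]
    # Attention probabilities over the visual features.
    p = softmax(s)
    # Weighted sum of the original visual features gives the pooled feature.
    x_hat = [sum(p_i * x[j] for p_i, x in zip(p, X)) for j in range(len(X[0]))]
    return p, x_hat

# Toy example: two visual vectors, one token, identity projections.
X = [[1.0, 0.0], [0.0, 1.0]]
W_tok = [[1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
p, x_hat = attend(X, W_tok, X_proj := identity, identity)
```

In this toy case the first visual vector is aligned with the token, so it receives the larger attention weight and dominates the pooled feature `x_hat`.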
Finally, the output of the fc module is mapped to BERTbase’s token vocabulary V, and prediction probabilities are computed as follows:

where:

1.4.3. Loss Function for ICMLM

The cross-entropy loss between the predicted probability distribution over BERTbase’s vocabulary and the label of the masked token t_m is minimized:
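
The prediction and loss for one masked token can be sketched as follows. It assumes the fc output `h` is scored against each vocabulary entry by a dot product with a token embedding matrix `V_emb`, followed by a softmax. That mapping, the toy three-word vocabulary and the function name `masked_token_loss` are assumptions for illustration.

```python
import math

def masked_token_loss(h, V_emb, target_index):
    """Cross-entropy of the masked token given fc output h and vocabulary embeddings."""
    # One score per vocabulary entry (assumed: dot product with its embedding).
    logits = [sum(a * b for a, b in zip(h, v)) for v in V_emb]
    # Log-softmax computed in a numerically stable way.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]
    probs = [math.exp(lp) for lp in log_probs]
    # The loss is the negative log-probability of the true masked token.
    return -log_probs[target_index], probs

# Toy vocabulary of three tokens with 2-D embeddings; the masked token is entry 0.
h = [1.0, 0.0]
V_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss, probs = masked_token_loss(h, V_emb, target_index=0)
```

Because `h` is computed from the attention-pooled visual feature, lowering this loss requires the visual features to carry the information needed to recover the masked word.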