Brief Review — Learning Visual Representations with Caption Annotations

ICMLM, Pretraining Using Images & Captions for Image Classification

Sik-Ho Tsang
5 min read · Sep 20, 2022
ICMLM: A proxy task to learn visual representations from scratch given image-caption pairs

Learning Visual Representations with Caption Annotations, by NAVER LABS Europe
2020 ECCV, Over 50 Citations (Sik-Ho Tsang @ Medium)
Image Captioning, Image Classification, Weakly Supervised, Object Detection

  • Image-Conditioned Masked Language Modeling (ICMLM) is proposed to learn visual representations from image-caption pairs, which is a form of weakly supervised learning.
  • During pretraining, ICMLM predicts masked words in captions by relying on visual cues.


  1. Image-Conditioned Masked Language Modeling (ICMLM)
  2. Results

1. Image-Conditioned Masked Language Modeling (ICMLM)

1.1. Notations

  • The dataset D={(Ii, ci)}, with i from 1 to N, is composed of N image-caption pairs.
  • O={ok}, with k from 1 to K, is the set of concepts to be recognized in images. A binary label vector y denotes the presence of concepts in an image: yk=1 if concept ok appears, otherwise yk=0.
  • Two parametric functions Φ and Ψ respectively embed images and text, i.e. Φ maps I to X and Ψ maps c to W.

Only Φ, a CNN producing visual representations, is trained. Ψ is a pretrained language model that is kept frozen during training.

1.2. ICMLM Modules

Modules used in ICMLM models. Trainable (and frozen) components are colored in blue (and black).
  • (1) A CNN to extract visual features X; (2) a language model to extract token features W; (3), (4) and (5) respectively correspond to the proposed tfm, att + fc and tp modules.
  • The TP* (TPPostag, TPCluster), ICMLMtfm and ICMLMatt-fc models combine these modules: (1)+(5), (1)+(2)+(3) and (1)+(2)+(4), respectively. The authors try all of these approaches.
  • ICMLMatt-fc has the best results almost all the time, so the other approaches are described only briefly.

1.3. Capturing Image-Level Semantics (TP*)

1.3.1. TPPostag

  • An off-the-shelf language parser [28] is used to determine the part-of-speech (POS) tags of tokens in captions. Three label sets of size K are gathered, containing (i) only nouns, (ii) nouns and adjectives, and (iii) nouns, adjectives and verbs, and are used to train 3 separate TPPostag models.
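
The label-set construction can be sketched as follows. This is only a minimal illustration, not the authors' code: the captions are assumed to be already POS-tagged (the tag names and the toy data are hypothetical), and picking the K most frequent tokens is an assumption made here, since the post does not say how the K concepts are chosen.

```python
from collections import Counter

# Hypothetical pre-tagged captions: (token, POS tag) pairs, as a parser might emit.
TAGGED_CAPTIONS = [
    [("a", "DET"), ("black", "ADJ"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("catches", "VERB"),
     ("a", "DET"), ("red", "ADJ"), ("ball", "NOUN")],
]

def build_label_set(tagged_captions, allowed_tags, k):
    """Return up to k tokens with an allowed POS tag, most frequent first.

    Assumption: concepts are selected by frequency.
    """
    counts = Counter(
        token
        for caption in tagged_captions
        for token, tag in caption
        if tag in allowed_tags
    )
    return [token for token, _ in counts.most_common(k)]

def binary_labels(tagged_caption, label_set):
    """Binary vector y with y[k] = 1 if the k-th concept occurs in the caption."""
    present = {token for token, _ in tagged_caption}
    return [1 if concept in present else 0 for concept in label_set]

# Two of the three label sets: (i) nouns only, (ii) nouns and adjectives.
nouns = build_label_set(TAGGED_CAPTIONS, {"NOUN"}, k=2)
nouns_adjs = build_label_set(TAGGED_CAPTIONS, {"NOUN", "ADJ"}, k=4)
y_first_caption = binary_labels(TAGGED_CAPTIONS[0], nouns)
```

Each label set then defines its own multi-label target per image, which is why three separate TPPostag models are trained.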

1.3.2. TPCluster

  • A pretrained BERTbase model is used to extract sentence-level caption representations.
  • The sentence-level representations at the [CLS] token of all captions are clustered using the k-means algorithm, and hard cluster assignment is applied.
  • Φ is trained by learning to predict the cluster assignments of captions from their associated images.
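
As a rough illustration of the clustering step, the sketch below runs plain k-means (Lloyd iterations) on toy 2-D vectors that stand in for the [CLS] caption embeddings. The initial centroids, the toy data and the function names are all assumptions for the example. The resulting hard cluster indices are what Φ would be trained to predict from the image.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Hard assignment: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(len(old))]
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Toy 2-D stand-ins for caption embeddings (BERT-base embeddings are 768-D).
caption_embeddings = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centroids, cluster_ids = kmeans(caption_embeddings, centroids=[[0.0, 0.0], [1.0, 1.0]])
```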

1.3.3. Loss Function for TP*

  • The binary label vectors are normalized to sum up to one. The models are then trained by minimizing the categorical cross-entropy between these normalized labels and the predicted probabilities.
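
A minimal sketch of this loss for a single image is given below. It assumes the predicted probabilities come from a softmax over K logits, and it returns zero loss when no concept is present. Both are assumptions for illustration, not details confirmed by the post.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tp_loss(logits, binary_labels):
    """Categorical cross-entropy between sum-normalized labels and softmax(logits)."""
    total = sum(binary_labels)
    if total == 0:
        return 0.0  # assumption: an image with no labeled concept adds no loss
    targets = [y / total for y in binary_labels]
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)

# Two of three concepts present, uniform logits: the loss equals log(3).
example_loss = tp_loss([0.0, 0.0, 0.0], [1, 1, 0])
```

Normalizing the labels turns a multi-label target into a distribution, so a single softmax output can be trained against it.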

1.4. Capturing Localized Semantics (ICMLM)

1.4.1. ICMLMtfm

  • X is spatially flattened and projected to the token embedding space, concatenated with W, and passed through a Transformer encoder module tfm.
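
The input preparation for tfm can be sketched as below. The projection weights, the helper names and the order of concatenation (visual tokens first) are assumptions for illustration. The Transformer encoder itself is omitted.

```python
def flatten_spatial(feature_map):
    """Turn an H x W grid of d-dim vectors into a list of H*W vectors."""
    return [vec for row in feature_map for vec in row]

def project(vec, weight):
    """Linear map without bias; weight has one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def build_tfm_input(feature_map, token_embeddings, weight):
    """Sequence fed to the encoder: projected visual tokens, then text tokens."""
    visual_tokens = [project(v, weight) for v in flatten_spatial(feature_map)]
    return visual_tokens + token_embeddings  # assumed order: visual first

# Toy sizes: a 2x2 feature map with 3 channels, 3 caption tokens of dimension 2.
X = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
     [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]
W_tokens = [[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]]
projection = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # hypothetical 2x3 weight
sequence = build_tfm_input(X, W_tokens, projection)
```

The resulting sequence has H*W + T entries, all in the token embedding space, so self-attention can mix visual and textual positions freely.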

1.4.2. ICMLMatt-fc

  • X and W are mapped to a common dz-dimensional space, and pairwise attention scores between the visual and textual vectors are then computed.
  • To suppress the attention scores of vague tokens such as “about” or “through”, a soft maximum of the textual attentions is computed for each visual feature.
  • Attention probabilities are obtained by applying softmax, and are used to pool X into a single visual feature ^x.
  • ^x is fed into the fc module.
  • Finally, the output of the fc module is mapped to BERTbase’s token vocabulary V, and prediction probabilities are computed over V.
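
The attention pooling steps above can be sketched as follows. Two choices here are assumptions rather than details taken from the post: the pairwise scores are scaled dot products (divided by the square root of dz), and the "soft maximum" is implemented as log-sum-exp over the tokens. The projections into the common space are taken as given inputs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(values):
    lse = logsumexp(values)
    return [math.exp(v - lse) for v in values]

def attention_pool(visual_feats, z_visual, z_text):
    """Pool visual features into one vector using text-conditioned attention.

    visual_feats: n visual vectors (the rows of the flattened X)
    z_visual:     the same n locations projected to the common d_z space
    z_text:       t token vectors projected to the common d_z space
    """
    scale = math.sqrt(len(z_visual[0]))  # assumption: scaled dot-product scores
    scores = [[dot(zv, zt) / scale for zt in z_text] for zv in z_visual]
    # Soft maximum over tokens for each visual location (assumed log-sum-exp).
    per_location = [logsumexp(row) for row in scores]
    probs = softmax(per_location)  # attention over visual locations
    dim = len(visual_feats[0])
    pooled = [sum(p * x[d] for p, x in zip(probs, visual_feats)) for d in range(dim)]
    return pooled, probs

# Toy example: the first location matches the first token strongly.
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pooled, probs = attention_pool(
    feats,
    z_visual=[[4.0, 0.0], [0.0, 0.0]],
    z_text=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the soft maximum is dominated by the highest-scoring token at each location, tokens that match nothing in the image contribute little to where the model attends.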

1.4.3. Loss Function for ICMLM

  • The cross-entropy loss between the predicted probability distribution over BERTbase’s vocabulary and the label of the masked token tm is minimized.
  • ltp and lmlm are complementary, and can be combined into a single training objective.
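
As a small sketch, the masked-token loss for one example is the negative log-probability of the true token id under a softmax over the vocabulary. The combination below is a weighted sum, where `tp_weight` is a hypothetical parameter introduced for illustration. The post does not state how the two losses are weighted.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mlm_loss(vocab_logits, masked_token_id):
    """Cross-entropy for one masked token: -log p(true token)."""
    return -log_softmax(vocab_logits)[masked_token_id]

def combined_loss(l_mlm, l_tp, tp_weight=1.0):
    """Assumed form of the combination: l_mlm + tp_weight * l_tp."""
    return l_mlm + tp_weight * l_tp

# Uniform logits over a 4-word toy vocabulary give a loss of log(4).
uniform_loss = mlm_loss([0.0, 0.0, 0.0, 0.0], masked_token_id=2)
total = combined_loss(uniform_loss, l_tp=0.5)
```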

2. Results

Proxy vs. target task performances
  • ICMLM* models significantly improve MTP scores compared to the BERTbase model, showing that visual cues are useful for MLM tasks.

ICMLMatt-fc has the best results almost all the time.

Fully-, weakly- and self-supervised methods trained with VGG16 backbones
  • The good results of ImageNet are mostly due to its scale.

Using VGG16 as backbone, both ICMLMtfm and ICMLMatt-fc improve over all TP* baselines by significant margins.

Fully- and weakly-supervised methods trained with ResNet50 backbones

Using ResNet50 as backbone, both ICMLMtfm and ICMLMatt-fc improve over all TP* baselines by significant margins.

Attention maps for masked tokens produced by ICMLMtfm model with ResNet50 backbone trained on COCO

Not only is the model able to detect possible concepts of interest, but it can also understand which concept is asked for in the captions.

ICMLM seeks a cheaper alternative to ground-truth labels for training visual representations.


[2020 ECCV] [ICMLM]
Learning Visual Representations with Caption Annotations
