Brief Review — Learning Visual Representations with Caption Annotations
ICMLM, Pretraining Using Images & Captions for Image Classification
Learning Visual Representations with Caption Annotations
ICMLM, by NAVER LABS Europe
2020 ECCV, Over 50 Citations (Sik-Ho Tsang @ Medium)
Image Captioning, Image Classification, Weakly Supervised, Object Detection
- Image-Conditioned Masked Language Modeling (ICMLM) is proposed to learn visual representations from image–caption pairs, a form of weakly supervised learning.
- During pretraining, ICMLM predicts masked words in captions by relying on visual cues.
Outline
- Image-Conditioned Masked Language Modeling (ICMLM)
- Results
1. Image-Conditioned Masked Language Modeling (ICMLM)
1.1. Notations
- Dataset D = {(I_i, c_i)}, for i = 1, …, N, is composed of N image–caption pairs.
- O = {o_k}, for k = 1, …, K, is the set of concepts to be recognized in images. A binary label vector y denotes the presence of concepts in an image: y_k = 1 if concept o_k appears, otherwise y_k = 0.
- Two parametric functions Φ and Ψ respectively embed images and text, i.e. I → X and c → W.
- Only Φ, a CNN producing visual representations, is trained; the pretrained language model Ψ is kept frozen during training.
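Below is a minimal sketch of these two functions, assuming PyTorch with torchvision and HuggingFace Transformers; the ResNet50 backbone slicing and the tokenization details are illustrative choices, not the authors' exact setup.

```python
# Minimal sketch of Φ (trainable CNN) and Ψ (frozen BERT); backbone choice
# and feature-extraction details are assumptions for illustration.
import torch
import torchvision
from transformers import BertModel, BertTokenizer

# Φ: a CNN producing spatial visual features X (trained).
resnet = torchvision.models.resnet50(weights=None)
phi = torch.nn.Sequential(*list(resnet.children())[:-2])  # keep the 7x7 feature map

# Ψ: a pretrained language model producing token features W (frozen).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
psi = BertModel.from_pretrained("bert-base-uncased")
for p in psi.parameters():
    p.requires_grad = False

image = torch.randn(1, 3, 224, 224)                               # I
caption = tokenizer("a dog chasing a ball", return_tensors="pt")  # c
X = phi(image)                                 # (1, 2048, 7, 7) visual features
W = psi(**caption).last_hidden_state           # (1, T, 768) token features
```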
1.2. ICMLM Modules
- (1) A CNN to extract visual features X; (2) a language model to extract token features W; (3), (4) and (5) respectively correspond to the proposed tfm, att + fc and tp modules.
- The TP* (TPPostag, TPCluster), ICMLMtfm and ICMLMatt-fc models combine these modules as (1)+(5), (1)+(2)+(3) and (1)+(2)+(4), respectively. These are the different approaches tried by the authors.
- ICMLMatt-fc achieves the best results almost all the time, so the other approaches are only described briefly.
1.3. Capturing Image-Level Semantics (TP*)
1.3.1. TPPostag
- An off-the-shelf language parser [28] is used to determine part-of-speech (POS) tags of the tokens in captions, and to gather 3 label sets of size K: (i) only nouns, (ii) nouns and adjectives, and (iii) nouns, adjectives and verbs. These are used to train 3 separate TPPostag models.
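As a concrete illustration of how such label sets can be built, the snippet below uses spaCy as a stand-in for the parser [28] cited in the paper; the vocabulary size K and the lemmatization step are assumptions.

```python
# Illustrative only: building POS-based label sets from captions.
# spaCy stands in for the parser [28] cited in the paper.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
captions = ["a brown dog is chasing a red ball", "two people ride small horses"]

kept_pos = {"NOUN"}   # (i) nouns; add "ADJ" for (ii), "ADJ" and "VERB" for (iii)
counts = Counter(
    tok.lemma_.lower()
    for doc in nlp.pipe(captions)
    for tok in doc
    if tok.pos_ in kept_pos
)

# The K most frequent tokens form the label set; each caption is then
# converted to a binary target vector over this vocabulary.
K = 1000
label_set = [word for word, _ in counts.most_common(K)]
```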
1.3.2. TPCluster
- A pretrained BERTbase model is used to extract sentence-level caption representations.
- The sentence-level representations at the [CLS] token of all captions are clustered using the k-means algorithm, with hard cluster assignments.
- Φ is then trained to predict the cluster assignment of a caption from its associated image; a sketch of how the cluster targets are built is given below.
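A sketch of producing these cluster targets with HuggingFace Transformers and scikit-learn; the number of clusters here is arbitrary and only for illustration.

```python
# Sketch of the TPCluster targets: cluster [CLS] caption embeddings with k-means.
# The number of clusters is an assumption, not the paper's setting.
import torch
from transformers import BertModel, BertTokenizer
from sklearn.cluster import KMeans

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

captions = ["a brown dog is chasing a red ball", "two people ride small horses"]
with torch.no_grad():
    enc = tokenizer(captions, padding=True, return_tensors="pt")
    cls_embeddings = bert(**enc).last_hidden_state[:, 0]   # [CLS] sentence representations

kmeans = KMeans(n_clusters=2, n_init=10).fit(cls_embeddings.numpy())
cluster_ids = kmeans.labels_   # hard assignments, used as classification targets for Φ
```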
1.3.3. Loss Function for TP*
- The binary label vectors are normalized to sum to one, and the TP* models are trained by minimizing the categorical cross-entropy between the normalized labels and the predicted concept distribution (a reconstruction of the loss is given below).
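The loss equation is an image in the original post; the following LaTeX is a reconstruction of the standard categorical cross-entropy under this normalization, with ŷ denoting the model's predicted concept distribution (assumed notation).

```latex
% Categorical cross-entropy over the K concepts, with labels normalized to sum to one.
% \hat{y} is the predicted concept distribution computed from \Phi(I) (assumed notation).
\bar{y}_k = \frac{y_k}{\sum_{k'} y_{k'}}, \qquad
\ell_{tp}(\hat{y}, \bar{y}) = - \sum_{k=1}^{K} \bar{y}_k \log \hat{y}_k
```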
1.4. Capturing Localized Semantics (ICMLM)
1.4.1. ICMLMtfm
- X is spatially flattened, projected to the token embedding space, concatenated with W, and passed through a Transformer encoder module tfm to predict the masked token (a sketch follows).
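A rough sketch of the tfm path with assumed shapes; PyTorch's generic TransformerEncoder stands in for the paper's tfm module, and the depth and head count are assumptions.

```python
# Rough sketch of the tfm module; shapes, depth and head count are assumptions.
import torch
import torch.nn as nn

d_vis, d_tok, T = 2048, 768, 20                 # visual dim, token dim, caption length
X = torch.randn(1, d_vis, 7, 7)                 # CNN feature map
W = torch.randn(1, T, d_tok)                    # frozen BERT token features

proj = nn.Linear(d_vis, d_tok)                  # project visual features to token space
X_flat = X.flatten(2).transpose(1, 2)           # (1, 49, 2048): spatially flattened
X_tok = proj(X_flat)                            # (1, 49, 768)

tfm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_tok, nhead=8, batch_first=True),
    num_layers=2,                               # depth is an assumption
)
H = tfm(torch.cat([X_tok, W], dim=1))           # joint visual-textual sequence
masked_pred = H[:, 49 + 3]                      # e.g. output at a masked token position
```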
1.4.2. ICMLMatt-fc
- X and W are mapped to a common dz-dimensional space, and pairwise attention scores between visual and textual vectors are computed.
- To suppress the attention scores of vague tokens such as "about" or "through", the soft maximum of the textual attentions is computed for each visual feature.
- Attention probabilities are obtained by applying softmax and are used to pool X into a single visual feature x̂.
- x̂ is fed into the fc module.
- Finally, the output of the fc module is mapped to BERTbase's token vocabulary V, and prediction probabilities for the masked token are computed (a sketch of the whole att + fc path is given below).
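The attention equations are images in the original post; the sketch below reconstructs the att + fc path under assumed shapes, with a log-sum-exp standing in for the soft maximum over textual attentions (the paper's exact pooling and fc design may differ).

```python
# Rough sketch of the att + fc module; shapes, the log-sum-exp soft maximum,
# and the fc design are assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_vis, d_tok, d_z, T, V = 2048, 768, 512, 20, 30522
X = torch.randn(1, 49, d_vis)                   # flattened visual features
W = torch.randn(1, T, d_tok)                    # token features

to_z_vis = nn.Linear(d_vis, d_z)
to_z_txt = nn.Linear(d_tok, d_z)

Xz, Wz = to_z_vis(X), to_z_txt(W)               # map both to the common d_z space
scores = torch.einsum("bvd,btd->bvt", Xz, Wz)   # pairwise visual-textual attention scores
scores = scores / d_z ** 0.5

# Soft maximum over textual attentions for each visual feature
# (suppresses vague tokens; log-sum-exp is one way to realize a soft max).
vis_scores = torch.logsumexp(scores, dim=2)     # (1, 49)

attn = F.softmax(vis_scores, dim=1)             # attention probabilities over regions
x_hat = torch.einsum("bv,bvd->bd", attn, X)     # pool X into a single visual feature x̂

fc = nn.Sequential(nn.Linear(d_vis, d_tok), nn.ReLU(), nn.Linear(d_tok, d_tok))
logits = nn.Linear(d_tok, V)(fc(x_hat))         # map to BERT's token vocabulary
probs = F.softmax(logits, dim=-1)               # prediction probabilities for the masked token
```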
1.4.3. Loss Function for ICMLM
- The cross-entropy loss between the computed probability distribution over BERTbase's vocabulary and the label of the masked token tm is minimized (this is lmlm).
- ltp and lmlm are complementary and can be combined into a single training objective (a reconstruction is given below).
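The loss equations are again images in the original post; the LaTeX below is a reconstruction of the masked-token cross-entropy and an additive combination of the two losses (the weighting λ is an assumption).

```latex
% Masked language modeling loss: cross-entropy on the masked token t_m over BERT's
% vocabulary V, where p is the predicted distribution from the fc head (assumed notation).
\ell_{mlm} = - \log p\!\left(t_m \mid I, c_{\setminus m}\right)

% Combined objective; the weighting factor \lambda is an assumption, the paper
% combines \ell_{tp} and \ell_{mlm} in a similar additive fashion.
\ell = \ell_{mlm} + \lambda \, \ell_{tp}
```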
2. Results
- ICMLM* models significantly improve MTP scores compared to BERTbase model, showing that visual cues are useful for MLM tasks.
- ICMLMatt-fc has the best results almost all the time.
- The strong results of ImageNet pretraining are mostly due to its scale.
- Using VGG16 as the backbone, both ICMLMtfm and ICMLMatt-fc improve over all TP* baselines by significant margins.
- Using ResNet50 as the backbone, both ICMLMtfm and ICMLMatt-fc likewise improve over all TP* baselines by significant margins.
- Not only is the model able to detect possible concepts of interest, it can also understand which concept is asked about in the captions.
- ICMLM thus seeks a cheaper alternative to ground-truth labels for training visual representations.
Reference
[2020 ECCV] [ICMLM]
Learning Visual Representations with Caption Annotations
1.1. Image Classification
1989 … 2020 [ICMLM] … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP]