Brief Review — VinVL: Revisiting Visual Representations in Vision-Language Models

VinVL, Improving Visual Features by Better Models, Datasets & Pretraining

Sik-Ho Tsang
6 min read · Dec 10, 2022


First Ranking in Multiple Datasets at That Moment. (Figure from Microsoft Research Blog)

VinVL: Revisiting Visual Representations in Vision-Language Models,
by Microsoft Corporation and University of Washington
2021 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model, VLM, Image Captioning, VQA

  • VinVL proposes a bigger object detection model that is better designed for VL tasks. It is pre-trained on a much larger training corpus that combines multiple public annotated object detection datasets, so it can generate representations of a richer collection of visual objects and concepts.
  • While previous VL models leave the object detection model untouched, the authors show that visual features matter significantly in VL models.


  1. Motivations: Improving Vision in Vision Language
  2. Object Detection Pretraining
  3. OSCAR+ Pretraining
  4. Results

1. Motivations: Improving Vision in Vision Language

Architecture for vision-language tasks, with two modules, image encoding module and vision-language fusion module. (Figure from Microsoft Research Blog)
  • Deep learning-based VL models typically consist of two modules, an image understanding module {Vision} and a cross-modal understanding module {VL}:
      (q, v) = Vision(Img),   y = VL(w, q, v)
  • where Img and w are the inputs of the vision and language modalities, respectively. The output of the {Vision} module consists of q and v.
  • q is the semantic representation of the image, such as tags or detected objects, and v is the distributional representation of the image in a high-dimensional latent space, e.g., the box or region features produced by a VG-pre-trained Faster R-CNN model.
  • Most VL models use only the visual features v, while the OSCAR model shows that q can serve as anchors for learning better vision-language joint representations.
  • In VQA, w is a question and y is an answer to be predicted. In text-image retrieval, w is a sentence and y is the matching score of a sentence-image pair. In image captioning, w is not given and y is a caption to be generated.
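The two-module formulation above can be sketched as plain function composition. This is only an illustration of the data flow: the function names, the toy "one region per image row" behaviour, and the string output are all hypothetical stand-ins, not the actual detector or Transformer.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VisionOutput:
    q: List[str]          # semantic representation: tags / detected object names
    v: List[List[float]]  # distributional representation: one vector per region


def vision(img: Sequence[Sequence[float]]) -> VisionOutput:
    """Toy stand-in for the Vision module (an object detector in VinVL)."""
    # Hypothetical behaviour: treat each image row as one detected "region".
    tags = [f"object_{i}" for i in range(len(img))]
    feats = [list(row) for row in img]
    return VisionOutput(q=tags, v=feats)


def vl(w: str, q: List[str], v: List[List[float]]) -> str:
    """Toy stand-in for the VL fusion module."""
    # Hypothetical behaviour: only report what would be fused.
    return f"{len(w.split())} words, {len(q)} tags, {len(v)} regions"


def predict(img: Sequence[Sequence[float]], w: str) -> str:
    out = vision(img)            # (q, v) = Vision(Img)
    return vl(w, out.q, out.v)   # y = VL(w, q, v)
```

For VQA, w would be the question and the return value the predicted answer; for captioning, w would be empty and the return value the generated caption.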

In this work, authors focus on improving {Vision} for better visual representations, developed a new Object Detection (OD) model by enriching the visual object and attribute categories, enlarging the model size and training on a much larger OD dataset.

2. Object Detection Pretraining

2.1. Pretraining Dataset

The Vision pre-training datasets.
  • Four public datasets are used in object detection pre-training, including COCO, OpenImagesV5 (OI), Objects365V1, and Visual Genome (VG).
  1. Class-aware sampling is applied to OpenImages and Objects365 to get at least 2000 instances per class, resulting in 2.2M and 0.8M images, respectively.
  2. To balance the contribution of each dataset, the four datasets are merged with 8 copies of COCO (8×0.11M), 8 copies of VG (8×0.1M), 2 copies of class-aware sampled Objects365 (2×0.8M) and 1 copy of the class-aware sampled OpenImages (2.2M).
  3. The object vocabularies are unified.
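To see how the copy counts rebalance the datasets, the arithmetic can be worked through in a few lines. This is a small sketch using the rounded image counts quoted above; the dictionary and function names are illustrative only.

```python
# Rounded image counts (in millions) and copy counts, as quoted in the list above.
DATASET_MIX = {
    "COCO": (0.11, 8),
    "VG": (0.10, 8),
    "Objects365 (class-aware sampled)": (0.8, 2),
    "OpenImages (class-aware sampled)": (2.2, 1),
}


def effective_sizes(mix):
    """Images contributed by each dataset after duplication (in millions)."""
    return {name: size * copies for name, (size, copies) in mix.items()}


def mixing_ratios(mix):
    """Share of the merged corpus that each dataset occupies."""
    sizes = effective_sizes(mix)
    total = sum(sizes.values())
    return {name: s / total for name, s in sizes.items()}


if __name__ == "__main__":
    for name, ratio in mixing_ratios(DATASET_MIX).items():
        print(f"{name}: {ratio:.1%}")
```

With these rounded figures the merged corpus comes to about 5.48M images, and COCO and VG each end up within a factor of three of OpenImages instead of being roughly 20 times smaller.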

All VG classes that contain at least 30 instances are kept, giving 1594 VG classes. Together with 254 classes from the other three datasets, the unified vocabulary has 1848 classes in total.
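The vocabulary filter can be sketched as follows. This is a simplified illustration that assumes exact class-name matching between datasets and takes per-class VG instance counts as given; the function name and inputs are hypothetical, and the actual merging rules in the paper may differ in detail.

```python
def build_vocabulary(vg_instance_counts, other_classes, min_instances=30):
    """Keep frequent VG classes, then add classes unseen in VG.

    vg_instance_counts: dict mapping VG class name -> number of instances.
    other_classes: class names from the other datasets.
    """
    # Keep only VG classes with enough instances.
    vg_kept = sorted(c for c, n in vg_instance_counts.items() if n >= min_instances)
    # Assumption: a class from another dataset whose name already exists in VG
    # is treated as merged into that VG class, so only unmatched names are new.
    extra = sorted(set(other_classes) - set(vg_instance_counts))
    return vg_kept + extra
```

For example, with VG counts of `{"dog": 100, "cat": 29, "tree": 30}` and other classes `["dog", "skateboard", "cat"]`, the result is `["dog", "tree", "skateboard"]`.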

2.2. Model Architecture & Pretraining

Predictions from an X152-FPN model trained on OpenImages (Left) and X152-C4 model trained on four public object detection datasets (Right).

The ResNeXt-152 C4 (X152-C4) architecture is used.

  • The convolutional head used in C4 has a better inductive bias for encoding visual information than the MLP head of FPN.
  • As seen above, X152-C4 contains much richer semantics, such as richer visual concepts and attribute information, and the detected bounding boxes cover nearly all semantically meaningful regions.
  • An ImageNet-5K checkpoint is used for initialization. The first convolution layer, the first residual block, and all the batch-norm layers are frozen. The model is trained for 1.8M iterations with a batch size of 16 images.

2.3. Injecting Attribute Information into the Model

  • Following [2], an attribute branch is added to the pretrained OD model, and then fine-tuned on VG to inject attribute information (524 classes).

2.4. Efficient Region Feature Extractor for VL tasks

  • To speed up, the class-aware NMS is replaced with the class-agnostic NMS that only conducts the NMS operation once.
  • The time-consuming conv layers with dilation=2 used in [2] are replaced with conv layers without dilation.
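The difference between class-aware and class-agnostic NMS can be illustrated with a minimal pure-Python sketch. It assumes one label per box and a greedy IoU-based NMS with a hypothetical threshold of 0.5; it is not the detector's actual implementation. In a real detector with well over a thousand classes, the class-aware variant repeats the NMS pass for every class, which is where the saving comes from.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(indices, boxes, scores, iou_thresh=0.5):
    """Greedy NMS over the given box indices; returns the kept indices."""
    order = sorted(indices, key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def class_aware_nms(boxes, scores, labels, iou_thresh=0.5):
    """One NMS pass per class: boxes only suppress boxes of the same class."""
    keep = []
    for c in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == c]
        keep.extend(nms(members, boxes, scores, iou_thresh))
    return sorted(keep)


def class_agnostic_nms(boxes, scores, iou_thresh=0.5):
    """A single NMS pass over all boxes, ignoring class labels."""
    return sorted(nms(range(len(boxes)), boxes, scores, iou_thresh))
```

With two heavily overlapping boxes labelled "dog" and "cat", the class-aware variant keeps both, while the class-agnostic variant keeps only the higher-scoring one.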

Finally, this pre-trained OD model serves as the image understanding module {Vision}, as in the above equation.

3. OSCAR+ Pretraining

  • OSCAR+ models are trained to learn joint image-text representations, using image tags as anchors for image-text alignment.

3.1. Pretraining Corpus

  • The pretraining corpus is built on three types of existing vision and VL datasets:
  1. Image captioning datasets with human-annotated captions as w and machine-generated image tags as q, including COCO [24], Conceptual Captions (CC) [31], SBU captions [27] and Flickr30k [41].
  2. Visual QA datasets with questions as w and human-annotated answers as q, including GQA [12], VQA [8] and VG-QAs.
  3. Image tagging datasets with machine-generated captions as w and human-annotated tags as q, including a subset of OpenImages (1.67M images).

In total, the corpus contains 5.65 million unique images, 8.85 million text-tag-image triples.

3.2. Pretraining Objectives

The OSCAR+ pretraining loss has two terms:

      L_Pre-training = L_MTL + L_CL3

  • where L_MTL is the Masked Token Loss defined on the text modality (w and q), closely following OSCAR.
  • L_CL3 is a novel 3-way Contrastive Loss, different from the binary contrastive loss used in OSCAR.
  • L_CL3 takes into account two types of training samples x: the {caption, image-tags, image-features} triplets of the image captioning and image tagging data, and the {question, answer, image-features} triplets of the VQA data.

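Negative samples are created by polluting the inputs, i.e., replacing either w or q with text from another sample. A fully-connected (FC) layer is added on top of the [CLS] encoding of the triplet as a 3-way classifier f(.) to predict whether the triplet is matched (c=0), contains a polluted w (c=1), or contains a polluted q (c=2).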
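The 3-way classification target described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names are hypothetical, the uniform choice among the three cases is an arbitrary assumption (the sampling probabilities are not given here), and the classifier logits are taken as given rather than produced by a Transformer.

```python
import math
import random

MATCHED, POLLUTED_W, POLLUTED_Q = 0, 1, 2


def make_sample(triplets, i, rng):
    """Return ((w, q, v), c) built from triplets[i], possibly polluted."""
    w, q, v = triplets[i]
    # Assumption: the three cases are drawn uniformly.
    c = rng.choice([MATCHED, POLLUTED_W, POLLUTED_Q])
    if c != MATCHED:
        # Borrow text from a different, randomly chosen triplet.
        j = rng.choice([k for k in range(len(triplets)) if k != i])
        if c == POLLUTED_W:
            w = triplets[j][0]
        else:
            q = triplets[j][1]
    return (w, q, v), c


def three_way_loss(logits, c):
    """Cross-entropy -log p(c | logits) for a 3-way classifier."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[c]
```

Here `logits` stands for the output of f(.) on one triplet. An uninformed classifier that outputs equal logits pays a loss of log 3 regardless of the label.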

Effects of different pre-training contrastive losses on downstream tasks

The proposed 3-way contrastive loss transfers well to both tasks.

3.3. Pretraining Models

  • OSCAR+B and OSCAR+L are initialized with the parameters of BERT base (L=12, H=768, A=12) and BERT large (L=24, H=1024, A=16), respectively.
  • To ensure that the image region features have the same input embedding size as BERT, the position-augmented region features are transformed using a linear projection via matrix W. The sequence length of language tokens [w, q] and region features v are 35 and 50, respectively.
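The projection step can be sketched as below. This is a pure-Python illustration under stated assumptions: the 6-d normalised box encoding (corners plus width and height) and the helper names are hypothetical choices, and W would be a learned matrix in practice. Only the limit of 50 regions and the target size H come from the text above.

```python
def position_augmented(feature, box, img_w, img_h):
    """Append a 6-d normalised box encoding to a region feature vector."""
    x1, y1, x2, y2 = box
    pos = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
           (x2 - x1) / img_w, (y2 - y1) / img_h]
    return list(feature) + pos


def project(x, W):
    """Linear projection x @ W, with W given as a list of D rows of length H."""
    assert len(x) == len(W), "input size must match the number of rows in W"
    hidden = len(W[0])
    return [sum(x[d] * W[d][h] for d in range(len(x))) for h in range(hidden)]


def embed_regions(features, boxes, img_w, img_h, W, max_regions=50):
    """Project up to max_regions position-augmented regions to BERT's size."""
    pairs = list(zip(features, boxes))[:max_regions]
    return [project(position_augmented(f, b, img_w, img_h), W) for f, b in pairs]
```

Each returned vector has length H (768 for OSCAR+B, 1024 for OSCAR+L), so the region embeddings can be concatenated with the token embeddings of [w, q] as one input sequence.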

4. Results

Uniform improvements on 7 VL tasks by replacing visual features from [2] with proposed approach.

Replacing the visual features from [2] with those of the proposed OD model gives much better results on a wide range of VL tasks.

An overall comparison with SoTAs on seven tasks.

VinVL outperforms previous SoTA models on all tasks, often by a large margin.

  • (There are other results, please read the paper if interested.)
  • VinVL improves visual features in VLM by better models, datasets & pretraining.


