Review — VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

VLMo, VLM or Foundation Model, Using Multiway Transformer, Inspired by MoE and Switch Transformer

Sik-Ho Tsang
7 min read · May 26


VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts,
VLMo, by Harbin Institute of Technology, and Microsoft Corporation,
2022 NeurIPS, Over 90 Citations (Sik-Ho Tsang @ Medium)

Vision Language Model (VLM) / Foundation Model
2017 … 2021
[CLIP] [VinVL] [ALIGN] [VirTex] [ALBEF] [Conceptual 12M (CC12M)] 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] 2023 [GPT-4]

  • A unified Vision-Language pretrained Model (VLMo) is proposed, which jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
  • Within VLMo, Multiway Transformer is proposed, where each block contains a pool of modality-specific experts and a shared self-attention layer.
  • The pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval.


  1. VLMo Model Architecture
  2. VLMo Pretraining
  3. VLMo Fine-Tuning
  4. Results

1. VLMo Model Architecture

  • Given an image-text pair, the pair is encoded into image, text and image-text vector representations. These representations are then fed into the Multiway Transformer to learn contextualized representations and align image and text feature vectors.

1.1. Image Representations

  • Following ViT, the 2D image v is split and reshaped into N patches v^p.
  • The image patches are then flattened into vectors and are linearly projected to obtain patch embeddings.
  • A learnable special token [I_CLS] is prepended to the sequence.

Finally, the image input representations are obtained by summing the patch embeddings, learnable 1D position embeddings V_pos and an image type embedding V_type.
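Written out, this sum has the form below. It is a reconstruction from the definitions above, assuming the paper's notation: V is the linear patch projection, D is the hidden size, and an H×W image cut into P×P patches gives N = HW/P² patches. V, D, H, W and P are not defined in the text above and are assumptions here.

```latex
H^{v}_{0} = \left[\, v_{[\mathrm{I\_CLS}]},\; V v^{p}_{1},\; \ldots,\; V v^{p}_{N} \,\right] + V_{pos} + V_{type},
\qquad H^{v}_{0} \in \mathbb{R}^{(N+1) \times D},
\qquad N = \frac{HW}{P^{2}}
```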

1.2. Text Representations

  • Following BERT, text is tokenized to subword units by WordPiece [49].
  • A start-of-sequence token ([T_CLS]) and a special boundary token ([T_SEP]) are added to the text sequence.

Text input representations H^w_0 are computed by summing the corresponding word embedding, text position embedding and text type embedding.
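A reconstruction of this sum, mirroring the image case. The symbols M (number of subword tokens), T_pos and T_type are assumed names for the quantities described above, not taken from the text.

```latex
H^{w}_{0} = \left[\, w_{[\mathrm{T\_CLS}]},\; w_{1},\; \ldots,\; w_{M},\; w_{[\mathrm{T\_SEP}]} \,\right] + T_{pos} + T_{type}
```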

1.3. Image-Text Representations

Image and text input vectors are concatenated to form the image-text input representations.
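In symbols, the concatenation can be sketched as follows. The ordering (text first, then image) is an assumption based on the paper; the text above does not specify it. H^w_0 and H^v_0 are the text and image input representations of Sections 1.2 and 1.1.

```latex
H^{vl}_{0} = \left[\, H^{w}_{0} \,;\; H^{v}_{0} \,\right]
```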

1.4. Multiway Transformer

Overview of VLMo pre-training.
  • Inspired by mixture-of-experts networks (MoE and Switch Transformer), a general-purpose multimodal Transformer is proposed for vision-language tasks, namely Multiway Transformer, to encode different modalities.
  • Each modality expert is a feed-forward network (FFN) consisting of two linear transformations and an activation.
  • Given the previous layer's output vectors H_{l−1}, l ∈ [1, L], each Multiway Transformer block captures modality-specific information by switching to a different modality expert, and employs multi-head self-attention (MSA) shared across modalities to align visual and linguistic content. LN is short for Layer Normalization.
  • Multiway-FFN selects one expert among the modality experts to process the input, according to the modality of the input vectors and the index of the Transformer layer.
  • There are three modality experts: vision expert (V-FFN), language expert (L-FFN) and vision-language expert (VL-FFN), as shown in the figure above.
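The block computation described in the bullets above can be written as two residual updates. This is a reconstruction assuming the pre-LN arrangement (LN applied before MSA and before the FFN); H'_l is an assumed name for the intermediate output of the attention sub-layer.

```latex
H'_{l} = \mathrm{MSA}\big(\mathrm{LN}(H_{l-1})\big) + H_{l-1} \\
H_{l} = \text{Multiway-FFN}\big(\mathrm{LN}(H'_{l})\big) + H'_{l}
```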

If the input is image-only or text-only vectors, vision expert is used for encoding images and language expert for encoding text.

If the input consists of vectors of multiple modalities, such as the vectors of image-text pair, vision expert and language expert are employed to encode the respective modality vectors at the bottom Transformer layers. Vision-language expert is then used at the top layers to capture more modality interaction.
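The routing rule in the two paragraphs above can be sketched as a small function. This is an illustrative sketch, not the authors' code: `select_expert`, `fused_input` and `num_vl_layers` are hypothetical names, and layers are numbered from 1 (bottom) to `num_layers` (top).

```python
def select_expert(token_modality: str, layer: int, num_layers: int,
                  num_vl_layers: int, fused_input: bool) -> str:
    """Return which FFN expert processes a token at a given layer.

    token_modality: "image" or "text".
    fused_input: True when the sequence is a concatenated image-text pair.
    num_vl_layers: how many top layers use the vision-language expert.
    """
    if token_modality not in ("image", "text"):
        raise ValueError(f"unknown modality: {token_modality!r}")
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must be in [1, num_layers]")

    # For image-text pairs, the top layers switch every token to VL-FFN.
    if fused_input and layer > num_layers - num_vl_layers:
        return "VL-FFN"
    # Otherwise each token goes to the expert of its own modality.
    return "V-FFN" if token_modality == "image" else "L-FFN"


# Example: 12 layers with the vision-language expert on the top two.
for layer in (10, 11, 12):
    print(layer,
          select_expert("image", layer, 12, 2, fused_input=True),
          select_expert("text", layer, 12, 2, fused_input=True))
```

Because the self-attention layer is shared, only the FFN choice differs between the dual-encoder path (`fused_input=False`) and the fusion-encoder path (`fused_input=True`).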

1.5. Model Variants

VLMo-Base is a 12-layer Transformer network with 768 hidden size and 12 attention heads.

VLMo-Large is a 24-layer Transformer network with 1024 hidden size and 16 attention heads.

  • VLMo-Base uses vision-language expert on the top two Transformer layers, and VLMo-Large introduces vision-language expert on the top three Transformer layers.
  • VLMo-Base consists of 175M parameters and VLMo-Large contains 562M parameters.

2. VLMo Pretraining

Stagewise pre-training using image-only and text-only corpora.
  • VLMo is jointly pretrained with image-text contrastive learning on the image and text representations, and with masked language modeling and image-text matching on the image-text pair representations, all with shared parameters.

2.1. Image-Text Contrast

  • Given a batch of N image-text pairs, image-text contrastive learning aims to predict the matched pairs from N×N possible image-text pairs. There are N²−N negative image-text pairs within a training batch.
  • The final output vectors of [I_CLS] token and [T_CLS] token are used as the aggregated representation of the image and text, respectively.
  • After a linear projection and normalization, image vectors {ĥ^v_i} and text vectors {ĥ^w_i} are obtained for the training batch and used to compute image-to-text and text-to-image similarities.
  • p^{i2t}_i and p^{t2i}_i are the softmax-normalized similarities.
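The contrastive objective in the bullets above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the fixed temperature of 0.07, the averaging of the two directions and all function names are assumptions. For a batch of N = 3 pairs there are 3² − 3 = 6 negative pairs.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Row i of image_vecs and row i of text_vecs form the matched pair.
    """
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled dot product of image i and text j.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    loss_i2t = 0.0
    loss_t2i = 0.0
    for i in range(n):
        p_i2t = softmax(sim[i])                          # image i vs. all texts
        p_t2i = softmax([sim[j][i] for j in range(n)])   # text i vs. all images
        loss_i2t -= math.log(p_i2t[i])
        loss_t2i -= math.log(p_t2i[i])
    return (loss_i2t + loss_t2i) / (2 * n)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is low when each image is most similar to its own text and high when the pairing is swapped.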

2.2. Masked Language Modeling

  • Following BERT, tokens are randomly chosen in the text sequence and replaced with the [MASK] token. A 15% masking probability is used.
  • The final output vectors of masked tokens are fed into a classifier over the whole text vocabulary with cross-entropy loss.
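The masking step can be sketched as below. This is a simplified illustration: it assumes each non-special token is masked independently with the stated probability and always replaced by [MASK]; `mask_tokens` and `SPECIAL_TOKENS` are hypothetical names.

```python
import random

SPECIAL_TOKENS = {"[T_CLS]", "[T_SEP]"}  # assumed to be exempt from masking


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels are None where no loss is computed."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)   # target the vocabulary classifier must recover
        else:
            masked.append(tok)
            labels.append(None)  # position does not contribute to the loss
    return masked, labels


tokens = ["[T_CLS]", "a", "dog", "on", "the", "beach", "[T_SEP]"]
print(mask_tokens(tokens, rng=random.Random(0)))
```

Only positions with a non-None label are fed to the cross-entropy loss over the text vocabulary.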

2.3. Image-Text Matching

  • Image-text matching aims to predict whether the image and text are matched.
  • The final hidden vector of the [T_CLS] token is used to represent the image-text pair. This vector is fed into a classifier trained with cross-entropy loss for binary classification. Inspired by ALBEF, hard negative image-text pairs are sampled based on the contrastive image-to-text and text-to-image similarities.
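One way to sample hard negatives from the contrastive similarities is sketched below. It assumes, in the style of ALBEF, that a negative is drawn with probability proportional to its softmax-normalized similarity, with the true match excluded; `sample_hard_negative` is a hypothetical helper, not the authors' code.

```python
import random


def sample_hard_negative(similarities, positive_index, rng=None):
    """Draw one negative index, favoring candidates with high similarity.

    similarities: softmax-normalized scores of one image against every text
    in the batch (or one text against every image).
    """
    if len(similarities) < 2:
        raise ValueError("need at least one candidate besides the positive")
    rng = rng or random.Random()
    candidates = [j for j in range(len(similarities)) if j != positive_index]
    weights = [similarities[j] for j in candidates]
    if sum(weights) <= 0:
        # Degenerate case: no similarity mass on negatives, sample uniformly.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Image 0 matches text 0; text 1 is the most confusable negative.
print(sample_hard_negative([0.6, 0.3, 0.1], positive_index=0))
```

Negatives that the contrastive head already finds similar are picked more often, so the matching classifier sees harder examples than uniform sampling would give.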

2.4. Stagewise Pre-Training

A stagewise pre-training strategy is proposed, as shown above, which leverages large-scale image-only and text-only corpora to improve the vision-language model.

The model first undergoes vision pre-training on image-only data, and then language pre-training on text-only data, to learn general image and text representations.

The resulting model is then used to initialize vision-language pre-training, which learns the alignment of visual and linguistic information.

  • For vision pre-training, the attention module and vision expert of Multiway Transformer are trained as in BEiT.
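The stagewise schedule can be summarized as a table of which modules are updated in each stage. Only the first row is stated in the text above; the second row (language expert trained with attention and vision expert frozen) and the third row (all modules trained) are assumptions based on the paper's description, and the module and stage names are hypothetical.

```python
MODULES = ("self_attention", "vision_expert", "language_expert", "vl_expert")

# Modules that receive gradient updates in each stage.
STAGES = {
    "vision": {"self_attention", "vision_expert"},   # image-only data
    "language": {"language_expert"},                 # text-only data (assumed)
    "vision_language": set(MODULES),                 # image-text pairs (assumed)
}


def trainable_modules(stage):
    """Map each module name to True if it is trained in the given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return {name: name in STAGES[stage] for name in MODULES}


for stage in ("vision", "language", "vision_language"):
    print(stage, trainable_modules(stage))
```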

2.5. Pretraining Data

  • The pre-training data consists of four image captioning datasets: Conceptual Captions (CC) [40], SBU Captions [33], COCO [28] and Visual Genome (VG) [22]. There are about 4M images and 10M image-text pairs in the pre-training data.

3. VLMo Fine-Tuning

Fine-tuning VLMo on vision-language retrieval and classification tasks.

3.1. Vision-Language Classification

For classification tasks such as visual question answering and visual reasoning, VLMo is used as a fusion encoder to model modality interaction of images and text.

  • The final encoding vector of the token [T_CLS] is used as the representation of the image-text pair, and is fed to a task-specific classifier layer to predict the label.

3.2. Vision-Language Retrieval

For retrieval tasks, VLMo can be used as a dual encoder to encode images and text separately.

  • During fine-tuning, the model is optimized for the image-text contrastive loss.
  • During inference, the representations of all images and texts are computed once, and the dot product is then used to obtain image-to-text and text-to-image similarity scores for all possible image-text pairs. Separate encoding enables a much faster inference speed than fusion-encoder-based models.
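Dual-encoder retrieval at inference time can be sketched as follows, assuming the embeddings have already been produced by the image and text encoders. The function names are hypothetical. With A images and B texts, this needs A + B encoder passes plus cheap dot products, whereas a fusion encoder must run once per pair, that is A × B passes.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def retrieve(image_vecs, text_vecs, k=1):
    """Return (image_to_text, text_to_image) rankings from precomputed embeddings."""
    sim = [[dot(img, txt) for txt in text_vecs] for img in image_vecs]
    image_to_text = [top_k(row, k) for row in sim]
    text_to_image = [top_k([sim[i][j] for i in range(len(image_vecs))], k)
                     for j in range(len(text_vecs))]
    return image_to_text, text_to_image


images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
print(retrieve(images, texts, k=1))
```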

4. Results

4.1. Vision-Language Classification

Fine-tuning results of base-size and large-size VLMo on vision-language classification datasets.
  • VLMo-Large++: VLMo-Large further trained on one billion noisy web image-text pairs with a larger batch size.

VLMo achieves state-of-the-art performance and substantially outperforms previous methods. The proposed large-size model even outperforms SimVLM-Huge [48] and Florence-Huge [50] by a large margin, although both have more parameters and are trained on larger-scale image-text pairs.

  • The model uses a simple linear projection to embed images, as in ViLT, which leads to a significant speedup compared with previous models that use image region features.

4.2. Retrieval

Fine-tuning results of text-retrieval (TR) and image-retrieval (IR) on COCO and Flickr30K.

VLMo achieves competitive performance with previous fusion-encoder-based models while being much faster. The proposed large-size model even outperforms the huge-size Florence model [50], which is also trained on massive image-text pairs using a larger batch size.

4.3. Vision Tasks

Results on image classification and semantic segmentation.
  • VLMo is used as an image-only encoder.

The model also achieves competitive performance, even slightly better than the BEiT model used for the initialization of VLMo.

4.4. Ablation Studies

Ablation studies of stagewise pre-training.

Image-only pre-training plus text-only pre-training improves the vision-language model.

Ablation studies of Multiway Transformer and vision-language pre-training tasks.

Using Multiway Transformer achieves better performance than standard Transformer for both retrieval and classification tasks.

All pretraining tasks are important.

Global hard negative mining improves the model.
  • Hard negative mining from more candidates is performed by gathering training examples of all GPUs (named as global hard negative mining).

The global hard negative mining brings significant improvements.


