Review — VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

VLMo, VLM or Foundation Model, Using Multiway Transformer, Inspired by MoE and Switch Transformer

May 26, 2023

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts,
VLMo, by Harbin Institute of Technology, and Microsoft Corporation,
2022 NeurIPS, Over 90 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model (VLM) / Foundation Model
2017 … 2021 [CLIP] [VinVL] [ALIGN] [VirTex] [ALBEF] [Conceptual 12M (CC12M)] 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] 2023 [GPT-4]

A unified Vision-Language pretrained Model (VLMo) is proposed, which jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
Within VLMo, Multiway Transformer is proposed, where each block contains a pool of modality-specific experts and a shared self-attention layer.
The pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval.

Outline

VLMo Model Architecture
VLMo Pretraining
VLMo Fine-Tuning
Results

1. VLMo Model Architecture

Inputs

Given an image-text pair, the pair is encoded into image, text and image-text vector representations. These representations are then fed into the Multiway Transformer to learn contextualized representations and align image and text feature vectors.

1.1. Image Representations

Following ViT, the 2D image $v$ is split and reshaped into N patches $v^p$.
The image patches are then flattened into vectors and are linearly projected to obtain patch embeddings.
A learnable special token [I_CLS] is prepended to the sequence.

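Finally, the image input representation is obtained by summing the patch embeddings, learnable 1D position embeddings $V_{pos}$ and an image type embedding $V_{type}$, where $V$ denotes the linear patch projection:

$H^v_0 = [v_{\text{[I\_CLS]}}, V v^p_1, \ldots, V v^p_N] + V_{pos} + V_{type}$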
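
The steps of Section 1.1 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`patchify`, `linear`, `image_input`), the nested-list image layout and the toy weights are all assumptions made here for readability.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches


def linear(vec, weight):
    """Project vec with a weight matrix given as D rows of len(vec) entries."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def image_input(image, patch, proj, cls_token, pos_emb, type_emb):
    """Prepend [I_CLS], then add position and type embeddings to every token."""
    tokens = [cls_token] + [linear(p, proj) for p in patchify(image, patch)]
    assert len(pos_emb) == len(tokens), "one position embedding per token"
    return [
        [t + p + s for t, p, s in zip(token, pos, type_emb)]
        for token, pos in zip(tokens, pos_emb)
    ]


# Toy example (hypothetical values): 4x4 single-channel image, 2x2 patches,
# hidden size 2, so the sequence has 1 [I_CLS] token + 4 patch tokens.
image = [[[4 * y + x] for x in range(4)] for y in range(4)]
proj = [[1, 1, 1, 1], [1, 0, 0, 0]]
pos_emb = [[i, i] for i in range(5)]
h0 = image_input(image, 2, proj, [0, 0], pos_emb, [1, 1])
print(len(h0), h0[1])
```

In a real model the projection, the [I_CLS] vector and both embeddings would be learned parameters; here they are fixed numbers so the arithmetic can be followed by hand.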

1.2. Text Representations

Following BERT, text is tokenized to subword units by WordPiece [49].
A start-of-sequence token ([T_CLS]) and a special boundary token ([T_SEP]) are added to the text sequence.

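The text input representation $H^w_0$ is computed by summing the corresponding word embeddings, text position embeddings $T_{pos}$ and a text type embedding $T_{type}$, where M is the number of subword tokens:

$H^w_0 = [w_{\text{[T\_CLS]}}, w_1, \ldots, w_M, w_{\text{[T\_SEP]}}] + T_{pos} + T_{type}$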

1.3. Image-Text Representations

Image and text input vectors are concatenated to form the image-text input representation:

$H^{vl}_0 = [H^w_0; H^v_0]$
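
As a rough sketch of the sequence layout, the snippet below builds the joint input from symbolic tokens and records which modality each position belongs to. The helper name `build_image_text_input`, the use of strings in place of embedding vectors, and placing the text segment before the image segment are assumptions of this sketch.

```python
def build_image_text_input(text_tokens, image_patches):
    """Return the joint token sequence and a per-position modality tag."""
    text_seq = ["[T_CLS]"] + list(text_tokens) + ["[T_SEP]"]
    image_seq = ["[I_CLS]"] + list(image_patches)
    sequence = text_seq + image_seq  # text first, then image (assumed order)
    modality = ["text"] * len(text_seq) + ["image"] * len(image_seq)
    return sequence, modality


sequence, modality = build_image_text_input(["a", "dog"], ["p1", "p2", "p3"])
for token, kind in zip(sequence, modality):
    print(f"{token:8s} {kind}")
```

The modality tags are what a Multiway block would need in order to send each position to the matching expert.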

1.4. Multiway Transformer

Inspired by mixture-of-experts networks (MoE and Switch Transformer), a general-purpose multimodal Transformer is proposed for vision-language tasks, namely Multiway Transformer, to encode different modalities.
Each modality expert is a feed-forward network (FFN) consisting of two linear transformations and an activation.
Given the previous layer's output vectors $H_{l-1}$, $l \in [1, L]$, each Multiway Transformer block captures modality-specific information by switching to a different modality expert, and employs multi-head self-attention (MSA) shared across modalities to align visual and linguistic content. With LN denoting layer normalization:

$H'_l = \text{MSA}(\text{LN}(H_{l-1})) + H_{l-1}$
$H_l = \text{Multiway-FFN}(\text{LN}(H'_l)) + H'_l$

where Multiway-FFN selects an expert among multiple modality experts to process the input according to the modality of the input vectors H′l and the index of the Transformer layer.
There are three modality experts: vision expert (V-FFN), language expert (L-FFN) and vision-language expert (VL-FFN), as shown in the figure above.

If the input is image-only or text-only, the vision expert is used for encoding images and the language expert for encoding text.
If the input consists of vectors of multiple modalities, such as the vectors of an image-text pair, the vision expert and language expert encode their respective modality vectors at the bottom Transformer layers. The vision-language expert is then used at the top layers to capture more modality interaction.
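
The routing rule described above can be written as a small function. This is only a sketch of the selection logic: the name `select_expert`, its arguments and the 1-indexed layer convention are assumptions made here, and the experts are represented by their names rather than by actual networks.

```python
def select_expert(input_kind, token_modality, layer, num_layers, num_vl_layers):
    """Return the name of the FFN expert applied to one token at one layer.

    input_kind:     "image", "text" or "image-text" (what the whole input is)
    token_modality: "image" or "text" (which segment the token belongs to)
    layer:          1-indexed Transformer layer; 1 is the bottom layer
    num_vl_layers:  how many top layers use the vision-language expert
    """
    if input_kind == "image":
        return "V-FFN"
    if input_kind == "text":
        return "L-FFN"
    # Image-text pair: the top layers share the fused vision-language expert.
    if layer > num_layers - num_vl_layers:
        return "VL-FFN"
    # Bottom layers: each token goes to the expert of its own modality.
    return "V-FFN" if token_modality == "image" else "L-FFN"


# VLMo-Base per Section 1.5: 12 layers, VL expert on the top two layers.
for layer in (1, 10, 11, 12):
    print(
        layer,
        select_expert("image-text", "image", layer, 12, 2),
        select_expert("image-text", "text", layer, 12, 2),
    )
```

Because self-attention is shared and only the FFN is switched, the same weights can serve as an image encoder, a text encoder or a fusion encoder depending on which branch of this rule is taken.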

1.5. Model Variants

VLMo-Base consists of 12-layer Transformer blocks with 768 hidden size and 12 attention heads.
VLMo-Large is a 24-layer Transformer network with 1024 hidden size and 16 attention heads.

VLMo-Base uses vision-language expert on the top two Transformer layers, and VLMo-Large introduces vision-language expert on the top three Transformer layers.
VLMo-Base consists of 175M parameters and VLMo-Large contains 562M parameters.

2. VLMo Pretraining

**Stagewise pre-training using image-only and text-only corpora.**

VLMo is jointly pretrained with three tasks that share parameters: image-text contrastive learning on the image and text representations, and masked language modeling and image-text matching on the image-text pair representations.

2.1. Image-Text Contrast

Given a batch of N image-text pairs, image-text contrastive learning aims to predict the matched pairs from N×N possible image-text pairs. There are N²−N negative image-text pairs within a training batch.
The final output vectors of [I_CLS] token and [T_CLS] token are used as the aggregated representation of the image and text, respectively.
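After a linear projection and normalization, image vectors $\{\hat{h}^v_i\}_{i=1}^{N}$ and text vectors $\{\hat{h}^w_i\}_{i=1}^{N}$ are obtained in a training batch to compute image-to-text and text-to-image similarities:

$s^{i2t}_{i,j} = \hat{h}^{v\top}_i \hat{h}^w_j, \quad s^{t2i}_{i,j} = \hat{h}^{w\top}_i \hat{h}^v_j$

$p^{i2t}_i = \frac{\exp(s^{i2t}_{i,i}/\sigma)}{\sum_{j=1}^{N} \exp(s^{i2t}_{i,j}/\sigma)}, \quad p^{t2i}_i = \frac{\exp(s^{t2i}_{i,i}/\sigma)}{\sum_{j=1}^{N} \exp(s^{t2i}_{i,j}/\sigma)}$

$p^{i2t}_i$ and $p^{t2i}_i$ are the softmax-normalized similarities, and $\sigma$ is a learnable temperature.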
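
A minimal sketch of this objective in plain Python follows. It assumes a temperature-scaled softmax over dot products of L2-normalized vectors; the function names, the fixed temperature of 0.07 (learned in practice) and the choice to average the two directions are illustrative assumptions, not details taken from the paper's code.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Average image-to-text and text-to-image cross-entropy over a batch."""
    images = [l2_normalize(v) for v in image_vecs]
    texts = [l2_normalize(v) for v in text_vecs]
    n = len(images)
    # sim[i][j]: similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(images[i], texts[j])) for j in range(n)]
           for i in range(n)]
    loss = 0.0
    for i in range(n):
        p_i2t = softmax(sim[i], temperature)                         # image i vs. all texts
        p_t2i = softmax([sim[j][i] for j in range(n)], temperature)  # text i vs. all images
        loss += -math.log(p_i2t[i]) - math.log(p_t2i[i])
    return loss / (2 * n)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The matched pair sits on the diagonal of the similarity matrix, so the loss is small when each image is most similar to its own text and large when the pairing is swapped.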

2.2. Masked Language Modeling

Following BERT, tokens are randomly chosen in the text sequence, and replaced with the [MASK] token. 15% masking probability is used.
The final output vectors of masked tokens are fed into a classifier over the whole text vocabulary with cross-entropy loss.
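
The masking step can be sketched as follows. The helper `mask_tokens` is hypothetical, it always substitutes [MASK] as described above, and skipping the special tokens is an assumption of this sketch.

```python
import random

SPECIAL_TOKENS = {"[T_CLS]", "[T_SEP]"}


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each ordinary token with [MASK] with probability mask_prob.

    Returns the corrupted sequence and a label list holding the original
    token at masked positions and None everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if token not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)  # target the classifier must recover
        else:
            masked.append(token)
            labels.append(None)   # position is ignored by the loss
    return masked, labels


tokens = ["[T_CLS]", "a", "dog", "runs", "on", "the", "beach", "[T_SEP]"]
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(masked)
```

Only positions with a non-None label would contribute to the cross-entropy loss over the vocabulary.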

2.3. Image-Text Matching

Image-text matching aims to predict whether the image and text are matched.
The final hidden vector of the [T_CLS] token is used to represent the image-text pair, and this vector is fed into a classifier trained with cross-entropy loss for binary classification. Inspired by ALBEF, hard negative image-text pairs are sampled based on the contrastive image-to-text and text-to-image similarities.
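
One plausible way to sample such hard negatives is to draw, for each image, a non-matching text with probability that grows with its contrastive similarity. The sketch below does this with the standard library; the function name, the exponential weighting and the exclusion of the diagonal are assumptions here rather than the exact procedure of VLMo or ALBEF.

```python
import math
import random


def sample_hard_negatives(sim, rng=None):
    """For each row i of an N x N similarity matrix, sample one index j != i.

    Candidates that are more similar to the query get a higher probability,
    so the sampled negatives tend to be the confusing ones.
    """
    assert len(sim) >= 2, "at least one negative candidate is required"
    rng = rng or random.Random()
    negatives = []
    for i, row in enumerate(sim):
        peak = max(row)
        # Weight 0 for the matched pair so it can never be drawn as a negative.
        weights = [0.0 if j == i else math.exp(s - peak) for j, s in enumerate(row)]
        negatives.append(rng.choices(range(len(row)), weights=weights, k=1)[0])
    return negatives


# sim[i][j]: similarity between image i and text j (toy values).
sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.7],
    [0.3, 0.1, 0.9],
]
print(sample_hard_negatives(sim, random.Random(0)))
```

Passing the transposed matrix would give a hard negative image for each text in the same way.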

2.4. Stagewise Pre-Training

A stagewise pre-training strategy is proposed as shown above, which leverages large-scale image-only and text-only corpora to improve the vision-language model.
The model first undergoes vision pre-training on image-only data, and then language pre-training on text-only data, to learn general image and text representations.
The resulting model is then used to initialize vision-language pre-training, which learns the alignment of visual and linguistic information.

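For vision pre-training, the attention module and vision expert of the Multiway Transformer are trained as in BEiT. For language pre-training, the attention module and vision expert are frozen, and the language expert is optimized with masked language modeling on text-only data.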

2.5. Pretraining Data

The pre-training data consists of four image captioning datasets: Conceptual Captions (CC) [40], SBU Captions [33], COCO [28] and Visual Genome (VG) [22]. There are about 4M images and 10M image-text pairs in the pre-training data.

3. VLMo Fine-Tuning

**Fine-tuning VLMo on vision-language retrieval and classification tasks.**

3.1. Vision-Language Classification

For classification tasks such as visual question answering and visual reasoning, VLMo is used as a fusion encoder to model modality interaction of images and text.

The final encoding vector of the token [T_CLS] is used as the representation of the image-text pair, and is fed to a task-specific classifier layer to predict the label.

3.2. Vision-Language Retrieval

For retrieval tasks, VLMo can be used as a dual encoder to encode images and text separately.

During fine-tuning, the model is optimized for the image-text contrastive loss.
During inference, the representations of all images and text are computed, and then dot product is used to obtain image-to-text and text-to-image similarity scores of all possible image-text pairs. Separate encoding enables a much faster inference speed than fusion-encoder-based models.
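
The inference step amounts to a dot product followed by a sort. A minimal sketch, assuming the embeddings have already been computed and normalized by the two encoders (the function name `rank_texts` and the toy vectors are illustrative):

```python
def rank_texts(image_vec, text_vecs, top_k=3):
    """Return indices of the top_k texts, ordered by dot-product similarity."""
    scores = [sum(a * b for a, b in zip(image_vec, text)) for text in text_vecs]
    order = sorted(range(len(text_vecs)), key=lambda j: scores[j], reverse=True)
    return order[:top_k]


# Text embeddings are computed once and reused for every image query.
text_bank = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(rank_texts([1.0, 0.0], text_bank))
```

Because each image and each text is encoded only once, scoring all pairs costs one dot product per pair, whereas a fusion encoder would need a full forward pass for every pair.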

4. Results

4.1. Vision-Language Classification

**Fine-tuning results of base-size and large-size VLMo on vision-language classification datasets.**

VLMo-Large++ denotes VLMo-Large further trained on one billion noisy web image-text pairs with a larger batch size.

VLMo achieves state-of-the-art performance and substantially outperforms previous methods. The proposed large-size model outperforms SimVLM-Huge [48] and Florence-Huge [50] by a large margin, even though both have more parameters and are trained on larger-scale image-text pairs.

The model uses a simple linear projection to embed images as in ViLT which leads to a significant speedup compared with previous models using image region features.

4.2. Retrieval

**Fine-tuning results of text-retrieval (TR) and image-retrieval (IR) on COCO and Flickr30K.**

VLMo achieves performance competitive with previous fusion-encoder-based models while being much faster. The proposed large-size model even outperforms the huge-size model of Florence [50], which is also trained on massive image-text pairs using a larger batch size.

4.3. Vision Tasks

**Results on image classification and semantic segmentation.**

VLMo is used as an image-only encoder.

The model also achieves competitive performance, even slightly better than the BEiT model used for the initialization of VLMo.

4.4. Ablation Studies

**Ablation studies of stagewise pre-training.**

Image-only pre-training plus text-only pre-training improves the vision-language model.

**Ablation studies of Multiway Transformer and vision-language pre-training tasks.**

Using Multiway Transformer achieves better performance than standard Transformer for both retrieval and classification tasks.
All pretraining tasks are important.

**Global hard negative mining improves the model.**

Hard negative mining from more candidates is performed by gathering training examples of all GPUs (named as global hard negative mining).

The global hard negative mining brings significant improvements.

Written by Sik-Ho Tsang
