Review — VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
VLMo, a VLM / Foundation Model Using the Multiway Transformer, Inspired by MoE and the Switch Transformer
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts,
VLMo, by Harbin Institute of Technology, and Microsoft Corporation,
2022 NeurIPS, Over 90 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model (VLM) / Foundation Model
- A unified Vision-Language pretrained Model (VLMo) is proposed, which jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
- Within VLMo, Multiway Transformer is proposed, where each block contains a pool of modality-specific experts and a shared self-attention layer.
- The pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval.
Outline
- VLMo Model Architecture
- VLMo Pretraining
- VLMo Fine-Tuning
- Results
1. VLMo Model Architecture
- Given an image-text pair, the pair is encoded into image, text and image-text vector representations. These representations are then fed into the Multiway Transformer to learn contextualized representations and align image and text feature vectors.
1.1. Image Representations
- Following ViT, the 2D image v is split and reshaped into N patches v^p.
- The image patches are then flattened into vectors and are linearly projected to obtain patch embeddings.
- A learnable special token [I_CLS] is prepended to the sequence.
Finally, image input representations H^v_0 are obtained by summing the patch embeddings, learnable 1D position embeddings V_pos and the image type embedding V_type:
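A reconstruction of the corresponding formula in (roughly) the paper's notation, where V is the linear projection applied to each flattened patch:

H^v_0 = [v_{[\mathrm{I\_CLS}]}, V v^p_1, \ldots, V v^p_N] + V_{pos} + V_{type}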
1.2. Text Representations
- Following BERT, text is tokenized to subword units by WordPiece [49].
- A start-of-sequence token ([T_CLS]) and a special boundary token ([T_SEP]) are added to the text sequence.
Text input representations H^w_0 are computed by summing the corresponding word embeddings, text position embeddings and text type embeddings:
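Reconstructed in the same way, with T_pos and T_type denoting the text position and type embeddings:

H^w_0 = [w_{[\mathrm{T\_CLS}]}, w_1, \ldots, w_M, w_{[\mathrm{T\_SEP}]}] + T_{pos} + T_{type}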
1.3. Image-Text Representations
Image and text input vectors are concatenated to form the image-text input representations:
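That is, approximately:

H^{vl}_0 = [H^w_0; H^v_0]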
1.4. Multiway Transformer
- Inspired by mixture-of-experts networks (MoE and Switch Transformer), a general-purpose multimodal Transformer is proposed for vision-language tasks, namely Multiway Transformer, to encode different modalities.
- Each modality expert is a feed-forward network (FFN) that consists of two linear transformations with an activation in between.
- Given the previous layer’s output vectors H_{l−1}, l ∈ [1, L], each Multiway Transformer block captures modality-specific information by switching to a different modality expert, and employs multi-head self-attention (MSA), shared across modalities, to align visual and linguistic content. LN is short for Layer Normalization:
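Reconstructing the block computation from the description (the exact paper notation may differ slightly):

H'_l = \mathrm{MSA}(\mathrm{LN}(H_{l-1})) + H_{l-1}
H_l = \text{Multiway-FFN}(\mathrm{LN}(H'_l)) + H'_l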
- where Multiway-FFN selects one expert among the modality experts to process the input, according to the modality of the input vectors H'_l and the index of the Transformer layer.
- There are three modality experts: the vision expert (V-FFN), the language expert (L-FFN) and the vision-language expert (VL-FFN), as shown in the figure above.
If the input consists of image-only or text-only vectors, the vision expert is used for encoding images and the language expert for encoding text.
If the input consists of vectors of multiple modalities, such as those of an image-text pair, the vision expert and language expert are employed to encode the respective modality vectors at the bottom Transformer layers. The vision-language expert is then used at the top layers to capture more modality interaction.
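A minimal PyTorch-style sketch of one Multiway block (class and argument names are hypothetical, not the official implementation). For an image-text input at the lower layers, the image and text slices would each be routed through their own expert, which is omitted here for brevity:

```python
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Shared multi-head self-attention + per-modality FFN experts (sketch)."""
    def __init__(self, dim=768, heads=12, use_vl_expert=False):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.ln2 = nn.LayerNorm(dim)
        ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.experts = nn.ModuleDict({"v": ffn(), "l": ffn()})           # vision / language experts
        if use_vl_expert:                                                # only the top layers
            self.experts["vl"] = ffn()

    def forward(self, h, modality):
        # modality is "v" (image), "l" (text) or "vl" (fused image-text input at the top layers)
        x = self.ln1(h)
        h = h + self.attn(x, x, x, need_weights=False)[0]                # shared MSA + residual
        h = h + self.experts[modality](self.ln2(h))                      # modality expert + residual
        return h
```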
1.5. Model Variants
VLMo-Base consists of 12-layer Transformer blocks with 768 hidden size and 12 attention heads.
VLMo-Large is a 24-layer Transformer network with 1024 hidden size and 16 attention heads.
- VLMo-Base uses vision-language expert on the top two Transformer layers, and VLMo-Large introduces vision-language expert on the top three Transformer layers.
- VLMo-Base consists of 175M parameters and VLMo-Large contains 562M parameters.
2. VLMo Pretraining
- VLMo is jointly pretrained, with shared parameters, by image-text contrastive learning on the image and text representations, and by masked language modeling and image-text matching on the image-text pair representations.
2.1. Image-Text Contrast
- Given a batch of N image-text pairs, image-text contrastive learning aims to predict the matched pairs from N×N possible image-text pairs. There are N²−N negative image-text pairs within a training batch.
- The final output vectors of [I_CLS] token and [T_CLS] token are used as the aggregated representation of the image and text, respectively.
- After a linear projection and normalization, image vectors {ĥ^v_i} and text vectors {ĥ^w_i} are obtained in a training batch to compute image-to-text and text-to-image similarities:
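A reconstruction in the paper's spirit (σ denotes a learned temperature):

s^{i2t}_{i,j} = (\hat{h}^v_i)^{\top} \hat{h}^w_j, \qquad s^{t2i}_{i,j} = (\hat{h}^w_i)^{\top} \hat{h}^v_j
p^{i2t}_i = \frac{\exp(s^{i2t}_{i,i}/\sigma)}{\sum_{j=1}^{N} \exp(s^{i2t}_{i,j}/\sigma)}, and analogously for p^{t2i}_i.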
- p^{i2t}_i and p^{t2i}_i are the softmax-normalized image-to-text and text-to-image similarities.
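A compact sketch of this objective (function and tensor names are hypothetical): the matched pair sits on the diagonal, so the contrastive loss is a symmetric cross-entropy with arange labels.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_cls, txt_cls, temperature=0.07):
    """img_cls, txt_cls: projected [I_CLS]/[T_CLS] vectors of shape [N, D]."""
    img = F.normalize(img_cls, dim=-1)
    txt = F.normalize(txt_cls, dim=-1)
    logits_i2t = img @ txt.t() / temperature                  # [N, N] image-to-text similarities
    targets = torch.arange(img.size(0), device=img.device)    # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits_i2t, targets)
    loss_t2i = F.cross_entropy(logits_i2t.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```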
2.2. Masked Language Modeling
- Following BERT, tokens are randomly chosen in the text sequence, and replaced with the [MASK] token. 15% masking probability is used.
- The final output vectors of masked tokens are fed into a classifier over the whole text vocabulary with cross-entropy loss.
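A rough sketch of the MLM objective (the encoder and head names are hypothetical stand-ins; the full BERT-style 80/10/10 replacement scheme is omitted):

```python
import torch
import torch.nn.functional as F

def mlm_loss(fusion_encoder, mlm_head, image, token_ids, mask_token_id, mask_prob=0.15):
    """Mask 15% of text tokens and predict them over the whole vocabulary."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels = token_ids.masked_fill(~mask, -100)        # only masked positions contribute to the loss
    masked_ids = token_ids.masked_fill(mask, mask_token_id)
    text_hidden = fusion_encoder(image, masked_ids)    # contextualized vectors of the text tokens
    logits = mlm_head(text_hidden)                     # [N, seq_len, vocab_size]
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
```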
2.3. Image-Text Matching
- Image-text matching aims to predict whether an image and a text are matched, i.e., whether they come from the same pair.
- The final hidden vector of the [T_CLS] token is used to represent the image-text pair, and this vector is fed into a classifier with a cross-entropy loss for binary classification. Inspired by ALBEF, hard negative image-text pairs are sampled based on the contrastive image-to-text and text-to-image similarities.
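A sketch of this ALBEF-style hard negative sampling from the in-batch contrastive similarities (names are hypothetical):

```python
import torch

def sample_hard_negatives(sim_i2t, sim_t2i):
    """For each image pick a hard negative text index (and vice versa),
    sampling in proportion to similarity while excluding the matched diagonal."""
    n = sim_i2t.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim_i2t.device)
    w_i2t = sim_i2t.softmax(dim=1).masked_fill(eye, 0.0)   # more similar => more likely sampled
    w_t2i = sim_t2i.softmax(dim=1).masked_fill(eye, 0.0)
    neg_txt_idx = torch.multinomial(w_i2t, 1).squeeze(1)   # hard negative text per image
    neg_img_idx = torch.multinomial(w_t2i, 1).squeeze(1)   # hard negative image per text
    return neg_txt_idx, neg_img_idx
```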
2.4. Stagewise Pre-Training
A stagewise pre-training strategy is proposed as shown above, which leverages large-scale image-only and text-only corpora to improve the vision-language model.
Vision pre-training is first performed on image-only data, and then language pre-training on text-only data, to learn general image and text representations.
This model is then used to initialize vision-language pre-training, which learns the alignment of visual and linguistic information.
- For vision pre-training, the attention module and vision expert of Multiway Transformer are trained as in BEiT.
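For the text-only stage, the paper keeps the self-attention modules learned on images frozen and trains only the language expert with masked language modeling. A hedged sketch of that freezing step, with hypothetical parameter names following the MultiwayBlock sketch above:

```python
def prepare_text_only_stage(model):
    # Freeze everything except the language experts and the text embeddings,
    # so the self-attention learned on images is reused as-is for text.
    for name, param in model.named_parameters():
        param.requires_grad = ("experts.l" in name) or ("word_embedding" in name)
```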
2.5. Pretraining Data
- The pre-training data consists of four image captioning datasets: Conceptual Captions (CC) [40], SBU Captions [33], COCO [28] and Visual Genome (VG) [22]. There are about 4M images and 10M image-text pairs in the pre-training data.
3. VLMo Fine-Tuning
3.1. Vision-Language Classification
For classification tasks such as visual question answering and visual reasoning, VLMo is used as a fusion encoder to model modality interaction of images and text.
- The final encoding vector of the token [T_CLS] is used as the representation of the image-text pair, and is fed to a task-specific classifier layer to predict the label.
3.2. Vision-Language Retrieval
For retrieval tasks, VLMo can be used as a dual encoder to encode images and text separately.
- During fine-tuning, the model is optimized for the image-text contrastive loss.
- During inference, the representations of all images and text are computed, and then dot product is used to obtain image-to-text and text-to-image similarity scores of all possible image-text pairs. Separate encoding enables a much faster inference speed than fusion-encoder-based models.
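A minimal sketch of this dual-encoder scoring (the encoder callables are hypothetical stand-ins):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieval_scores(image_encoder, text_encoder, images, texts):
    """Encode each modality once, then score all pairs with a dot product."""
    img_emb = F.normalize(image_encoder(images), dim=-1)   # [N_img, D]
    txt_emb = F.normalize(text_encoder(texts), dim=-1)     # [N_txt, D]
    sim_i2t = img_emb @ txt_emb.t()                        # image-to-text similarity matrix
    return sim_i2t, sim_i2t.t()                            # text-to-image is the transpose
```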
4. Results
4.1. Vision-Language Classification
- VLMo-Large++: further training VLMo-Large on one billion noisy web image-text pairs with a larger batch size.
VLMo achieves state-of-the-art performance and substantially outperforms previous methods. The proposed large-size model even outperforms SimVLM-Huge [48] and Florence-Huge [50] by a large margin, although they consist of more parameters and are trained on larger-scale image-text data.
- The model uses a simple linear projection to embed images, as in ViLT, which leads to a significant speedup compared with previous models that use image region features.
4.2. Retrieval
VLMo achieves competitive performance with previous fusion-encoder-based models while having a much faster speed. The proposed large-size model even outperforms the huge-size model of Florence [50], which is also trained on massive image-text pairs with a larger batch size.
4.3. Vision Tasks
- VLMo is used as an image-only encoder.
The model also achieves competitive performance, even slightly better than the BEiT model used for the initialization of VLMo.
4.4. Ablation Studies
Image-only pre-training plus text-only pre-training improves the vision-language model.
Using Multiway Transformer achieves better performance than standard Transformer for both retrieval and classification tasks.
All pretraining tasks are important.
- Hard negative mining from more candidates is performed by gathering training examples from all GPUs (named global hard negative mining).
The global hard negative mining brings significant improvements.
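A sketch of the gathering step (assumes torch.distributed is initialized; not the official implementation): embeddings from all GPUs are collected so hard negatives can be mined from the global batch rather than only the local one.

```python
import torch
import torch.distributed as dist

def gather_global_embeddings(local_emb):
    """Collect [local_batch, D] embeddings from every GPU into one [global_batch, D] tensor."""
    gathered = [torch.zeros_like(local_emb) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_emb)
    return torch.cat(gathered, dim=0)
```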