Brief Review — An Empirical Study of Training End-to-End Vision-and-Language Transformers
METER Framework for VLM
An Empirical Study of Training End-to-End Vision-and-Language Transformers, METER, by University of California, Microsoft Corporation
2022 CVPR, Over 380 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
- METER, a Multimodal End-to-end TransformER framework, is proposed, in which the authors investigate how to design and pre-train a fully transformer-based vision-and-language (VL) model in an end-to-end manner.
- In particular, the vision encoder, text encoder, multimodal fusion module, architectural design, and pre-training objectives are investigated.
Outline
- METER
- Results
1. METER
- There are various VLMs using different vision encoders, text encoders, multimodal fusion modules, architectural designs, and pre-training objectives. The goal is to study which combination works best.
1.1. Vision Encoder & Text Encoder
- Given a text sentence l and an image v, a VLP model first extracts both text features and visual features via a text encoder and a vision encoder.
- The text and visual features are then fed into a multimodal fusion module to produce cross-modal representations, which can optionally be fed into a decoder before generating the final outputs (a minimal sketch of this pipeline is shown below).
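Below is a minimal PyTorch-style sketch of this pipeline. It is only an illustrative skeleton under my own naming assumptions (VLPModel and its constructor arguments are hypothetical placeholders, not METER's actual classes); METER itself plugs in pre-trained transformers for the two encoders.

```python
import torch.nn as nn

class VLPModel(nn.Module):
    """Toy vision-language pipeline: encode each modality, fuse, optionally decode."""
    def __init__(self, text_encoder, vision_encoder, fusion, decoder=None):
        super().__init__()
        self.text_encoder = text_encoder      # e.g., a BERT/RoBERTa-style transformer
        self.vision_encoder = vision_encoder  # e.g., a ViT/CLIP-ViT/Swin-style transformer
        self.fusion = fusion                  # merged attention or co-attention (Sec. 1.2)
        self.decoder = decoder                # optional, only for encoder-decoder variants (Sec. 1.3)

    def forward(self, text_tokens, image):
        text_feats = self.text_encoder(text_tokens)          # (B, L_t, D) token features
        visual_feats = self.vision_encoder(image)            # (B, L_v, D) patch features
        cross_feats = self.fusion(text_feats, visual_feats)  # cross-modal representations
        if self.decoder is not None:                         # encoder-decoder variant only
            cross_feats = self.decoder(cross_feats)
        return cross_feats                                   # fed into task-specific output layers
```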
1.2. Multimodal Fusion Module
- For the multimodal fusion module, there are 2 types: merged attention and co-attention.
- In the merged attention module, the text and visual features are simply concatenated together, then fed into a single transformer block.
- In the co-attention module, on the other hand, the text and visual features are fed into separate transformer blocks independently, and techniques such as cross-attention are used to enable cross-modal interaction (see the sketch of both modules below).
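A minimal sketch of the two fusion styles, assuming single attention layers with residual connections, layer norms, and feed-forward sublayers omitted for brevity (class and argument names here are illustrative, not taken from the METER codebase):

```python
import torch
import torch.nn as nn

class MergedAttentionFusion(nn.Module):
    """Merged attention: concatenate text and visual tokens, run one shared transformer block."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, text_feats, visual_feats):
        merged = torch.cat([text_feats, visual_feats], dim=1)  # (B, L_t + L_v, D)
        return self.block(merged)                              # self-attention over both modalities

class CoAttentionFusion(nn.Module):
    """Co-attention: each modality keeps its own stream; cross-attention exchanges information."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, visual_feats):
        t, _ = self.text_self(text_feats, text_feats, text_feats)       # text self-attention
        v, _ = self.vis_self(visual_feats, visual_feats, visual_feats)  # visual self-attention
        # Queries from one modality attend to keys/values from the other modality.
        t_out, _ = self.text_cross(t, v, v)
        v_out, _ = self.vis_cross(v, t, t)
        return t_out, v_out
```

The ablation in Section 2.1 below favors the co-attention design in METER's end-to-end setting.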
1.3. Encoder-Only & Encoder-Decoder
- In the encoder-only architecture, the cross-modal representations are directly fed into an output layer to generate the final outputs.
- In the encoder-decoder architecture, the cross-modal representations are first fed into a decoder and then into an output layer (a brief sketch of both options is given below).
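As a rough sketch of the two output-side options, the snippet below pairs an encoder-only classification head with an encoder-decoder generation head; the class names, shapes, and the simple [CLS]-token pooling are my own assumptions for illustration:

```python
import torch.nn as nn

class EncoderOnlyHead(nn.Module):
    """Encoder-only: pool the cross-modal representation and apply an output layer directly."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, cross_feats):              # (B, L, D) cross-modal representations
        pooled = cross_feats[:, 0]               # use the first ([CLS]-like) token
        return self.classifier(pooled)           # (B, num_classes)

class EncoderDecoderHead(nn.Module):
    """Encoder-decoder: a decoder attends to the cross-modal representation before the output layer."""
    def __init__(self, dim, vocab_size, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, cross_feats, target_tokens):    # (B, L, D), (B, T)
        tgt = self.embed(target_tokens)                # embed target tokens (causal mask omitted)
        dec = self.decoder(tgt, memory=cross_feats)    # cross-attend to the fused representations
        return self.lm_head(dec)                       # (B, T, vocab_size) token logits
```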
1.4. Pretraining Objectives
- Masked Language Modeling (MLM): some of the input tokens are randomly masked, and the model is trained to reconstruct the original tokens.
- Image-Text Matching (ITM): the model needs to identify whether an image and a caption correspond to each other (a toy sketch of the MLM and ITM losses follows this list).
- Masked Image Modeling (MIM): similar to the MLM objective, masking is applied on the vision side; two variants are considered below.
- In-batch Negatives: for each masked patch, the model is trained to maximize the probability of the original patch against the other patches in the batch, similar to noise contrastive estimation.
- Discrete Code: the masked patches are first mapped to discrete visual tokens, and the model is trained to predict these tokens instead of regressing the raw patches.
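To make the two objectives used by the final METER model (MLM + ITM, see Section 2.1) concrete, here is a toy sketch of the corresponding losses. The heads, shapes, and dummy tensors are hypothetical placeholders, not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlm_loss(cross_text_feats, vocab_head, masked_labels):
    """Masked language modeling: predict the original token at each masked position.
    masked_labels uses -100 at unmasked positions, which cross_entropy ignores."""
    logits = vocab_head(cross_text_feats)                    # (B, L, V)
    return F.cross_entropy(logits.flatten(0, 1), masked_labels.flatten(), ignore_index=-100)

def itm_loss(pooled_cross_feats, itm_head, match_labels):
    """Image-text matching: binary classification of whether an (image, caption) pair matches."""
    logits = itm_head(pooled_cross_feats)                    # (B, 2)
    return F.cross_entropy(logits, match_labels)

# Toy usage with random features standing in for the fusion module's outputs.
B, L, D, V = 4, 16, 768, 30522
vocab_head, itm_head = nn.Linear(D, V), nn.Linear(D, 2)
text_feats, pooled = torch.randn(B, L, D), torch.randn(B, D)
labels = torch.full((B, L), -100, dtype=torch.long)
labels[:, 3] = 42                                            # pretend one masked position per sample
match = torch.randint(0, 2, (B,))                            # 1 = matched pair, 0 = mismatched
total_loss = mlm_loss(text_feats, vocab_head, labels) + itm_loss(pooled, itm_head, match)
```

Both losses are computed on top of the fusion module's cross-modal representations and optimized jointly during pre-training.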
2. Results
2.1. Ablation Studies
Co-attention without the decoder is the best.
MLM+ITM is the best combination of pre-training objectives.
2.2. SOTA Comparisons
Compared with models pre-trained with fewer than 10M images, the proposed CLIP-based model (METER-CLIP-ViTBASE) can achieve either the best or the second best scores on all the downstream tasks.
- In addition, while ALBEF has specially-designed objectives for retrieval, the proposed model can still outperform ALBEF on text and image retrieval tasks.
2.3. Model Scaling
- CoSwin-Huge [55] is used as the vision backbone and RoBERTa-base is used as the text backbone.
The proposed model achieves state-of-the-art performance on VQAv2, surpassing the previous model SimVLM, which is trained with 1.8B images. The results indicate that the proposed METER framework is scalable.