Brief Review — An Empirical Study of Training End-to-End Vision-and-Language Transformers

METER Framework for VLM

Sik-Ho Tsang
4 min read · Dec 18, 2024
METER Framework

An Empirical Study of Training End-to-End Vision-and-Language Transformers, METER, by University of California, Microsoft Corporation
2022 CVPR, Over 380 Citations (Sik-Ho Tsang @ Medium)

Visual/Vision/Video Language Model (VLM)
2017 … 2023
[GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa] [Florence-2] [PaLI] [PaLI-X] [OpenCLIP] 2024 [MiniGPT-4]
==== My Other Paper Readings Are Also Over Here ====

  • METER, a Multimodal End-to-end TransformER framework, is proposed, in which the authors investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner.
  • In particular, vision encoders, text encoders, multimodal fusion modules, architectural designs, and pre-training objectives are investigated.

Outline

  1. METER
  2. Results

1. METER

Glossary of representative VLP models.
  • There are various VLP models, using different vision encoders, text encoders, multimodal fusion modules, architectural designs, and pre-training objectives. The goal is to study which combination works best.

1.1. Vision Encoder & Text Encoder

  • Given a text sentence l and an image v, a VLP model first extracts both text features and visual features via a text encoder and a vision encoder.
  • The text and visual features are then fed into a multimodal fusion module to produce cross-modal representations, which are then optionally fed into a decoder before generating the final outputs.
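Below is a minimal sketch of this generic flow in PyTorch; the module names, hidden size, and the use of the first token for prediction are illustrative assumptions, not METER's actual implementation.

```python
import torch.nn as nn

class VLPModel(nn.Module):
    """Generic VLP flow: encode each modality, fuse, optionally decode, predict."""
    def __init__(self, text_encoder, vision_encoder, fusion, decoder=None, hidden=768):
        super().__init__()
        self.text_encoder = text_encoder      # e.g. a RoBERTa-style transformer
        self.vision_encoder = vision_encoder  # e.g. a CLIP-ViT-style transformer
        self.fusion = fusion                  # merged attention or co-attention module
        self.decoder = decoder                # optional, for encoder-decoder variants
        self.head = nn.Linear(hidden, 2)      # task-specific output layer (e.g. ITM)

    def forward(self, text_ids, image):
        t = self.text_encoder(text_ids)   # (B, L_t, hidden) text features
        v = self.vision_encoder(image)    # (B, L_v, hidden) visual features
        x = self.fusion(t, v)             # cross-modal representations
        if self.decoder is not None:
            x = self.decoder(x)           # optional decoding step
        return self.head(x[:, 0])         # predict from the first (e.g. [CLS]) position
```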

1.2. Multimodal Fusion Module

Multimodal Fusion Module
  • For multimodal fusion module, there are 2 types.
  • In the merged attention module, the text and visual features are simply concatenated together and fed into a single transformer block.
  • In the co-attention module, on the other hand, the text and visual features are fed into separate transformer blocks, and cross-attention is used to enable cross-modal interaction, as sketched below.
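Here is a rough sketch of the two fusion styles using standard PyTorch building blocks; residual connections, layer norms, and feed-forward sublayers are omitted for brevity, and this is an assumption-laden illustration rather than METER's exact implementation.

```python
import torch
import torch.nn as nn

class MergedAttentionBlock(nn.Module):
    """Merged attention: concatenate text and visual tokens, then run one
    self-attention block over the combined sequence."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)

    def forward(self, text_feats, vis_feats):
        x = torch.cat([text_feats, vis_feats], dim=1)  # (B, L_t + L_v, hidden)
        return self.block(x)

class CoAttentionBlock(nn.Module):
    """Co-attention: keep the two streams separate; each stream cross-attends
    to the other to enable cross-modal interaction."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.t_self = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.v_self = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.t_cross = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.v_cross = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, text_feats, vis_feats):
        t, _ = self.t_self(text_feats, text_feats, text_feats)  # text self-attention
        v, _ = self.v_self(vis_feats, vis_feats, vis_feats)     # visual self-attention
        t, _ = self.t_cross(t, v, v)  # text queries attend to visual keys/values
        v, _ = self.v_cross(v, t, t)  # visual queries attend to text keys/values
        return t, v
```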

1.3. Encoder-Only & Encoder-Decoder

Encoder-Only & Encoder-Decoder
  • In the encoder-only architecture, the cross-modal representations are directly fed into an output layer to generate the final outputs.
  • In the encoder-decoder architecture, the cross-modal representations are first fed into a decoder and then into an output layer (see the sketch below).
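Below is a minimal sketch of the two output paths; the vocabulary size, number of decoder layers, and use of the first token are hypothetical placeholders, not values from the paper.

```python
import torch.nn as nn

class EncoderOnlyHead(nn.Module):
    """Encoder-only: cross-modal representations go straight to an output layer."""
    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, fused):            # fused: (B, L, hidden)
        return self.out(fused[:, 0])     # classify from the first token

class EncoderDecoderHead(nn.Module):
    """Encoder-decoder: a transformer decoder attends to the cross-modal
    representations before the output layer (useful for generation-style outputs)."""
    def __init__(self, hidden=768, vocab=50265, layers=2, heads=12):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, target_embeds, fused):
        x = self.decoder(tgt=target_embeds, memory=fused)  # decoder attends to fused features
        return self.out(x)                                  # per-token logits
```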

1.4. Pretraining Objectives

  • Masked Language Modeling (MLM): some of the input tokens are randomly masked, and the model is trained to reconstruct the original tokens.
  • Image-Text Matching (ITM): the model needs to identify which images and captions correspond to each other.
  • Masked Image Modeling (MIM): the vision-side counterpart of MLM, where some image patches are masked and the model is trained to recover them. Two variants are studied, as listed below.
In-batch Negatives & Discrete Code
  • In-batch Negatives: for each masked patch, the model is trained to identify the original patch among the other patches in the same batch, with an objective similar to noise contrastive estimation.
  • Discrete Code: the image patches are first mapped to discrete visual tokens, and the model is trained to predict the discrete tokens of the masked patches instead of regressing the raw patches.
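As a concrete illustration, the two objectives that the ablations below find most useful (MLM and ITM) can be written as standard cross-entropy losses; the function names and label conventions here are assumptions for illustration, not code from the paper.

```python
import torch.nn.functional as F

def mlm_loss(token_logits, labels):
    """Masked Language Modeling: predict the original ids of the masked text tokens.
    `labels` holds the true id at masked positions and -100 elsewhere (ignored)."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

def itm_loss(match_logits, is_matched):
    """Image-Text Matching: binary classification of whether an (image, caption)
    pair belongs together; negatives come from mismatched pairs in the batch."""
    return F.cross_entropy(match_logits, is_matched)

# Hypothetical pre-training step: the total loss is simply the sum of the two.
# loss = mlm_loss(mlm_logits, mlm_labels) + itm_loss(itm_logits, itm_targets)
```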

2. Results

2.1. Ablation Studies

CLIP and RoBERTa are used by default.

Co-attention without the decoder is the best.

MLM+ITM is the best.

2.2. SOTA Comparisons

Compared with models pre-trained with fewer than 10M images, the proposed CLIP-based model (METER-CLIP-ViTBASE) can achieve either the best or the second best scores on all the downstream tasks.

  • In addition, while ALBEF has specially designed objectives for retrieval, the proposed model can still outperform ALBEF on text and image retrieval tasks.

2.3. Model Scaling

  • CoSwin-Huge [55] is used as the vision backbone and RoBERTa-base is used as the text backbone.

The proposed model achieves state-of-the-art performance on VQAv2, surpassing previous models such as SimVLM, which was trained with 1.8B images. The results indicate that the proposed METER framework is scalable.

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
