Review — SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

SimVLM, Simple VLM Objective and Model Architecture

Sik-Ho Tsang
6 min read · May 1


SimVLM: Simple Visual Language Model Pretraining with Weak Supervision,
SimVLM, by Carnegie Mellon University, Google Research, Brain Team, and University of Washington,
2022 ICLR, Over 280 Citations (Sik-Ho Tsang @ Medium)

Vision Language Model (VLM)
2017 … 2021
[CLIP] [VinVL] [ALIGN] [VirTex] [ALBEF] [Conceptual 12M (CC12M)] 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] 2023 [GPT-4]

  • A minimalist pretraining framework, named Simple Visual Language Model (SimVLM), is proposed.
  • Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective.
  • Without utilizing extra data or task-specific customization, it outperforms many SOTA approaches.


  1. SimVLM
  2. Results

1. SimVLM

1.1. Preliminaries in BERT and GPT

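  • BERT pretrains the model with bidirectional Masked Language Modeling (MLM).
  • Given a text sequence $x$, a subset of tokens $x_m$ is randomly sampled, and a corrupted sequence $x_{\backslash m}$ is constructed by replacing the tokens in $x_m$ with a special [MASK] token.
  • The training objective is to reconstruct $x_m$ from the context $x_{\backslash m}$ by minimizing the negative log-likelihood, where $\theta$ denotes the model parameters and $D$ the pretraining data: $\mathcal{L}_{\mathrm{MLM}}(\theta) = -\mathbb{E}_{x \sim D}\left[\log P_\theta(x_m \mid x_{\backslash m})\right]$
  • The alternative, as in GPT, is unidirectional Language Modeling (LM), which has also been shown to be highly effective for multiple NLP tasks. It maximizes the likelihood of a sequence of length $T$ under a forward autoregressive factorization: $\mathcal{L}_{\mathrm{LM}}(\theta) = -\mathbb{E}_{x \sim D}\left[\log P_\theta(x)\right] = -\mathbb{E}_{x \sim D}\left[\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right]$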

While MLM has become the de facto approach in VLP models reviewed above, the generative LM has been understudied.
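The MLM corruption step described in this section can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name `corrupt_for_mlm`, the whitespace tokenization, and the sampling probability are hypothetical choices, and every sampled token is replaced by [MASK] as in the description above (BERT's actual recipe has further details that are omitted here).

```python
import random

MASK = "[MASK]"

def corrupt_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted sequence, {position: original token}).

    Each position is sampled independently with probability mask_prob;
    sampled tokens are replaced by MASK and kept as reconstruction targets.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok          # the model must reconstruct this token
        else:
            corrupted.append(tok)     # context token, left untouched
    return corrupted, targets

# Toy sentence, split on whitespace (an assumption; real models use sub-words).
tokens = "a dog is chasing a red ball in the park".split()
corrupted, targets = corrupt_for_mlm(tokens, mask_prob=0.3, seed=1)
print(" ".join(corrupted))
print(targets)
```

The loss is then computed only at the positions stored in `targets`, while the unmasked tokens serve as bidirectional context.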

1.2. SimVLM PrefixLM

SimVLM Overall Architecture. This shows an example of training with PrefixLM on an image-text pair. For text-only corpora, the image patches are simply removed and only the textual tokens are used.
  • The vision-language representation is pretrained using Prefix Language Modeling (PrefixLM).

PrefixLM differs from the standard LM in that it enables bidirectional attention on the prefix sequence (e.g. $x_{<T_p}$) and only conducts autoregressive factorization on the remaining tokens (e.g. $x_{\geq T_p}$).
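Written out, the description above corresponds to a negative log-likelihood over the non-prefix tokens only. The form below is a reconstruction from that description rather than a copy of the paper's equation; the symbols $\theta$ (model parameters), $D$ (pretraining data), $T$ (sequence length) and $T_p$ (prefix length) are notational assumptions.

```latex
\[
\mathcal{L}_{\mathrm{PrefixLM}}(\theta)
  = -\mathbb{E}_{x \sim D}\left[ \log P_\theta\left(x_{\geq T_p} \mid x_{<T_p}\right) \right]
  = -\mathbb{E}_{x \sim D}\left[ \sum_{t=T_p}^{T} \log P_\theta\left(x_t \mid x_{[T_p,\,t)},\; x_{<T_p}\right) \right]
\]
```

The prefix $x_{<T_p}$ is only ever conditioned on, never predicted, which is why it can be encoded with bidirectional attention.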

Intuitively, images can be considered a prefix for their textual descriptions, as they often appear before the text in a web document.

A PrefixLM model under the sequence-to-sequence framework not only enjoys the bidirectional contextualized representation as in MLM (BERT), but can also perform text generation similar to LM (GPT).
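One way to picture the difference between the prefix and the remaining tokens is as an attention mask. The sketch below is a single-sequence view chosen for illustration (SimVLM itself is described as a sequence-to-sequence model, so this is not its implementation); `prefix_lm_mask`, `loss_positions` and `render` are hypothetical helper names.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j.

    Every position sees the whole prefix (bidirectional inside the prefix);
    positions after the prefix additionally see earlier non-prefix positions
    and themselves (causal).
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [[j < prefix_len or j <= i for j in range(seq_len)]
            for i in range(seq_len)]

def loss_positions(seq_len, prefix_len):
    """Only tokens after the prefix contribute to the PrefixLM loss."""
    return [i >= prefix_len for i in range(seq_len)]

def render(mask):
    """Print-friendly form: one row per query position, 1 = may attend."""
    return "\n".join("".join("1" if a else "0" for a in row) for row in mask)

# Toy example: 3 prefix positions (e.g. image patches) and 2 text tokens.
print(render(prefix_lm_mask(seq_len=5, prefix_len=3)))
# 11100
# 11100
# 11100
# 11110
# 11111
```

Setting `prefix_len=0` recovers the causal mask of a standard LM, and `prefix_len=seq_len` gives fully bidirectional attention, so the two objectives from Section 1.1 sit at the two ends of this construction.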

1.3. SimVLM Model Architecture

For the visual modality, a ViT-style architecture is used, where the model receives the raw image $x$ and maps it into a flattened 1D sequence of patches $x_p$ as input to the Transformer.

  • The first 3 blocks of ResNet are used to extract contextualized patches, which works better than the naive linear projection (equivalent to a 1×1 conv) used in ViT.

For the textual modality, standard practice is used to tokenize the input sentence into sub-word tokens, and the embeddings are learned for a fixed vocabulary.

  • To retain positional information, two trainable 1D positional embeddings are added for image and text inputs separately, and 2D relative attention is additionally added for the image patches within the Transformer layers, as used in CoAtNet.
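The input layout described above can be sketched as follows: image patches come first, text tokens follow, and each modality indexes its own 1D positional table, so position ids restart at the modality boundary. This is a toy illustration only; the helper names, the 224-pixel image and the 16-pixel patch size are assumptions made for the example, and no embeddings or 2D relative attention are modeled.

```python
def num_patches(height, width, patch_size):
    """Number of patches when the image is cut into a regular grid."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)

def build_inputs(n_patches, text_tokens):
    """Return (modality, position_id, content) triples for one example.

    Position ids are counted separately per modality because image and
    text use separate 1D positional embeddings.
    """
    image_part = [("image", i, f"<patch_{i}>") for i in range(n_patches)]
    text_part = [("text", i, tok) for i, tok in enumerate(text_tokens)]
    return image_part + text_part

# Assumed sizes for illustration: a 224x224 image with 16x16 patches.
print(num_patches(224, 224, 16))  # 196

# Tiny example with 4 patches so the layout is easy to read.
for item in build_inputs(4, ["a", "dog", "running"]):
    print(item)
```

For a text-only example, `build_inputs(0, tokens)` yields just the text part, matching the remark that image patches are simply dropped for text-only corpora.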

Three variants of SimVLM, namely “Base”, “Large”, and “Huge”, are created.

1.4. Datasets

  • SimVLM does not rely on an object detection module and only operates with raw image patch inputs. SimVLM pretrains all model parameters from scratch using large-scale noisy image-text data.

Specifically, the image and alt-text pairs introduced in ALIGN are used, which are crawled from the web with minimal post-processing.

  • On the other hand, the formulation of PrefixLM is modality-agnostic and thus SimVLM can additionally include text-only corpora to compensate for noisy text supervision in the alt-text data.

As shown later in the experiments, this unified PrefixLM formulation reduces the modality discrepancy and improves the model quality. Because it is a one-pass, end-to-end pretraining procedure with the single loss described above, it is called Simple Visual Language Model (SimVLM).

1.5. Some Details

  • SimVLMs are pretrained from scratch for about 1M steps on the ALIGN training set and the Colossal Clean Crawled Corpus (C4).
  • Each batch contains 4,096 image-text pairs (ALIGN) and 512 text-only documents (C4), sharded across 512 TPU v3 chips.
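A quick back-of-the-envelope check of what these numbers imply per chip and over the whole run. It assumes the batch is split evenly across chips and takes "about 1M steps" as exactly 1,000,000, both of which are simplifications for the sake of the arithmetic.

```python
IMAGE_TEXT_PAIRS_PER_BATCH = 4096   # ALIGN pairs per batch (from the text)
TEXT_DOCS_PER_BATCH = 512           # C4 documents per batch (from the text)
NUM_CHIPS = 512                     # TPU v3 chips (from the text)
APPROX_STEPS = 1_000_000            # "about 1M steps", rounded (assumption)

def per_chip(total, chips):
    """Examples per chip, assuming an even split across chips."""
    if total % chips:
        raise ValueError("batch does not divide evenly across chips")
    return total // chips

pairs_per_chip = per_chip(IMAGE_TEXT_PAIRS_PER_BATCH, NUM_CHIPS)
docs_per_chip = per_chip(TEXT_DOCS_PER_BATCH, NUM_CHIPS)

# Number of image-text pairs presented during pretraining.
# This counts presentations, not unique pairs.
approx_pair_presentations = APPROX_STEPS * IMAGE_TEXT_PAIRS_PER_BATCH

print(pairs_per_chip)             # 8
print(docs_per_chip)              # 1
print(approx_pair_presentations)  # 4096000000
```

Under these assumptions each chip handles 8 image-text pairs and 1 text-only document per step, an 8:1 ratio between the two data sources.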

After pretraining, SimVLM is finetuned and evaluated on six vision-language benchmarks, including three discriminative tasks and three generative tasks.

2. Results

2.1. SOTA Comparisons

Single model results for vision-language pretraining methods on popular VL benchmarks.

SimVLM outperforms all existing models and achieves new SOTA results on all tasks considered, often by a significant margin. This simple framework with weak supervision is sufficient to learn high-quality multi-modal representations.

  • For the discriminative tasks, SimVLM_base already outperforms all prior methods while using less capacity.
  • SimVLM_huge obtains an improvement of almost 4 points in absolute score over the previous SOTA (VinVL), pushing single-model performance above 80% on VQA for the first time.

2.1. Zero-Shot/Few-Shot Image Captioning

Image captioning results on CoCo Karpathy-test split and NoCaps validation split.
  • The pretrained SimVLM model is used to directly decode on image captioning benchmarks for the zero-shot setting while finetuning on 1% training data for 5 epochs is used for the few-shot setting.
  • Using a prefix prompt “A picture of” improves the quality of decoded captions.
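Mechanically, a prefix prompt just means the decoder starts from the prompt tokens instead of an empty sequence and continues autoregressively from there. The sketch below shows that control flow with a toy lookup table standing in for the model; `greedy_decode`, `toy_next_token` and `TOY_TABLE` are hypothetical, and a real model would also condition on the image, which this toy ignores.

```python
def greedy_decode(next_token_fn, prompt_tokens, max_new_tokens=10, eos="</s>"):
    """Continue prompt_tokens one token at a time until eos or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

# Toy stand-in for a trained model: the next token depends only on the
# last token. The entries are made up for illustration.
TOY_TABLE = {"of": "a", "a": "dog", "dog": "running", "running": "</s>"}

def toy_next_token(tokens):
    last = tokens[-1] if tokens else None
    return TOY_TABLE.get(last, "</s>")

caption = greedy_decode(toy_next_token, ["A", "picture", "of"])
print(" ".join(caption))  # A picture of a dog running
```

With an empty prompt this toy model stops immediately, a crude illustration of how the prompt steers the decoder toward caption-like continuations.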

The zero-shot/few-shot performance of SimVLM is competitive with fully supervised baselines on CoCo, which demonstrates strong generalization.

Generated examples of SimVLM: zero-shot image captioning.

SimVLM is able to not only capture real-world concepts but also provide a detailed description of the visual input.

  • For example, the decoded samples are able to explain complex scenes with multiple objects. The model also shows understanding of fine-grained concepts, such as a specific car brand and model (e.g. “Aston Martin”, “Vantage”).

2.2. Zero-Shot Cross-Modality Transfer

Generated examples of SimVLM: Zero-Shot Cross-Modality Transfer
Zero-shot cross-modality transfer results on SNLI-VE and Multi30k.
  • The model is only finetuned using training data from a source language (typically English) and evaluated on the target language without further training.

SimVLM performs competitively with fully supervised baselines including UNITER under the zero-shot setting.

2.3. Open-Ended VQA

Generated examples of SimVLM: Open-Ended VQA
Comparison of discriminative and generative VQA methods.

SimVLM outperforms both discriminative and generative baselines on all splits.

Generated examples of SimVLM: Zero-Shot Open-Ended VQA
  • Unlike discriminative approaches, SimVLM is able to generate answers that are not included in the 3,129-answer candidate set. To test this, the pretraining process is continued on the cleaner WIT dataset.

The above examples show that open-ended VQA ability emerges in SimVLM, such that it can generate related responses after finetuning on the knowledge-rich Wikipedia dataset.

2.4. Single-Modality Task (Text): GLUE

Text-Only GLUE Benchmark

SimVLM performs better than existing VLP methods and competitively with BERT, indicating that it has good language understanding ability.

2.5. Single-Modality Task (Vision): ImageNet

Linear evaluation on ImageNet classification, compared to state-of-the-art representation learning methods.

The results show that SimVLM has also learned high-quality image representation.

2.6. Ablation Study on VQA

Ablation study on VQA.
  • Numerous ablation studies are reported here.

The PrefixLM objective outperforms both span corruption (T5) and naive LM, illustrating the importance of PrefixLM.


