BEiT-3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Vision-Language Foundation Model, With Unified Architecture Using Multiway Transformer for Multiple Modalities

Sik-Ho Tsang
4 min read · Jun 4


BEiT-3 Overview

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks,
BEiT-3, by Microsoft Corporation
2023 CVPR, Over 140 Citations (Sik-Ho Tsang)

Vision Language Model (VLM) / Foundation Model
2017 … 2021
[CLIP] [VinVL] [ALIGN] [VirTex] [ALBEF] [Conceptual 12M (CC12M)] 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] [VLMo] 2023 [GPT-4]

  • A general-purpose multimodal foundation model, BEiT-3, is proposed. Specifically, three aspects are advanced: backbone architecture, pretraining task, and model scaling up.
  • BEiT-3 uses Multiway Transformers (from VLMo) to enable both deep fusion and modality-specific encoding.
  • BEiT-3 applies Masked “language” modeling on images (Imglish), texts (English), and image-text pairs (“parallel sentences”) in a unified manner.


  1. BEiT-3
  2. Results

1. BEiT-3

BEiT-3 Architecture

1.1. Overall Architecture

  • Multiway Transformers are used as the backbone model to encode different modalities.
  • Each Multiway Transformer block consists of a shared self-attention module, and a pool of feed-forward networks (i.e., modality experts) used for different modalities.

The shared self-attention module learns the alignment between different modalities and enables deep fusion for multimodal (such as vision-language) tasks.

  • Each input token is routed to the experts depending on its modality.
  • Each layer contains a vision expert and a language expert.
  • Moreover, the top three layers have vision-language experts, which serve as the fusion encoder for multimodal inputs.
BEiT-3 Model Hyperparameters
  • BEiT-3 consists of a 40-layer Multiway Transformer with 1408 hidden size, 6144 intermediate size, and 16 attention heads.

It consists of 1.9B parameters in total, including 692M parameters for vision experts, 692M parameters for language experts, 52M parameters for vision-language experts, and 317M parameters for the shared self-attention module.
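The Multiway block described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the class name, the routing loop, and the pre-norm placement are assumptions; the hidden size (1408), intermediate size (6144), and 16 heads match the reported hyperparameters.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Sketch of one Multiway Transformer block: a shared self-attention
    module followed by a pool of modality-specific FFN experts.
    BEiT-3 stacks 40 such blocks (hidden 1408, FFN 6144, 16 heads)."""

    def __init__(self, dim=1408, ffn_dim=6144, heads=16, num_experts=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality, e.g. 0 = vision, 1 = language.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(),
                          nn.Linear(ffn_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x, modality_ids):
        # Shared self-attention lets tokens of all modalities attend
        # to each other, enabling deep multimodal fusion.
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Route each token to the expert matching its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            token_mask = modality_ids == m          # (batch, seq) bool
            out[token_mask] = expert(h[token_mask]) # apply expert per token
        return x + out
```

Because attention is shared while the FFNs are modality-specific, the same block can act as a vision encoder, a language encoder, or a fusion encoder, depending on which tokens and experts are active.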

1.2. Unified Architecture for Downstream Tasks

  • This unified architecture enables BEiT-3 to support a wide range of downstream tasks.

For example, BEIT-3 can be used as an image backbone for various vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation.

It can also be fine-tuned as a dual encoder for efficient image-text retrieval, or as a fusion encoder for multimodal understanding and generation tasks.

1.3. Pretraining Task: Masked Data Modeling

  • The unified mask-then-predict task not only learns representations but also learns the alignment of different modalities.
  • Specifically, text data is tokenized by a SentencePiece tokenizer.
  • Image data is tokenized by the tokenizer of BEiT V2 to obtain the discrete visual tokens as the reconstructed targets.

BEiT-3 randomly masks 15% tokens of monomodal texts and 50% tokens of texts from image-text pairs. For images, BEiT-3 masks 40% of image patches using a block-wise masking strategy as in BEiT.
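The per-modality mask ratios above can be captured in a short sketch. The function names are illustrative; for simplicity this draws a uniform random mask, whereas BEiT-3 actually uses a block-wise masking strategy (as in BEiT) for image patches.

```python
import random

# Mask ratios reported for BEiT-3 pretraining.
MASK_RATIO = {
    "text": 0.15,         # monomodal text tokens
    "paired_text": 0.50,  # text tokens from image-text pairs
    "image": 0.40,        # image patches (block-wise in the paper)
}

def random_mask(num_tokens: int, modality: str, seed=None):
    """Return the sorted indices of tokens to mask (simplified:
    uniform sampling rather than block-wise for images)."""
    rng = random.Random(seed)
    n_mask = int(num_tokens * MASK_RATIO[modality])
    return sorted(rng.sample(range(num_tokens), n_mask))
```

The model is then trained to recover the discrete targets at the masked positions: SentencePiece token ids for text and BEiT v2 visual token ids for image patches.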

1.4. Scaling Up: BEiT-3 Pretraining

Pretraining Data
  • For multimodal data, there are about 15M images and 21M image-text pairs collected from five public datasets: Conceptual 12M (CC12M), Conceptual Captions (CC3M), SBU Captions (SBU), COCO and Visual Genome (VG).
  • For monomodal data, BEiT-3 uses 14M images from ImageNet-21K and a 160GB text corpus drawn from English Wikipedia, BookCorpus, OpenWebText, CC-News, and Stories.
  • With the unified architecture and a single pretraining task, scaling up becomes straightforward.

A much smaller pretraining batch size can be used with the mask-then-predict task. In comparison, contrastive-based models usually need a very large batch size for pretraining, which brings engineering challenges such as high GPU memory cost.

BEIT-3 is pretrained for 1M steps. Each batch contains 6144 samples in total: 2048 images, 2048 texts, and 2048 image-text pairs. This batch size is much smaller than that of contrastive models.
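The batch composition above can be sketched as a simple round-robin sampler. This is an assumption about how the streams are interleaved; real pretraining loaders would shard, shuffle, and collate tensors.

```python
from itertools import cycle, islice

def mixed_batches(image_data, text_data, pair_data, per_stream=2048):
    """Yield pretraining batches mixing the three data streams equally,
    mirroring BEiT-3's batch of 6144 = 2048 images + 2048 texts
    + 2048 image-text pairs (sketch only)."""
    images, texts, pairs = cycle(image_data), cycle(text_data), cycle(pair_data)
    while True:
        yield {
            "images": list(islice(images, per_stream)),
            "texts": list(islice(texts, per_stream)),
            "pairs": list(islice(pairs, per_stream)),
        }
```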

2. Results

Performance on Wide Range of Downstream Tasks
  • BEIT-3 is evaluated on major public benchmarks for both vision-language and vision tasks.
  • The above figure and table present the overview of results.

BEIT-3 obtains state-of-the-art performance on a wide range of vision and vision-language tasks.


