Brief Review — PaLI: A Jointly-Scaled Multilingual Language-Image Model

PaLI, Joint Scaling for Both mT5 and ViT Is Important

Sik-Ho Tsang
Mar 21, 2024
PaLI Framework

PaLI: A Jointly-Scaled Multilingual Language-Image Model, by Google Research
2023 ICLR, Over 340 Citations (Sik-Ho Tsang @ Medium)


  • PaLI (Pathways Language and Image model) is proposed, which generates text based on visual and textual inputs.
  • The vision and language components are scaled jointly; to this end, a large 4-billion-parameter ViT (ViT-e) is trained.
  • A large multilingual mix of pre-training tasks is created, based on a new image-text training set containing 10B images and texts in over 100 languages.


  1. PaLI
  2. Results

1. PaLI

PaLI Framework

PaLI accepts as input an image and text string, and generates text as output.

1.1. Visual Component

ViT-e, the largest vanilla ViT architecture at the time, is introduced and trained. It has the same architecture and uses the same training recipe as the 1.8B-parameter ViT-G, scaled up to 4B parameters.

1.2. Language Component

  • The pre-trained mT5-Large (1B parameters) and mT5-XXL (13B parameters) are used.
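  • The language component is trained on a mix of many tasks, including pure language understanding tasks, to avoid catastrophic forgetting of mT5's language capabilities.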

1.3. Overall Model

  • 3 model sizes are considered:
  1. PaLI-3B, where the language component is initialized from mT5-Large (1B parameters), and the vision component is ViT-G (1.8B parameters).
  2. PaLI-15B, where the language component is initialized from mT5-XXL (13B parameters), and the vision component is ViT-G (1.8B parameters).
  3. PaLI-17B, where the language model is initialized from mT5-XXL, and the vision component is the newly-trained ViT-e model (4B parameters).
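
To make the naming of the three variants concrete, the short sketch below adds up the approximate component sizes quoted above and reports how much of each model is vision. The dictionary names and the helper `component_breakdown` are illustrative only, and the counts are the rounded figures from this review, not exact parameter counts.

```python
# Approximate parameter counts in billions, as quoted in this review.
LANGUAGE_PARAMS_B = {"mT5-Large": 1.0, "mT5-XXL": 13.0}
VISION_PARAMS_B = {"ViT-G": 1.8, "ViT-e": 4.0}

# Which components each PaLI variant combines.
VARIANTS = {
    "PaLI-3B": ("mT5-Large", "ViT-G"),
    "PaLI-15B": ("mT5-XXL", "ViT-G"),
    "PaLI-17B": ("mT5-XXL", "ViT-e"),
}


def component_breakdown(variant):
    """Return total size (in billions) and the vision share for a variant."""
    language_name, vision_name = VARIANTS[variant]
    language_b = LANGUAGE_PARAMS_B[language_name]
    vision_b = VISION_PARAMS_B[vision_name]
    total_b = language_b + vision_b
    return {
        "language_b": language_b,
        "vision_b": vision_b,
        "total_b": total_b,
        "vision_share": vision_b / total_b,
    }


if __name__ == "__main__":
    for name in VARIANTS:
        info = component_breakdown(name)
        print(f"{name}: ~{info['total_b']:.1f}B total, "
              f"vision share ~{info['vision_share']:.0%}")
```

With these rounded figures, moving from ViT-G to ViT-e roughly doubles the share of parameters devoted to vision in the mT5-XXL models, which is the "joint scaling" the title refers to.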

1.4. Data

  • WebLI, a multilingual image-language dataset built from images and texts available on the public web, is introduced.
  • It covers 10 billion images and 12 billion alt-texts.
  • A publicly available automatic service is used to extract OCR annotations from all images, resulting in 29 billion image-OCR pairs.
  • The dataset is filtered to its highest-quality subset, retaining only the top-scoring 10% of the original WebLI image-text pairs (about 1B examples).
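
The top-10% filtering step can be sketched as a simple rank-and-truncate operation. This is only a minimal illustration: the function `keep_top_fraction` and the per-pair quality score are hypothetical, and this review does not specify how the actual scores are computed or how the filtering is run at WebLI scale.

```python
def keep_top_fraction(scored_pairs, fraction=0.10):
    """Keep the highest-scoring `fraction` of (example_id, score) pairs.

    `score` stands in for whatever quality score the pipeline assigns to an
    image-text pair; it is a placeholder here, not the authors' scoring model.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, int(len(scored_pairs) * fraction))
    ranked = sorted(scored_pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:n_keep]


if __name__ == "__main__":
    # Toy data: 20 pairs with scores 0.00, 0.05, ..., 0.95.
    toy_pairs = [(i, i / 20) for i in range(20)]
    print(keep_top_fraction(toy_pairs))  # the two highest-scoring pairs
```

At the scale described above, keeping 10% of roughly 10 billion pairs leaves about 1 billion examples, which matches the figure quoted in the text.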

1.5. Training Mixture

PaLI is trained using a mixture of 8 pre-training tasks:

  1. Span corruption on text-only data.
  2. Split-captioning on WebLI alt-text data.
  3. Captioning on CC3M-35L.
  4. OCR on WebLI OCR-text data.
  5. English and cross-lingual VQA.
  6. English and cross-lingual visual question generation (VQG).
  7. English-only object-aware (OA) VQA.
  8. Object detection.
  • Each task is specified by a training data source and a template-based prompt, and the model is trained with language-model-style teacher forcing and a standard softmax cross-entropy loss.
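
The training objective can be illustrated with a small, dependency-free sketch. Teacher forcing means the decoder input at step t is the ground-truth token from step t-1 (the target shifted right), and the loss is the mean softmax cross-entropy between each step's logits and the true token. The function names, the `bos_id` value, and the toy logits below are assumptions for illustration; the real model computes the logits with the full encoder-decoder.

```python
import math


def shift_right(target_ids, bos_id=0):
    """Decoder inputs under teacher forcing: ground truth shifted by one."""
    return [bos_id] + list(target_ids[:-1])


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def teacher_forced_loss(step_logits, target_ids):
    """Mean softmax cross-entropy over decoding steps.

    step_logits[t] is assumed to be the decoder output at step t, produced
    with shift_right(target_ids)[: t + 1] as decoder input.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


if __name__ == "__main__":
    targets = [2, 1, 3]
    print(shift_right(targets))  # [0, 2, 1]
    # Uniform logits over a 4-token vocabulary give a loss of log(4).
    uniform = [[0.0, 0.0, 0.0, 0.0]] * len(targets)
    print(teacher_forced_loss(uniform, targets), math.log(4))
```

Because every task is cast as "prompt in, text out", the same loss applies to all 8 tasks; only the data source and the prompt template change.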

The whole mixture is slightly smaller and designed to be cleaner than the datasets used in SimVLM (1.8B), CoCa (1.8B), and Flamingo (2.3B).

1.6. Model Training

  • All PaLI variants are trained for one epoch over the entire pre-training dataset (1.6B examples) at 224×224 image resolution.
  • For the largest model, PaLI-17B, an additional high-resolution (588×588) pre-training phase is performed. This phase runs for only 10k steps, covering 10M examples in total.
  • For PaLI-3B and PaLI-15B, checkpoints are fine-tuned and evaluated at 490×490 resolution.
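
A quick back-of-the-envelope calculation shows why the high-resolution phase is kept short. The sketch below assumes a ViT patch size of 14 pixels (as in ViT-G/14; the patch size is not stated in this review) and counts how many visual tokens each resolution produces, then derives the implied examples per step from the 10k-step, 10M-example figures above. The helper names are illustrative.

```python
def num_patches(image_size, patch_size=14):
    """Number of non-overlapping square patches (visual tokens) per image.

    patch_size=14 is an assumption based on the ViT-G/14 naming.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def examples_per_step(total_examples, steps):
    """Average number of examples consumed per training step."""
    return total_examples / steps


if __name__ == "__main__":
    for size in (224, 490, 588):
        print(f"{size}x{size}: {num_patches(size)} visual tokens")
    # 224x224: 256, 490x490: 1225, 588x588: 1764

    # High-resolution phase: 10M examples over 10k steps.
    print(examples_per_step(10_000_000, 10_000))  # 1000.0

    # Share of the 1.6B-example pre-training set seen at high resolution.
    print(10_000_000 / 1_600_000_000)  # 0.00625
```

Under this assumption, a 588×588 image yields about 7 times as many visual tokens as a 224×224 image, so running the high-resolution phase on well under 1% of the pre-training examples keeps its cost manageable.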

2. Results

2.1. Image Captioning


PaLI-17B outperforms all prior models on recognizing and describing long-tail objects outside of COCO’s domain.

On multilingual captioning (Crossmodal-3600), PaLI outperforms the previous SOTA by large margins.

2.2. VQA


On VQAv2, PaLI achieves 84.3 accuracy, outperforming previous SOTA.


Table 4 of the paper shows significant gains on both cross-lingual and multilingual VQA benchmarks (xGQA and MaXM) across all languages.

2.3. Zero-Shot Image Classification

PaLI-17B is significantly better than the smaller variants. In the zero-shot setting, PaLI outperforms the 1-shot result from Flamingo.

2.4. Ablation Studies

Jointly scaling the capacity of both components leads to performance improvements.

2.5. Qualitative Results

  • More examples are shown above.


