Brief Review — PaLI: A Jointly-Scaled Multilingual Language-Image Model
PaLI, by Google Research
2023 ICLR, Over 340 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] [VLMo] [BEiT-3] [GLIP] 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa]
- PaLI (Pathways Language and Image model) is proposed, which generates text based on visual and textual inputs.
- Joint scaling of the vision and language components is also considered, for which a large, 4-billion-parameter ViT (ViT-e) is trained.
- A large multilingual mix of pre-training tasks is created, based on a new image-text training set containing 10B images and texts in over 100 languages.
Outline
- PaLI
- Results
1. PaLI
PaLI accepts as input an image and text string, and generates text as output.
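Below is a minimal sketch of this interface, assuming a ViT-style image encoder and a HuggingFace-style mT5 encoder-decoder; the class and argument names are my own, not from the paper's code. The image is encoded into visual tokens, projected to the text model's width, concatenated with the embedded input text, and passed to the encoder-decoder to produce (or score) the output text.

```python
import torch
import torch.nn as nn

class PaLISketch(nn.Module):
    """Hypothetical wrapper: image + text in, text out."""
    def __init__(self, vision_encoder, seq2seq_lm, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder         # e.g. a ViT returning per-patch embeddings
        self.seq2seq_lm = seq2seq_lm                 # e.g. an mT5-style encoder-decoder
        self.proj = nn.Linear(vision_dim, text_dim)  # map visual tokens to the text model's width

    def forward(self, image, text_embeds, target_ids):
        # Encode the image into a sequence of visual tokens and project them.
        visual_tokens = self.proj(self.vision_encoder(image))  # (B, N_patches, text_dim)
        # Prepend the visual tokens to the embedded input text; the encoder-decoder
        # then scores / generates the target text (HuggingFace-style kwargs assumed).
        multimodal_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.seq2seq_lm(inputs_embeds=multimodal_inputs, labels=target_ids)
```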
1.1. Visual Component
The largest vanilla ViT architecture at the time, named ViT-e, is introduced and trained. ViT-e has the same architecture and uses the same training recipe as the 1.8B-parameter ViT-G, while scaling up to 4B parameters.
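As a rough sanity check on this kind of scaling, the parameter count of a ViT's transformer blocks (ignoring embeddings, layer norms, and the patch projection) is approximately depth × (4·width² + 2·width·mlp_dim). The configurations below are illustrative values in the ViT-G / ViT-e range, not the exact published hyperparameters.

```python
def approx_vit_params(depth: int, width: int, mlp_dim: int) -> int:
    """Rough parameter count of a ViT's transformer blocks only
    (Q/K/V/output projections plus the two MLP layers per block)."""
    attn_params = 4 * width * width
    mlp_params = 2 * width * mlp_dim
    return depth * (attn_params + mlp_params)

# Illustrative configurations (approximate, not the exact published hyperparameters):
print(f"~ViT-G scale: {approx_vit_params(48, 1664, 8192) / 1e9:.1f}B params")   # ~1.8B
print(f"~ViT-e scale: {approx_vit_params(56, 1792, 15360) / 1e9:.1f}B params")  # ~3.8B
```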
1.2. Language Component
- The pre-trained mT5-Large (1B parameters) and mT5-XXL (13B parameters) are used.
- A mixture of many tasks (detailed in Section 1.5 below) is used for training.
1.3. Overall Model
- Three model sizes are considered (see the config sketch after this list):
- PaLI-3B, where the language component is initialized from mT5-Large (1B parameters), and the vision component is ViT-G (1.8B parameters).
- PaLI-15B, where the language component is initialized from mT5-XXL (13B parameters), and the vision component is ViT-G (1.8B parameters).
- PaLI-17B, where the language model is initialized from mT5-XXL, and the vision component is the newly-trained ViT-e model (4B parameters).
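For reference, here are the three pairings above written out as a small config table; the field names and structure are my own, while the components and rounded sizes follow the bullets above.

```python
PALI_VARIANTS = {
    "PaLI-3B":  {"language_model": "mT5-Large (1B)", "vision_model": "ViT-G (1.8B)"},
    "PaLI-15B": {"language_model": "mT5-XXL (13B)",  "vision_model": "ViT-G (1.8B)"},
    "PaLI-17B": {"language_model": "mT5-XXL (13B)",  "vision_model": "ViT-e (4B)"},
}
```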
1.4. Data
- WebLI, a multilingual image-language dataset built from images and texts available on the public web, is introduced.
- It covers 10 billion images and 12 billion alt-texts.
- A publicly available automatic service is used to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs.
- The dataset is filtered to the highest-quality subset, retaining only the top 10% highest-scoring WebLI image-text pairs (about 1B examples), as sketched below.
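A minimal sketch of this kind of quality filtering, assuming some per-pair quality score is available; the scoring function here is a stand-in, not the paper's actual scorer.

```python
def keep_top_scoring(pairs, score_fn, keep_fraction=0.10):
    """pairs: iterable of (image, alt_text) tuples; score_fn(image, text) -> quality score.
    Returns the top `keep_fraction` of pairs by score (top 10% by default)."""
    scored = sorted(pairs, key=lambda pair: score_fn(*pair), reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]
```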
1.5. Training Mixture
PaLI is trained using a mixture of 8 pre-training tasks:
- Span corruption on text-only data.
- Split-captioning on WebLI alt-text data.
- Captioning on CC3M-35L.
- OCR on WebLI OCR-text data.
- English and cross-lingual VQA.
- English and cross-lingual visual question generation (VQG).
- English-only Object-Aware (OA) VQA.
- Object detection.
- Each task is specified using a training data source and a template-based prompt, and the model is trained with language-model-style teacher forcing using a standard softmax cross-entropy loss (see the loss sketch below).
The whole mixture (1.6B examples) is slightly smaller than, and designed to be cleaner than, the datasets used in SimVLM (1.8B), CoCa (1.8B), and Flamingo (2.3B).
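A minimal sketch of the teacher-forced softmax cross-entropy objective mentioned above, in PyTorch. Here `logits` would come from the text decoder conditioned on the image and the prompt, and `target_ids` holds the ground-truth output tokens; padding positions are masked out. This is a generic seq2seq loss, not code from the paper.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq_len, vocab)
        target_ids.reshape(-1),               # flatten to (batch*seq_len,)
        ignore_index=pad_id,                  # do not penalize padded positions
    )
```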
1.6. Model Training
- All PaLI variants are trained for one epoch over the entire pre-training dataset (1.6B examples) at 224×224 image resolution (see the schedule sketch after this list).
- For the largest model, PaLI-17B, an additional high-resolution (588×588) phase is performed. This phase runs for only 10k steps, covering 10M examples in total.
- PaLI-3B and PaLI-15B are fine-tuned, and checkpoints at 490×490 resolution are evaluated.
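The two pre-training phases above, written as a compact schedule; the values are copied from the bullets, while the structure and field names are my own.

```python
PRETRAIN_SCHEDULE = [
    # Phase 1: one epoch over the full 1.6B-example mixture at 224x224, all variants.
    {"models": ["PaLI-3B", "PaLI-15B", "PaLI-17B"], "resolution": 224,
     "examples": 1_600_000_000},
    # Phase 2: short high-resolution phase for PaLI-17B only.
    {"models": ["PaLI-17B"], "resolution": 588, "steps": 10_000,
     "examples": 10_000_000},
]
```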
2. Results
2.1. Image Captioning
PaLI-17B outperforms all prior models on recognizing and describing long-tail objects outside of COCO’s domain.
PaLI outperforms previous SOTA by large margins.
2.2. VQA
On VQAv2, PaLI achieves 84.3 accuracy, outperforming the previous SOTA.
Table 4 above shows significant gains on both benchmarks across all languages.
2.3. Zero-Shot Image Classification
PaLI-17B is significantly better than the smaller variants. In the zero-shot setting, PaLI outperforms Flamingo's 1-shot result.
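One common way to use a generative image-text model for zero-shot classification, shown here only as an illustration and not necessarily the paper's exact protocol: score each candidate class name by the log-likelihood the model assigns to it as output text for the image, then predict the best-scoring class.

```python
def zero_shot_classify(image, class_names, log_likelihood_fn):
    """log_likelihood_fn(image, text) -> float, e.g. summed token log-probabilities
    of `text` under the image-conditioned decoder. Returns the best-scoring class."""
    scores = {name: log_likelihood_fn(image, name) for name in class_names}
    return max(scores, key=scores.get)
```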
2.4. Ablation Studies
Jointly scaling the capacity of both the vision and the language components leads to performance improvements.
2.5. Qualitative Results
- More examples are shown above.