Brief Review — PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X, by Scaling Up PaLI
PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X, by Google Research
2023 arXiv v1, Over 80 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa] [Florence-2] [PaLI]
==== My Other Paper Readings Are Also Over Here ====
Outline
- PaLI-X
- Results
1. PaLI-X
- The PaLI-X model architecture follows the encoder-decoder architecture: image(s) are processed by a ViT encoder, with the resulting visual embeddings fed to an encoder-decoder backbone, along with embeddings from additional text input (e.g., question / prefix / prompt).
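As a rough illustration of this flow, here is a minimal sketch with stand-in components; the function names, dimensions, and token ids below are hypothetical placeholders for illustration, not the actual PaLI-X implementation.

```python
import numpy as np

# Hypothetical embedding width and sequence lengths, chosen only for illustration.
D_MODEL, N_VISUAL_TOKENS, N_TEXT_TOKENS = 2048, 256, 16

def vit_encode(images):
    """Stand-in for the ViT visual encoder: one embedding per image patch."""
    return np.random.randn(N_VISUAL_TOKENS, D_MODEL)

def embed_text(token_ids):
    """Stand-in for the backbone's text token embedding lookup."""
    return np.random.randn(len(token_ids), D_MODEL)

def encoder_decoder_generate(input_embeddings):
    """Stand-in for the encoder-decoder backbone; returns generated token ids."""
    return [42]

visual = vit_encode(images=None)                       # visual embeddings from the ViT encoder
text = embed_text(list(range(N_TEXT_TOKENS)))          # question / prefix / prompt embeddings
multimodal_input = np.concatenate([visual, text], 0)   # one input sequence for the backbone
answer_ids = encoder_decoder_generate(multimodal_input)
```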
1.1. Visual Component
- The visual backbone is scaled to 22B parameters.
- An OCR-based pretraining is used as follows: images from the WebLI dataset [5] are annotated with OCR-text detected by GCP Vision API.
- The encoder is then further pre-trained with a mixture of the original JFT-based classification task and a new OCR-based classification task.
- PaLI-X is designed to take n >= 1 images as inputs (for few-shot and video understanding).
- For an n-frame input with k patches per frame, the resulting visual input has n × k tokens.
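As a shape-only sketch of this flattening (the dimensions below are illustrative, not the real PaLI-X values), n frames of k patch embeddings each become n × k visual tokens:

```python
import numpy as np

n, k, d = 8, 256, 1024                       # hypothetical: frames, patches per frame, embedding dim
per_frame = np.random.randn(n, k, d)         # per-frame patch embeddings from the ViT encoder
visual_tokens = per_frame.reshape(n * k, d)  # flatten frames into one token sequence
assert visual_tokens.shape == (n * k, d)     # n * k visual tokens enter the backbone
```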
1.2. Overall Model
- The encoder-decoder backbone is initialized from a variant of the UL2 encoder-decoder model that uses 32B parameters.
- The architecture of this variant has 50 layers in both encoder and decoder (up from 32 layers in UL2), and is pretrained on a mixture of text data similar to UL2.
- The visual embeddings, after going through a projection layer, are concatenated with the token embeddings of the text input, and fed to the encoder-decoder backbone.
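A minimal shape-level sketch of this step, with illustrative (not actual) dimensions: the visual embeddings are projected to the backbone's width and concatenated with the embedded text tokens.

```python
import numpy as np

d_vit, d_model = 1024, 2048                         # hypothetical widths, for illustration only
visual = np.random.randn(2048, d_vit)               # n * k visual tokens from the ViT encoder
text = np.random.randn(16, d_model)                 # embedded text tokens (question / prompt)

projection = np.random.randn(d_vit, d_model)        # stand-in for the learned projection layer
encoder_input = np.concatenate([visual @ projection, text], axis=0)
assert encoder_input.shape == (2048 + 16, d_model)  # fed to the UL2-style encoder-decoder
```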
1.3. Pretraining Data and Mixture
- The main pretraining data for the model is based on WebLI [5], consisting of roughly one billion images with alt-texts from the web and OCR annotations (using the GCP Vision API), covering over 100 languages.
- In addition to WebLI ⟨image, text⟩ pairs, Episodic WebLI data is introduced here, where each episode corresponds to a set of such pairs to encourage attention among examples in an “episode”. This new dataset (with 75M episodes and around 400M images in total) is important for developing the few-shot capabilities of the model.
- The pretraining mixture consists of diverse data and objectives, e.g. object detection, captioning, and video question answering.
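A minimal sketch of sampling from such a multi-task mixture; the task names and mixing weights below are illustrative assumptions, not the actual PaLI-X mixture described in the paper.

```python
import random

# Hypothetical task names and mixing weights, for illustration only.
MIXTURE = {
    "captioning": 0.4,
    "object_detection": 0.2,
    "ocr_text_reading": 0.2,
    "video_question_answering": 0.2,
}

def sample_pretraining_task():
    """Pick the task for the next pretraining example according to the mixture weights."""
    tasks, weights = zip(*MIXTURE.items())
    return random.choices(tasks, weights=weights, k=1)[0]

print(sample_pretraining_task())
```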
2. Results
- Figure 1 Left: Scaling leads to large improvements over the results of the PaLI model, and also over specialized large-scale models trained specifically for certain tasks, often with the help of much larger text-only LLMs.
- Figure 1 Right: PaLI-X improves both state-of-the-art results and the Pareto frontier for fine-tuning and few-shot configurations.
- (Please read the paper directly for the details of each experiment.)