Brief Review — PaLI-X: On Scaling up a Multilingual Vision and Language Model

PaLI-X, By Scaling Up PaLI

Sik-Ho Tsang

PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X, by Google Research
2023 arXiv v1, Over 80 Citations (Sik-Ho Tsang @ Medium)

Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa] [Florence-2] [PaLI]
==== My Other Paper Readings Are Also Over Here ====

  • Similar to other scaling-up papers such as ViT-G, PaLI is scaled up into PaLI-X, a multilingual vision and language model, both in the size of its components and in the breadth of its training task mixture.

Outline

  1. PaLI-X
  2. Results

1. PaLI-X

  • The PaLI-X model architecture follows the encoder-decoder architecture: image(s) are processed by a ViT encoder, with the resulting visual embeddings fed to an encoder-decoder backbone, along with embeddings from additional text input (e.g., question / prefix / prompt).

1.1. Visual Component

  • The visual backbone is scaled to 22B parameters.
  • An OCR-based pretraining objective is used: images from the WebLI dataset [5] are annotated with OCR text detected by the GCP Vision API.
  • The encoder is then further pre-trained with a mixture of the original JFT-based classification task and a new OCR-based classification task.
  • PaLI-X is designed to take n >= 1 images as inputs (for few-shot and video understanding).
n-frame input in PaLI-X
  • For an n-frame input with k patches per frame, the resulting visual input has n × k tokens, as illustrated in the sketch below.
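
A minimal NumPy sketch of this token count; the frame count, patch count, and embedding width here are made-up values for illustration, not PaLI-X's actual configuration:

```python
import numpy as np

# Hypothetical sizes for illustration only (not PaLI-X's actual configuration).
n_frames = 4              # n >= 1 images/frames per example
patches_per_frame = 256   # k patches produced by the ViT for each frame
d_visual = 768            # width of the ViT output embeddings (made up)

# Stand-in for the ViT output: one embedding per patch, per frame.
per_frame_tokens = np.random.randn(n_frames, patches_per_frame, d_visual)

# Frames are flattened along the token axis, giving n*k visual tokens.
visual_tokens = per_frame_tokens.reshape(n_frames * patches_per_frame, d_visual)
print(visual_tokens.shape)  # (1024, 768): n*k tokens, each of width d_visual
```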

1.2. Overall Model

  • The encoder-decoder backbone is initialized from a 32B-parameter variant of the UL2 encoder-decoder model.
  • The architecture of this variant has 50 layers in both encoder and decoder (up from 32 layers in UL2), and is pretrained on a mixture of text data similar to UL2.
  • The visual embeddings, after going through a projection layer, are concatenated with the token embeddings of the text input, and fed to the encoder-decoder backbone.
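
A minimal NumPy sketch of this fusion step; the dimensions are invented, and a random matrix stands in for the learned projection layer:

```python
import numpy as np

# Invented dimensions for illustration (not the real PaLI-X widths).
num_visual_tokens, d_visual = 1024, 768   # n*k tokens from the ViT
num_text_tokens, d_model = 32, 1536       # text prompt tokens, backbone width

visual_tokens = np.random.randn(num_visual_tokens, d_visual)
text_embeddings = np.random.randn(num_text_tokens, d_model)

# Random stand-in for the learned projection: maps ViT width -> backbone width.
W_proj = np.random.randn(d_visual, d_model) / np.sqrt(d_visual)
projected_visual = visual_tokens @ W_proj           # (num_visual_tokens, d_model)

# Concatenate visual and text tokens into one sequence for the encoder-decoder.
encoder_input = np.concatenate([projected_visual, text_embeddings], axis=0)
print(encoder_input.shape)  # (1056, 1536): visual tokens followed by text tokens
```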

1.3. Pretraining Data and Mixture

  • The main pretraining data for the model is based on WebLI [5], consisting of roughly one billion images with alt-texts from the web and OCR annotations (using the GCP Vision API), covering over 100 languages.
  • In addition to WebLI ⟨image, text⟩ pairs, Episodic WebLI data is introduced here, where each episode corresponds to a set of such pairs to encourage attention among examples in an “episode”. This new dataset (with 75M episodes and around 400M images in total) is important for developing the few-shot capabilities of the model.
  • The pretraining mixture consists of diverse data and objectives, e.g., object detection, captioning, and video question answering.
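
To make the notion of a task mixture concrete, here is a toy sketch of sampling a pretraining task per example; the task names echo the objectives mentioned above, but the weights are invented:

```python
import random

# Invented mixture weights; the actual PaLI-X mixture proportions differ.
MIXTURE = {
    "webli_captioning": 0.40,
    "ocr_understanding": 0.20,
    "object_detection": 0.15,
    "visual_question_answering": 0.15,
    "video_question_answering": 0.10,
}

def sample_task(rng=random):
    """Pick the pretraining task for the next example according to its weight."""
    tasks, weights = zip(*MIXTURE.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

# Each pretraining example is drawn from one of the tasks in the mixture.
print([sample_task() for _ in range(5)])
```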

2. Results

By Scaling Up, PaLI-X Outperforms PaLI
  • Figure 1 Left: Scaling leads to large improvements over the results of the PaLI model, and also over specialized large-scale models trained specifically for certain tasks, often with the help of much larger text-only LLMs.
  • Figure 1 Right: PaLI-X improves both state-of-the-art results and the Pareto frontier for fine-tuning and few-shot configurations.
Qualitative Examples
  • (Please read the paper directly for the details of each experiment.)
