Brief Review — PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X, by Scaling Up PaLI
PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X, by Google Research
2023 arXiv v1, Over 80 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT] [CoCa] [Florence-2] [PaLI]
==== My Other Paper Readings Are Also Over Here ====
Outline
- PaLI-X
- Results
1. PaLI-X
- The PaLI-X model architecture follows the encoder-decoder architecture: image(s) are processed by a ViT encoder, with the resulting visual embeddings fed to an encoder-decoder backbone, along with embeddings from additional text input (e.g., question / prefix / prompt).
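As a rough illustration of this flow, here is a minimal sketch with stand-in components; the function names, dimensions, and token ids below are hypothetical placeholders for illustration, not the actual PaLI-X implementation.

```python
import numpy as np

# Hypothetical embedding width and sequence lengths, chosen only for illustration.
D_MODEL, N_VISUAL_TOKENS, N_TEXT_TOKENS = 2048, 256, 16

def vit_encode(images):
    """Stand-in for the ViT visual encoder: one embedding per image patch."""
    return np.random.randn(N_VISUAL_TOKENS, D_MODEL)

def embed_text(token_ids):
    """Stand-in for the backbone's text token embedding lookup."""
    return np.random.randn(len(token_ids), D_MODEL)

def encoder_decoder_generate(input_embeddings):
    """Stand-in for the encoder-decoder backbone; returns generated token ids."""
    return [42]

visual = vit_encode(images=None)                       # visual embeddings from the ViT encoder
text = embed_text(list(range(N_TEXT_TOKENS)))          # question / prefix / prompt embeddings
multimodal_input = np.concatenate([visual, text], 0)   # one input sequence for the backbone
answer_ids = encoder_decoder_generate(multimodal_input)
```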
1.1. Visual Component
- The visual backbone is scaled to 22B parameters.
- An OCR-based pretraining is used as follows: images from the WebLI dataset [5] are annotated with OCR-text detected by GCP Vision API.
- The encoder is then further pre-trained with a mixture of the original JFT-based classification task and a new OCR-based classification task.
- PaLI-X is designed to take n >= 1 images as inputs (for few-shot and video understanding).
- For an n-frame input with k patches per frame, the resulting visual input has n × k tokens.
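As a shape-only sketch of this flattening (the dimensions below are illustrative, not the real PaLI-X values), n frames of k patch embeddings each become n × k visual tokens:

```python
import numpy as np

n, k, d = 8, 256, 1024                       # hypothetical: frames, patches per frame, embedding dim
per_frame = np.random.randn(n, k, d)         # per-frame patch embeddings from the ViT encoder
visual_tokens = per_frame.reshape(n * k, d)  # flatten frames into one token sequence
assert visual_tokens.shape == (n * k, d)     # n * k visual tokens enter the backbone
```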
1.2. Overall Model
- The encoder-decoder backbone is initialized from a variant of the UL2 encoder-decoder model that uses 32B parameters.
- The architecture of this variant has 50 layers in both encoder and decoder (up from 32 layers in UL2), and is pretrained on a mixture of text data similar to UL2.
- The visual embeddings, after going through a projection layer, are concatenated with the token embeddings of the text input, and fed to the encoder-decoder backbone.
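A minimal shape-level sketch of this step, with illustrative (not actual) dimensions: the visual embeddings are projected to the backbone's width and concatenated with the embedded text tokens.

```python
import numpy as np

d_vit, d_model = 1024, 2048                         # hypothetical widths, for illustration only
visual = np.random.randn(2048, d_vit)               # n * k visual tokens from the ViT encoder
text = np.random.randn(16, d_model)                 # embedded text tokens (question / prompt)

projection = np.random.randn(d_vit, d_model)        # stand-in for the learned projection layer
encoder_input = np.concatenate([visual @ projection, text], axis=0)
assert encoder_input.shape == (2048 + 16, d_model)  # fed to the UL2-style encoder-decoder
```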
1.3. Pretraining Data and Mixture
- The main pretraining data for the model is based on WebLI [5], consisting of roughly one billion images with alt-texts from the web and OCR annotations (using the GCP Vision API), covering over 100 languages.
- In addition to WebLI ⟨image, text⟩ pairs, Episodic WebLI data is introduced here, where each episode corresponds to a set of such pairs to encourage attention among examples in an “episode”. This new dataset (with 75M episodes and around 400M images in total) is important for developing the few-shot capabilities of the model.
- The pretraining mixture consists of diverse data and objectives, e.g. object detection, captioning, and video question answering.
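A minimal sketch of sampling from such a multi-task mixture; the task names and mixing weights below are illustrative assumptions, not the actual PaLI-X mixture described in the paper.

```python
import random

# Hypothetical task names and mixing weights, for illustration only.
MIXTURE = {
    "captioning": 0.4,
    "object_detection": 0.2,
    "ocr_text_reading": 0.2,
    "video_question_answering": 0.2,
}

def sample_pretraining_task():
    """Pick the task for the next pretraining example according to the mixture weights."""
    tasks, weights = zip(*MIXTURE.items())
    return random.choices(tasks, weights=weights, k=1)[0]

print(sample_pretraining_task())
```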
2. Results
- Figure 1 Left: Scaling leads to large improvements over the results of the PaLI model, and also over specialized large-scale models trained specifically for certain tasks, often with the help of much larger text-only LLMs.
- Figure 1 Right: PaLI-X improves both state-of-the-art results and the Pareto frontier for fine-tuning and few-shot configurations.
- (Please read the paper directly for the details of each experiment.)