Review — PaLM 2 Technical Report

PaLM 2, Outperforms PaLM, Competitive With GPT-4

Sik-Ho Tsang
4 min read · Jul 8



PaLM 2 Technical Report,
PaLM 2, by Google
2023 arXiv v1 (Sik-Ho Tsang @ Medium)


  • PaLM 2 is a Transformer-based model trained using a mixture of objectives similar to UL2.
  • PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction.
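To make "a mixture of objectives" concrete, here is a minimal sketch of sampling one of several training objectives per example. The report does not disclose the actual objectives or their weights, so everything here is an assumption for illustration: the two objectives (a prefix-LM split and a single-span corruption), the equal weights, the `<mask_0>` sentinel, and all function names are hypothetical.

```python
import random


def prefix_lm_example(tokens, rng):
    """Split a sequence into a visible prefix (input) and its continuation (target)."""
    split = rng.randint(1, len(tokens) - 1)  # both sides stay non-empty
    return tokens[:split], tokens[split:]


def span_corruption_example(tokens, rng, span_len=3):
    """Replace one contiguous span with a sentinel; the target is the removed span."""
    span_len = min(span_len, len(tokens) - 1)
    start = rng.randint(0, len(tokens) - span_len)
    inputs = tokens[:start] + ["<mask_0>"] + tokens[start + span_len:]
    targets = ["<mask_0>"] + tokens[start:start + span_len]
    return inputs, targets


# Hypothetical mixture weights; the real mixture is not published.
OBJECTIVES = [(prefix_lm_example, 0.5), (span_corruption_example, 0.5)]


def sample_training_example(tokens, rng):
    """Pick one objective according to the mixture weights and apply it."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    fns, weights = zip(*OBJECTIVES)
    objective = rng.choices(fns, weights=weights, k=1)[0]
    return objective.__name__, objective(tokens, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "PaLM 2 is trained with a mixture of objectives".split()
    for _ in range(3):
        print(sample_training_example(tokens, rng))
```

The point of the sketch is only the structure: each training example is turned into an (input, target) pair by an objective drawn at random, so a single model sees several denoising styles during pre-training.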


  1. Scaling Law Experiments & Model Variants
  2. Training Dataset
  3. Evaluations
PaLM 2 Introduction (by Google AI on YouTube)

1. Scaling Law Experiments & Model Variants

Scaling Law for 4 Compute Budgets
  • Several differently sized models are trained with 4 different compute budgets: 1×10¹⁹, 1×10²⁰, 1×10²¹, and 1×10²² FLOPs.

The findings are similar to those of Chinchilla: training data (D) and model size (N) should grow in equal proportions as the FLOPs budget increases. Finally, several models from 400M to 15B parameters are trained.
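The "equal proportions" result can be illustrated with a short calculation. The sketch below assumes the common approximation C ≈ 6·N·D for training FLOPs and a fixed tokens-per-parameter ratio; the default ratio of 20 is a widely quoted rule of thumb used here purely for illustration and is not a value taken from the PaLM 2 report. The function name is hypothetical.

```python
import math


def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget C into model size N and training tokens D.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)). Both N and D then scale as
    sqrt(C), i.e. they grow in equal proportions with the budget.
    """
    if flops_budget <= 0 or tokens_per_param <= 0:
        raise ValueError("budget and ratio must be positive")
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # The four compute budgets used in the scaling-law experiments.
    for budget in (1e19, 1e20, 1e21, 1e22):
        n, d = compute_optimal_allocation(budget)
        print(f"C={budget:.0e} FLOPs -> N~{n:.2e} params, D~{d:.2e} tokens")
```

Under these assumptions, a 100× larger budget yields a 10× larger model trained on 10× more tokens. The absolute numbers depend entirely on the assumed ratio, so only the proportional growth should be read from this sketch.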

  • Three variants of PaLM 2 are trained: Small (S), Medium (M), and Large (L). Unless indicated otherwise, results refer to the Large (L) variant.
  • (Recently, authors and corporations tend to keep their own "secret sauce" and do not disclose details such as model architecture, training objectives and strategies, or datasets, citing AI safety and security concerns, and probably commercial interests as well.)

2. Training Dataset

The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data.

  • PaLM 2 is trained on a dataset that includes a higher percentage of non-English data than previous large language models, which is beneficial for multilingual tasks.

PaLM 2 is also trained on parallel data covering hundreds of languages in the form of source and target text pairs where one side is in English.

  • Several data cleaning and quality filtering methods are employed.
  • Even though PaLM 2 is trained on a smaller proportion of English data than PaLM, significant improvements are still observed on English evaluation datasets, which points to higher data quality.

PaLM 2 was also trained with an increased context length, which benefits tasks such as long dialog, long-range reasoning and comprehension, and summarization.

3. Evaluations

3.1. Language Proficiency Exams


PaLM 2 outperforms PaLM across all exams and achieves a passing grade for every language, demonstrating language proficiency across all evaluated languages.

3.2. English QA and Classification


Even the smallest PaLM 2 variant achieves performance competitive with the much larger PaLM 540B model, while PaLM 2-M already consistently outperforms PaLM.

3.3. Reasoning


PaLM 2 outperforms PaLM across all datasets and achieves results competitive with GPT-4.

3.4. Coding


PaLM 2-S* outperforms PaLM on all but two languages, with surprisingly little degradation on low-resource languages like Julia and Haskell. For instance, PaLM 2-S* improves upon the much larger PaLM-Coder-540B by 6.3× on Haskell and by 4.7× on Julia.

Example of Fixing a Bug With Korean Comments Added

3.5. Translation


PaLM 2 improves translation quality over both PaLM and Google Translate.

Examples of Questions with Translations
  • (There are also natural language generation (NLG), memorization, toxicity & bias experiments. Please feel free to read the paper directly if interested.)


