Brief Review — PaLM: Scaling Language Modeling with Pathways

540 Billion Parameters Large Language Model (LLM)

Sik-Ho Tsang
6 min read · Apr 15, 2023
PaLM Trees (Image from Pexels: MarcTutorials)

PaLM: Scaling Language Modeling with Pathways,
PaLM, by Google Research,
2022 arXiv v5, Over 600 Citations (Sik-Ho Tsang @ Medium)
NLP, NMT, LLM, Language Model, Transformer

2.1. Language Model
1991 … 2021 [Performer] [gMLP] [Roformer] [PPBERT] [DeBERTa] [DeLighT] [Transformer-LS] 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] 2023 [GPT-4]
2.2. Machine Translation
2013 2021 [ResMLP] [GPKD] [Roformer] [DeLighT] 2022 [DeepNet]
==== My Other Paper Readings Are Also Over Here ====

  • Pathways Language Model (PaLM) is proposed, which is a 540-billion parameter, densely activated, Transformer language model.
  • PaLM is trained on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. It demonstrates continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
  • (We now have many large models named after animal and plant species, such as Jurassic-1, Gopher, Flamingo, and Chinchilla, lol)

Outline

  1. Pathways Language Model (PaLM)
  2. English NLP Results
  3. BIG-Bench Results
  4. Reasoning Results
  5. Code Task Results
  6. Translation Results
  • (This report/paper has 87 pages! For each results section, I only select a few results to show. Please feel free to read it if you’re interested.)

1. Pathways Language Model (PaLM)

1.1. Model Architecture

PaLM uses a standard Transformer model architecture in a decoder-only setup (i.e., each timestep can only attend to itself and past timesteps), with the following modifications:

  • 1) SwiGLU activation: is used for the MLP intermediate activations because SwiGLU has been shown to significantly increase quality compared to standard ReLU, GELU, or Swish activations.
  • 2) A “Parallel” formulation in each Transformer block: as in GPT-J-6B, is used, rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as: y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
  • The parallel formulation can be written as: y = x + Attention(LayerNorm(x)) + MLP(LayerNorm(x)) (see the sketch after this list)
  • The parallel formulation results in roughly 15% faster training speed at large scales. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale.
  • 3) Multi-Query Attention: The key/value projections are shared across all heads, i.e., “key” and “value” are each projected to shape [1, h], but “query” is still projected to shape [k, h]. This is found to have a neutral effect on model quality and training speed, but results in a significant cost savings at autoregressive decoding time.
  • 4) RoPE embeddings: are used rather than absolute or relative position embeddings.
  • 5) Shared Input-Output Embeddings: The input and output embedding matrices are shared, which is done frequently (but not universally) in past work.
  • 6) No Biases: No biases were used in any of the dense kernels or layer norms. This was found to increase training stability for large models.
  • 7) Vocabulary: A SentencePiece vocabulary with 256k tokens is used, which was chosen to support the large number of languages in the training corpus without excess tokenization.
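
Below is a minimal PyTorch sketch (my own illustration, not the authors’ code) of a decoder block combining the parallel formulation, a SwiGLU-style MLP, and no biases. Ordinary multi-head attention stands in for PaLM’s multi-query attention, and a causal mask would be supplied for decoder-only training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    """Transformer block in the 'parallel' formulation:
    y = x + Attention(LayerNorm(x)) + MLP(LayerNorm(x)),
    i.e. both branches read the same LayerNorm output, instead of the
    serialized y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # one shared LayerNorm for both branches
        # Standard multi-head attention as a stand-in for PaLM's multi-query attention.
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        # SwiGLU-style MLP: a SiLU-gated projection multiplied by a linear projection.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # no biases, as in PaLM
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x, attn_mask=None):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        mlp_out = self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))
        return x + attn_out + mlp_out  # parallel sum of the two branches

x = torch.randn(2, 16, 512)                        # (batch, sequence, d_model)
y = ParallelBlock(d_model=512, n_heads=8, d_ff=2048)(x)
print(y.shape)                                     # torch.Size([2, 16, 512])
```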

1.2. Model Variants

Model architecture details.

Three different model scales: 540B, 62B, and 8B parameters, are considered.

1.3. Training Dataset

Proportion of data from each source in the training dataset.
  • The PaLM pretraining dataset consists of a high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases. The dataset is a mixture of filtered webpages, books, Wikipedia, news articles, source code, and social media conversations. This dataset is based on the datasets used to train LaMDA (Thoppilan et al., 2022) and GLaM (Du et al., 2021).
  • All three models are trained on exactly one epoch of the data (shuffled identically for all models).
  • In addition to natural language data, the pretraining dataset also contains 196GB of code, obtained from open-source repositories on GitHub, including Java, HTML, JavaScript, Python, PHP, C#, XML, C++, and C.

The final PaLM dataset mixture is tabulated above.

1.4. Training Infrastructure

The Pathways system (Barham et al., 2022) scales training across two TPU v4 pods using two-way data parallelism at the pod level.
  • The Pathways system executes this two-way pod-level data parallelism, as shown above.

In brief, the program contains a component A for within-pod forward+backward computation (including within-pod gradient reduction), transfer subgraph for cross-pod gradient transfer, and a component B for optimizer update (including summation of local and remote gradients).

The Pathways program executes component A on each pod, then transfers the output gradients to the other pod, and finally, executes component B on each pod.

  • Thus, the schedule masks the cross-pod transfer latency behind computation and amortizes the cost of managing data transfers.
  • The authors also describe the practical setup in detail, e.g., the hosts of the two pods are connected via the Google datacenter network. (Please read the paper directly if interested.)
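
As a mental model, here is a toy, single-process Python sketch (my own simplification, not Pathways code) of this schedule: component A computes each pod’s gradient, the transfer step exchanges gradients across pods, and component B applies the optimizer update to the summed gradients.

```python
def component_a(params, batch):
    # Within-pod forward + backward: returns this pod's (already reduced) gradient.
    # Toy loss: mean of 0.5 * (params - x)^2 over the pod's data shard.
    return sum(params - x for x in batch) / len(batch)

def component_b(params, local_grad, remote_grad, lr=0.1):
    # Optimizer update on the sum of local and remote gradients (plain SGD here).
    return params - lr * (local_grad + remote_grad)

params = 0.0
pod_batches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # one data shard per pod

for step in range(3):
    # Component A runs on each pod (concurrently, one per pod, in the real system).
    grads = [component_a(params, batch) for batch in pod_batches]
    # Cross-pod transfer: each pod receives the other pod's gradient
    # (overlapped with compute in Pathways to hide the latency).
    # Component B on each pod: both pods apply the same update and stay in sync.
    params = component_b(params, grads[0], grads[1])
    print(f"step {step}: params = {params:.3f}")
```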
Model FLOPs utilization of PaLM and prior large models.

PaLM represents a significant step forward in LLM training efficiency.

2. English NLP Results

Results obtained by the PaLM 540B model across 29 NLP benchmarks.
  • PaLM model is evaluated on the same set of 29 English benchmarks as Du et al. (2021) and Brown et al. (2020).

PaLM 540B outperforms prior SOTA on 24 of the 29 tasks in the 1-shot setting and 28 of the 29 tasks in the few-shot setting. Interestingly, PaLM 540B outperforms prior SOTA by more than 10 points in the few-shot setting on some of the Reading Comprehension and NLI tasks.

PaLM 540B outperforms a similar sized model (Megatron-Turing NLG 530B) on all benchmarks. This indicates that the pretraining dataset, training strategy, and the number of tokens observed during training also play a significant role in achieving these results.

Results on SuperGLUE dev set.

PaLM obtains competitive close-to-SOTA performance.

3. BIG-Bench Results

BIG-bench evaluation of PaLM. (left) Evaluation of PaLM, GPT-3, Gopher, and Chinchilla. (right) Evaluation of PaLM on a larger set of 150 BIG-bench tasks.

On 58 tasks, PaLM significantly outperforms GPT-3, Gopher, and Chinchilla, and 5-shot PaLM 540B achieves a higher score than the average score of the humans asked to solve the same tasks.

Distribution of score difference in “normalized preferred metric” between PaLM 540B 5-shot and the prior SOTA across a common subset of 58 BIG-bench text tasks.

5-shot PaLM 540B outperforms the prior SOTA on 44 out of the 58 common tasks, with per-task results shown as above.

4. Reasoning Results

Chain of thought prompting allows language models to better perform multi-step reasoning tasks such as math word problems.
  • Reasoning tasks are tasks that require multi-step arithmetic or commonsense logical reasoning to produce the correct answer.
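
To make this concrete, below is a toy 1-shot chain-of-thought prompt in the style used for GSM8K (my own illustrative example, not one of the paper’s 8 exemplars): the exemplar answer spells out the intermediate reasoning, so the model is nudged to reason step by step before giving its final answer.

```python
# A toy chain-of-thought prompt: a worked exemplar precedes the new question.
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: A baker made 24 muffins and sold 15 of them. How many muffins are left?
A:"""
# The model is expected to continue with the reasoning and then the answer,
# e.g. "The baker sold 15 of 24 muffins. 24 - 15 = 9. The answer is 9."
```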
8-shot evaluation of PaLM on GSM8K with chain-of-thought in comparison to prior SOTA.

PaLM 540B achieves a performance of 58%, which outperforms the prior SOTA of 55% from Cobbe et al. (2021) as shown above.

5. Code Task Results

Examples from the PaLM-Coder 540B model. (top left) GSM8K-Python question converted from the OpenAI GSM8K math dataset. (bottom left) TransCoder example translating a simple function from C++ to Python. (right) Converted HumanEval example.
  • Some examples from the code task datasets are shown above.
Results obtained by the PaLM 540B and PaLM-Coder 540B models across code synthesis and software engineering tasks.
  • PaLM-Coder is PaLM further fine-tuned on code in two stages.

The performance of PaLM-Coder 540B increases even further, achieving 88.4% pass@100 on HumanEval and 80.8% pass@80 on MBPP.
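
For reference, pass@k on benchmarks like HumanEval and MBPP is commonly estimated with the unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch (my own code; the sample counts are made-up illustrative numbers):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k samples,
    drawn without replacement from n generated programs of which c pass the
    unit tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 57 of them pass -> pass@100.
print(round(pass_at_k(n=200, c=57, k=100), 3))
```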

6. Translation Results

Comparison of PaLM on 0-shot translation tasks. (left) Comparison with previous large language models. (right) Comparison of different PaLM model scales.

Left: PaLM outperforms all the baselines, at times very decisively, with up to a 13 BLEU difference.

Right: Scaling PaLM from 62B to 540B results in several drastic jumps in BLEU scores that do not follow the “power law” rule of thumb.

Others

  • There are still other results, e.g.: Multilingual Natural Language Generation, Multilingual Question Answering.
  • Also, other issues are discussed, e.g.: Memorization, Dataset Contamination, Bias, Ethical Issues, Open Questions.
  • Please feel free to read the paper directly if interested.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.