Brief Review — PaLM: Scaling Language Modeling with Pathways
- Pathways Language Model (PaLM) is proposed: a 540-billion-parameter, densely activated Transformer language model.
- PaLM is trained on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. It demonstrates continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
- (We now have many large models named after animal and plant species or eras, such as Jurassic-1, Gopher, Flamingo, and Chinchilla, lol)
- Pathways Language Model (PaLM)
- English NLP Results
- BIG-Bench Results
- Reasoning Results
- Code Task Results
- Translation Results
- (This report/paper is 87 pages long! For each results section, I show only a few selected results. Please feel free to read it if you’re interested.)
1. Pathways Language Model (PaLM)
1.1. Model Architecture
PaLM uses a standard Transformer model architecture in a decoder-only setup (i.e., each timestep can only attend to itself and past timesteps), with the following modifications:
- 1) SwiGLU activation: SwiGLU is used for the MLP intermediate activations because it has been shown to significantly increase quality compared to standard ReLU, GELU, or Swish activations.
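- 2) A “Parallel” formulation in each Transformer block: as in GPT-J-6B, a “parallel” formulation is used rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as: y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
- The parallel formulation can be written as: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))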
- The parallel formulation results in roughly 15% faster training speed at large scales. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale.
- 3) Multi-Query Attention: In the standard formulation with k attention heads of head size h, “query”, “key”, and “value” are each projected to shape [k, h]. Here, the key/value projections are shared across heads, i.e. “key” and “value” are projected to [1, h], but “query” is still projected to shape [k, h]. This is found to have a neutral effect on model quality and training speed, but results in significant cost savings at autoregressive decoding time.
- 4) RoPE embeddings: are used rather than absolute or relative position embeddings.
- 5) Shared Input-Output Embeddings: The input and output embedding matrices are shared, which is done frequently (but not universally) in past work.
- 6) No Biases: No biases are used in any of the dense kernels or layer norms. This is found to increase training stability for large models.
- 7) Vocabulary: A SentencePiece vocabulary with 256k tokens is used, which was chosen to support the large number of languages in the training corpus without excess tokenization.
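To make modifications 1) to 3) more concrete, here is a minimal pure-Python sketch on a single token vector. It is not PaLM's implementation: the helper names (swiglu, serialized_block, parallel_block, kv_cache_floats_per_token), the bias-free and scale-free layer norm, and the idea of passing the attention and MLP sublayers in as plain functions are all assumptions made for illustration.

```python
import math

def swish(z):
    # Swish / SiLU activation: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    # x: list of length d_in; w: nested list of shape [d_in][d_out]
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def swiglu(x, w, v):
    # SwiGLU(x) = Swish(xW) * (xV), elementwise product, no bias terms
    return [swish(a) * b for a, b in zip(matvec(x, w), matvec(x, v))]

def layer_norm(x, eps=1e-6):
    # Simplified layer norm without learned scale or bias (assumption)
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def add(*vectors):
    # Elementwise sum of any number of equal-length vectors
    return [sum(vals) for vals in zip(*vectors)]

def serialized_block(x, attn, mlp, norm=layer_norm):
    # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
    h = add(x, attn(norm(x)))
    return add(x, mlp(norm(h)))

def parallel_block(x, attn, mlp, norm=layer_norm):
    # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
    n = norm(x)  # normalized once and fed to both branches
    return add(x, mlp(n), attn(n))

def kv_cache_floats_per_token(num_heads, head_dim, multi_query):
    # Keys and values cached per token and per layer during decoding.
    # Multi-query attention keeps a single shared key/value head.
    kv_heads = 1 if multi_query else num_heads
    return 2 * kv_heads * head_dim
```

With toy sizes of 8 heads and head size 64, kv_cache_floats_per_token(8, 64, False) counts 1024 cached values per token and layer, while the multi-query variant counts 128, which is where the decoding-time savings come from. In parallel_block, the MLP and attention branches read the same normalized input, so neither has to wait for the other's output.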
1.2. Model Variants
Three model scales are considered: 540B, 62B, and 8B parameters.
1.3. Training Dataset
- The PaLM pretraining dataset consists of a high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases. The dataset is a mixture of filtered webpages, books, Wikipedia, news articles, source code, and social media conversations. This dataset is based on the datasets used to train LaMDA (Thoppilan et al., 2022) and GLaM (Du et al., 2021).
- All three models are trained on exactly one epoch of the data (shuffled identically for all models).
The final PaLM dataset mixture is tabulated above.
1.4. Training Infrastructure
- The Pathways system executes two-way pod-level data parallelism, as shown above.
In brief, the program contains a component A for within-pod forward+backward computation (including within-pod gradient reduction), a transfer subgraph for cross-pod gradient transfer, and a component B for the optimizer update (including summation of local and remote gradients).
The Pathways program executes component A on each pod, then transfers the output gradients to the other pod, and finally, executes component B on each pod.
- With this design, Pathways masks the latency of dispatching computations from the single client to the remote servers. It also amortizes the cost of managing data transfers.
- The authors also describe the practical setup in detail, e.g. the hosts of the two pods are connected via the Google datacenter network. (Please read the paper directly if interested.)
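The component A / transfer / component B flow described above can be imitated on a toy problem. The sketch below is only an illustration under loud assumptions: it uses a tiny linear model with squared-error loss and plain SGD (not PaLM's model or optimizer), it runs both "pods" sequentially in one process, and the function names component_a, component_b, and train_step are made up for this example.

```python
def component_a(weights, batch):
    # Within-pod forward + backward pass for a linear model with squared error.
    # Returns the gradient summed over this pod's examples
    # (standing in for the within-pod gradient reduction).
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi
    return grad

def component_b(weights, local_grad, remote_grad, lr, n_examples):
    # Optimizer update: sum local and remote gradients, then take an SGD step.
    total = [l + r for l, r in zip(local_grad, remote_grad)]
    return [w - lr * g / n_examples for w, g in zip(weights, total)]

def train_step(weights, pod_batches, lr=0.1):
    batch_0, batch_1 = pod_batches  # each pod sees half of the global batch
    n = len(batch_0) + len(batch_1)
    # Component A runs on each pod with identical starting weights.
    grad_0 = component_a(weights, batch_0)
    grad_1 = component_a(weights, batch_1)
    # Cross-pod transfer: each pod receives the other pod's gradient,
    # then component B runs on each pod.
    new_0 = component_b(weights, grad_0, grad_1, lr, n)
    new_1 = component_b(weights, grad_1, grad_0, lr, n)
    return new_0, new_1
```

Because both pods start from the same weights and apply the same summed gradient, they end the step with identical weights, which is the property that lets each pod keep its own full replica of the parameters.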
PaLM represents a significant step forward in LLM training efficiency.
2. English NLP Results
- PaLM is evaluated on the same set of 29 English benchmarks as Du et al. (2021) and Brown et al. (2020).
PaLM 540B outperforms prior SOTA on 24 of the 29 tasks in the 1-shot setting and 28 of the 29 tasks in the few-shot setting. Interestingly, PaLM 540B outperforms prior SOTA by more than 10 points in the few-shot setting on some of the Reading Comprehension and NLI tasks.
PaLM 540B outperforms a similarly sized model (Megatron-Turing NLG 530B) on all benchmarks. This indicates that model size is not the only factor: the pretraining dataset, training strategy, and the number of tokens observed during training also play a significant role in achieving these results.
When fine-tuned on SuperGLUE, PaLM obtains competitive close-to-SOTA performance.
3. BIG-Bench Results
5-shot PaLM 540B outperforms the prior SOTA on 44 out of the 58 common tasks, with per-task results shown above.
4. Reasoning Results
- Reasoning tasks are tasks that require multi-step arithmetic or commonsense logical reasoning to produce the correct answer.
On GSM8K, PaLM 540B achieves a performance of 58%, which outperforms the prior SOTA of 55% from Cobbe et al. (2021), as shown above.
5. Code Task Results
- Some examples of code task datasets are shown above.
- PaLM-Coder is PaLM further fine-tuned on code in 2 stages.
The performance of PaLM-Coder 540B increases even further, achieving 88.4% pass@100 on HumanEval and 80.8% pass@80 on MBPP.
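For readers unfamiliar with the pass@k numbers above: pass@k is the probability that at least one of k sampled programs passes the unit tests. A common way to compute it is the unbiased estimator introduced with HumanEval (Chen et al., 2021), sketched below. That PaLM's evaluation uses exactly this estimator is my assumption, and the function name pass_at_k is just a label for this example.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is then the mean of pass_at_k over all problems. For instance, with 4 samples of which 1 is correct, pass_at_k(4, 1, 2) is 1 - 3/6 = 0.5.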
6. Translation Results
Left: PaLM outperforms all the baselines, at times very decisively with up to 13 BLEU difference.
Right: Scaling PaLM from 62B to 540B results in several drastic jumps in BLEU scores that do not follow the “power law” rule of thumb.
- The paper reports further results, e.g.: Multilingual Natural Language Generation, Multilingual Question Answering.
- Other issues are also discussed, e.g.: Memorization, Dataset Contamination, Bias, Ethical issues, Open Questions.
- Please feel free to read the paper directly if interested.