Brief Review — Flan-PaLM: Scaling Instruction-Finetuned Language Models

Flan-PaLM, PaLM Fine-Tuned Using FLAN

Sik-Ho Tsang
4 min read · Aug 29


Language models are finetuned on 1.8K tasks phrased as instructions and evaluated on unseen tasks.

Scaling Instruction-Finetuned Language Models
Flan-PaLM, by Google
2022 arXiv v5, Over 390 Citations (Sik-Ho Tsang @ Medium)

LM Tuning / Prompting
2020 [Human Feedback Model] 2021 [T5+LM, Prompt Tuning] 2022 [GPT-3.5, InstructGPT] [LoRA] [Chain-of-Thought Prompting] [T0] [FLAN] 2023 [LIMA]

  • FLAN, a method of finetuning language models on a collection of datasets phrased as instructions, has been shown to improve model performance and generalization to unseen tasks.
  • In this paper, FLAN instruction finetuning is explored further on LLMs, with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought (CoT) data.
Updated MMLU Results
  • The paper was published in 2022; the 2023 and 2024 results in the figure are forecasts from Hypermind and Metaculus.


  1. Flan-PaLM
  2. Results

1. Flan-PaLM

  • In this paper, FLAN instruction tuning is applied at a larger model scale and a larger data scale.

1.1. Fine-Tuning Data

The finetuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks.
  • Prior literature has shown that increasing the number of tasks in finetuning with instructions improves generalization to unseen tasks.

In this paper, the number of finetuning tasks is scaled to 1,836 by combining 4 mixtures from prior work: Muffin, T0-SF, NIV2, and CoT.

  • For Muffin, T0-SF, and NIV2, instructional templates are used for each task as given by the creators of the mixtures.
  • For CoT, the authors manually write around ten instruction templates for each of the nine datasets.
Few-Shot Templates
  • To create few-shot templates, a variety of exemplar delimiters (e.g., “Q:”/“A:”) are written and applied randomly at the example level.

Figure 3 shows an example of the formatting with and without exemplars, as well as with and without chain-of-thought (CoT).
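To make the four formatting combinations concrete, here is a minimal sketch of how such prompts could be assembled. It is not the authors' code: the function names, the extra delimiter pairs, and the "So the answer is" phrasing are all hypothetical, and the sketch assumes one delimiter pair is drawn per training example.

```python
import random

# Hypothetical delimiter pairs; the text above only names "Q:"/"A:".
DELIMITERS = [("Q:", "A:"), ("Question:", "Answer:"), ("Input:", "Output:")]


def render(question, answer=None, rationale=None, delims=("Q:", "A:")):
    """Render one example; leave the answer slot open when answer is None."""
    q_tag, a_tag = delims
    text = f"{q_tag} {question}\n{a_tag}"
    if answer is None:
        return text
    if rationale is not None:
        # With CoT, the reasoning is written before the final answer.
        # The "So the answer is" wording is an assumption of this sketch.
        return f"{text} {rationale} So the answer is {answer}."
    return f"{text} {answer}"


def build_prompt(query, exemplars=(), use_cot=False, rng=random):
    """Build a zero-shot (no exemplars) or few-shot prompt for one query."""
    # One delimiter pair is drawn per call, i.e. per training example.
    delims = rng.choice(DELIMITERS)
    parts = []
    for ex in exemplars:
        rationale = ex.get("rationale") if use_cot else None
        parts.append(render(ex["question"], ex["answer"], rationale, delims))
    # The query always comes last, with its answer slot left open.
    parts.append(render(query, delims=delims))
    return "\n\n".join(parts)
```

Calling `build_prompt(query)` gives the no-exemplar format, while passing a list of exemplar dicts together with `use_cot=True` gives the few-shot CoT format.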

1.2. Models & Sizes

  • A broad range of model families is used, including T5, PaLM, and U-PaLM. These model families span a range of sizes, from Flan-T5-small (80M parameters), to PaLM and U-PaLM (540B parameters).
  • The packing technique from T5 is used to combine multiple training examples into a single sequence.
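As a rough illustration of packing, the sketch below greedily concatenates tokenized examples until a length limit is reached. The function name, the `eos_id` separator, and the segment ids are assumptions made for this sketch, not details taken from the post or from the T5 codebase.

```python
def pack_examples(examples, max_len, eos_id=1):
    """Greedily pack tokenized examples into sequences of at most max_len.

    Returns a list of (tokens, segment_ids) pairs. segment_ids record which
    example each token came from, so that attention could be masked across
    example boundaries.
    """
    packed = []
    tokens, segments = [], []
    seg = 0
    for ex in examples:
        ex = list(ex) + [eos_id]  # hypothetical end-of-sequence separator
        if len(ex) > max_len:
            ex = ex[:max_len]  # truncate over-long examples (a simplification)
        if tokens and len(tokens) + len(ex) > max_len:
            # Current sequence is full: emit it and start a new one.
            packed.append((tokens, segments))
            tokens, segments, seg = [], [], 0
        seg += 1
        tokens.extend(ex)
        segments.extend([seg] * len(ex))
    if tokens:
        packed.append((tokens, segments))
    return packed
```

Packing reduces the number of padding tokens per batch, which matters when many instruction examples are much shorter than the sequence length.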

Notably, the amount of compute used for finetuning is only a small fraction relative to the training compute, as shown in Table 2. For example, only 0.2% of the pre-training compute is used to instruction-finetune Flan-PaLM 540B.
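The percentage is simply finetuning compute divided by pre-training compute. The FLOP values below are round, illustrative numbers chosen to reproduce a 0.2% ratio; they are not quoted from Table 2.

```python
def finetune_fraction(finetune_flops, pretrain_flops):
    """Return finetuning compute as a fraction of pre-training compute."""
    return finetune_flops / pretrain_flops


# Illustrative values only (not taken from the paper's Table 2).
pretrain_flops = 2.5e24
finetune_flops = 5.0e21
print(f"{finetune_fraction(finetune_flops, pretrain_flops):.1%}")
```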

2. Results

2.1. Instruction Finetuning

Instruction Finetuning on PaLM 8B, 62B, 540B

For all three model sizes shown, multi-task instruction finetuning improves performance by a large margin compared to no finetuning. The performance gain ranges from 9.4% to 15.5%.

2.2. CoT + Instruction Finetuning

CoT Performance

Including nine datasets with chain-of-thought (CoT) annotations in the finetuning mixture improves reasoning ability. CoT prompting abilities of Flan-PaLM outperform PaLM on the four held-out evaluation benchmarks.

2.3. Zero-Shot Performance

Zero-Shot Performance
“let’s think step-by-step” Examples

On the BBH benchmark of 23 unseen, challenging BIG-Bench tasks, Flan-PaLM models achieve improved performance by leveraging CoT reasoning activated by the phrase “let’s think step-by-step”.
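A zero-shot CoT prompt of this kind can be sketched as follows. The function name and the "Q:"/"A:" layout are assumptions; only the trigger phrase comes from the text above.

```python
def zero_shot_prompt(question, use_cot=False):
    """Build a zero-shot prompt, optionally adding the CoT trigger phrase."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        # The trigger phrase nudges the model to reason before answering.
        prompt += " Let's think step-by-step."
    return prompt
```

No exemplars are included in either variant; the only difference is the trigger phrase at the start of the answer slot.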

2.4. Other Models

All Model Types

Instruction finetuning improves normalized average performance by a large margin for all model types.

2.5. Zero-Shot Prompting

Zero-Shot Prompting

Compared with Flan-PaLM, the original PaLM struggles in the zero-shot setting: it tends to repeat itself and often fails to follow the instruction.


