Brief Review — Flan-PaLM: Scaling Instruction-Finetuned Language Models
Flan-PaLM, PaLM Fine-Tuned Using FLAN
Scaling Instruction-Finetuned Language Models
Flan-PaLM, by Google
2022 arXiv v5, Over 390 Citations (Sik-Ho Tsang @ Medium)
LM Tuning / Prompting
2020 [Human Feedback Model] 2021 [T5+LM, Prompt Tuning] 2022 [GPT-3.5, InstructGPT] [LoRA] [Chain-of-Thought Prompting] [T0] [FLAN] 2023 [LIMA]
- FLAN, a method of finetuning language models on a collection of datasets phrased as instructions, has been shown to improve model performance and generalization to unseen tasks.
- In this paper, FLAN instruction finetuning is explored further on large language models, with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought (CoT) data.
- Although this article was published in 2022, the authors compare their results against the performance levels forecast for 2023 and 2024 by Hypermind and Metaculus.
1. Flan-PaLM
- In this paper, FLAN instruction tuning is applied at a larger model scale and with a larger data scale.
1.1. Fine-Tuning Data
- Prior literature has shown that increasing the number of tasks in finetuning with instructions improves generalization to unseen tasks.
- In this paper, the number of tasks is scaled to 1,836 finetuning tasks by combining four mixtures from prior work: Muffin, T0-SF, NIV2, and CoT.
- For Muffin, T0-SF, and NIV2, instructional templates are used for each task as given by the creators of the mixtures.
- For CoT, the authors manually write around ten instruction templates for each of the nine CoT datasets.
- To create few-shot templates, a variety of exemplar delimiters is written (e.g., “Q:”/“A:”) and applied randomly at the example level.
- Figure 3 in the paper shows example formatting both with and without exemplars, as well as with and without chain-of-thought (CoT); a minimal formatting sketch follows below.
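Below is a minimal Python sketch of how such instruction-formatted examples might be assembled, assuming a small illustrative delimiter list and a simple rationale-plus-answer target format; the function name, the delimiter choices, and the example question are assumptions for illustration, not the paper's actual templates.

```python
import random

# Illustrative exemplar delimiters; the paper applies a variety of such
# delimiters (e.g., "Q:"/"A:") randomly at the example level.
DELIMITERS = [("Q:", "A:"), ("Question:", "Answer:"), ("Input:", "Output:")]


def format_example(question, answer, rationale=None,
                   exemplars=(), use_cot=False, rng=random):
    """Format one training example with optional few-shot exemplars and CoT.

    `exemplars` is a sequence of (question, answer, rationale) triples;
    when `use_cot` is True, the rationale is prepended to the final answer.
    """
    q_tag, a_tag = rng.choice(DELIMITERS)  # delimiter chosen per example

    parts = []
    for ex_q, ex_a, ex_r in exemplars:  # few-shot exemplars (may be empty)
        ex_target = f"{ex_r} So the answer is {ex_a}." if use_cot and ex_r else ex_a
        parts.append(f"{q_tag} {ex_q}\n{a_tag} {ex_target}")
    parts.append(f"{q_tag} {question}\n{a_tag}")

    prompt = "\n\n".join(parts)
    target = f"{rationale} So the answer is {answer}." if use_cot and rationale else answer
    return prompt, target


# Usage: a zero-shot CoT example (no exemplars, rationale included in the target).
prompt, target = format_example(
    "A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?",
    answer="3",
    rationale="It takes 2 bolts of blue fiber and 2 / 2 = 1 bolt of white fiber, "
              "so 2 + 1 = 3 bolts in total.",
    use_cot=True,
)
print(prompt)
print(target)
```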
1.2. Models & Sizes
- A broad range of model families is used, including T5, PaLM, and U-PaLM. These model families span sizes from Flan-T5-small (80M parameters) to PaLM and U-PaLM (540B parameters).
- The packing technique from T5 is used to combine multiple training examples into a single sequence (a minimal packing sketch is shown below).
- Notably, the amount of compute used for finetuning is only a small fraction of the pre-training compute, as shown in Table 2. For example, only 0.2% of the pre-training compute is used to instruction-finetune Flan-PaLM 540B.
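The packing step can be illustrated with a short sketch. The following is a minimal, assumption-laden version (greedy packing of already-tokenized examples, an illustrative EOS id and sequence length, and simple segment IDs for masking); it is not the exact T5/Flan implementation.

```python
# Minimal packing sketch: concatenate tokenized examples into fixed-length
# sequences, separated by an end-of-sequence token. EOS_ID, MAX_LEN, and the
# segment-ID scheme are illustrative assumptions. Assumes each example is
# shorter than max_len.
EOS_ID = 1
MAX_LEN = 32


def pack_examples(tokenized_examples, max_len=MAX_LEN, eos_id=EOS_ID):
    """Greedily pack lists of token IDs into fixed-length sequences.

    Returns (packed_tokens, segment_ids); segment IDs mark example boundaries
    so an attention mask can prevent attention across packed examples.
    """
    packed, segments = [], []
    tokens, seg_ids, seg = [], [], 1
    for ex in tokenized_examples:
        ex = ex + [eos_id]  # end-of-sequence token separates examples
        if len(tokens) + len(ex) > max_len:  # current sequence is full
            packed.append(tokens + [0] * (max_len - len(tokens)))  # pad with 0
            segments.append(seg_ids + [0] * (max_len - len(seg_ids)))
            tokens, seg_ids, seg = [], [], 1
        tokens.extend(ex)
        seg_ids.extend([seg] * len(ex))
        seg += 1
    if tokens:  # flush the last partially filled sequence
        packed.append(tokens + [0] * (max_len - len(tokens)))
        segments.append(seg_ids + [0] * (max_len - len(seg_ids)))
    return packed, segments


# Usage: three short "examples" packed into one 32-token sequence.
toks, segs = pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
print(toks[0][:12])  # [5, 6, 7, 1, 8, 9, 1, 10, 11, 12, 13, 1]
print(segs[0][:12])  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
```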
2. Results
2.1. Instruction Finetuning
For all three model sizes shown, multi-task instruction finetuning improves performance by a large margin compared to no finetuning. The performance gain ranges from 9.4% to 15.5%.
2.2. CoT + Instruction Finetuning
Including nine datasets with chain-of-thought (CoT) annotations in the finetuning mixture improves reasoning ability. With CoT prompting, Flan-PaLM outperforms PaLM on the four held-out evaluation benchmarks.
2.3. Zero-Shot Performance
On the BBH benchmark of 23 challenging, unseen BIG-Bench tasks, Flan-PaLM models achieve improved zero-shot performance by leveraging CoT reasoning activated by the phrase “let’s think step-by-step” (a minimal prompting sketch is shown below).
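As a concrete illustration of this zero-shot CoT setup, here is a minimal sketch of how such a prompt might be constructed; the prompt layout, the sample question, and the `generate` stub are assumptions for illustration, not the paper's evaluation harness.

```python
# Minimal zero-shot CoT prompting sketch. The trigger phrase follows the
# "let's think step-by-step" idea described above; the question and the
# `generate` stub are illustrative assumptions.
COT_TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Append the CoT trigger so the model writes out intermediate reasoning
    before its final answer; no exemplars are used in the zero-shot setting."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def generate(prompt: str) -> str:
    # Placeholder for a call to Flan-PaLM or another instruction-tuned model.
    raise NotImplementedError


prompt = build_zero_shot_cot_prompt(
    "I have a chair, two potatoes, a cauliflower, and a cabbage. "
    "How many vegetables do I have?"
)
print(prompt)
# The model is expected to first reason ("the potatoes, cauliflower, and
# cabbage are vegetables; the chair is not ...") and then answer 4.
```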
2.4. Other Models
Instruction finetuning improves normalized average performance by a large margin for all model types.
2.5. Zero-Shot Prompting
Compared with Flan-PaLM, the original PaLM struggles in the zero-shot setting: it tends to produce repetitions and fails to follow the given instructions.