Brief Review — Flan-PaLM: Scaling Instruction-Finetuned Language Models
LM Tuning / Prompting
2020 [Human Feedback Model] 2021 [T5+LM, Prompt Tuning] 2022 [GPT-3.5, InstructGPT] [LoRA] [Chain-of-Thought Prompting] [T0] [FLAN] 2023 [LIMA]
- FLAN, a method of finetuning language models on a collection of datasets phrased as instructions, has been shown to improve model performance and generalization to unseen tasks.
- In this paper, FLAN instruction finetuning is explored further with large language models, with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought (CoT) data.
- This paper was published in 2022; 2023 & 2024 results are forecast using Hypermind and Metaculus.
- In this paper, FLAN is used for instruction tuning at larger model and data scales.
1.1. Fine-Tuning Data
- Prior literature has shown that increasing the number of tasks in finetuning with instructions improves generalization to unseen tasks.
In this paper, it is scaled to 1,836 finetuning tasks by combining 4 mixtures from prior work: Muffin, T0-SF, NIV2, and CoT.
- For Muffin, T0-SF, and NIV2, instructional templates are used for each task as given by the creators of the mixtures.
- For CoT, the authors manually write around ten instruction templates for each of the nine datasets.
- To create few-shot templates, a variety of exemplar delimiters (e.g., “Q:”/“A:”) is written and applied randomly at the example level.
An example of the formatting, both with and without exemplars and with and without chain-of-thought (CoT), is shown in Figure 3 above.
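The delimiter randomization described above can be sketched as follows. This is a minimal illustration, not the Flan codebase: the delimiter pairs beyond “Q:”/“A:” and the function name `format_few_shot` are assumptions.

```python
import random

# Hypothetical delimiter pairs; only "Q:"/"A:" is named in the paper.
DELIMITER_PAIRS = [
    ("Q: ", "A: "),
    ("Question: ", "Answer: "),
    ("Input: ", "Output: "),
]

def format_few_shot(exemplars, query):
    """Format (input, target) exemplars plus a final query into one prompt,
    using a delimiter pair drawn at random for this example."""
    q_tag, a_tag = random.choice(DELIMITER_PAIRS)
    parts = [f"{q_tag}{x}\n{a_tag}{y}" for x, y in exemplars]
    # The final query has an empty answer slot for the model to complete.
    parts.append(f"{q_tag}{query}\n{a_tag}")
    return "\n\n".join(parts)

prompt = format_few_shot([("2+2?", "4"), ("3+3?", "6")], "5+5?")
```

Randomizing delimiters per example keeps the model from overfitting to any single prompt format, which helps at inference time when users phrase prompts differently.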
1.2. Models & Sizes
- A broad range of model families is used, including T5, PaLM, and U-PaLM. These families span sizes from Flan-T5-small (80M parameters) to PaLM and U-PaLM (540B parameters).
- The packing technique from T5 is used to combine multiple training examples into a single sequence.
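Packing can be sketched as a greedy fill of pre-tokenized examples into fixed-length sequences. This is a simplified illustration under assumed details (EOS id, greedy strategy); the real T5 implementation also builds attention masks so packed examples cannot attend to each other, which is omitted here.

```python
EOS = 1  # assumed end-of-sequence token id separating packed examples

def pack_examples(examples, max_len):
    """Greedily pack lists of token ids into sequences of at most max_len tokens.
    Simplification: assumes no single example (plus EOS) exceeds max_len."""
    packed, current = [], []
    for ex in examples:
        item = ex + [EOS]
        if current and len(current) + len(item) > max_len:
            packed.append(current)  # current sequence is full; start a new one
            current = []
        current = current + item
    if current:
        packed.append(current)
    return packed

seqs = pack_examples([[5, 6], [7, 8, 9], [10]], max_len=8)
# seqs == [[5, 6, 1, 7, 8, 9, 1], [10, 1]]
```

Packing avoids wasting compute on padding when many examples are much shorter than the model's sequence length.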
2.1. Instruction Finetuning
For all three model sizes shown, multi-task instruction finetuning improves performance by a large margin compared to no finetuning. The performance gain ranges from 9.4% to 15.5%.
2.2. CoT + Instruction Finetuning
Including nine datasets with chain-of-thought (CoT) annotations in the finetuning mixture improves reasoning ability: on CoT prompting, Flan-PaLM outperforms PaLM on the four held-out evaluation benchmarks.
2.3. Zero-Shot Performance
2.4. Other Models
Instruction finetuning improves normalized average performance by a large margin for all model types.