Brief Review — The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Flan 2022: Flan-T5

Sik-Ho Tsang
Dec 11, 2023
Comparing public instruction tuning collections

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Flan 2022, Flan-T5, by Google Research
2023 arXiv v1, Over 170 Citations (Sik-Ho Tsang @ Medium)

Large Language Model (LLM)
2020 … 2023 [GPT-4] [LLaMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2] [Llama 2] [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2]

  • Through careful ablation studies on the Flan Collection of tasks and methods, the authors enable Flan-T5 to outperform prior work by 3 to 17%+ across evaluation settings.


  1. Timeline of Public Instruction Tuning Collections
  2. Flan 2022 Instruction Tuning Experiments

1. Timeline of Public Instruction Tuning Collections

A Timeline of Public Instruction Tuning Collections
  • The First Wave: Since 2020, several instruction tuning task collections have been released in rapid succession.
  • The Second Wave: In 2022, prior resources are expanded by combining more datasets and tasks into one resource.
  • Some new directions: these involve synthetic data generation and human feedback signals on model responses.
  • Tuning with Human Feedback: Instruction tuning on human feedback has demonstrated strong results on open-ended tasks, but at the expense of performance on a wide array of more traditional NLP tasks.

This work focuses specifically on instruction generalization, without human feedback, for two reasons. First, human feedback datasets are far less publicly available than instruction tuning datasets (and may be model specific). Second, by itself, instruction generalization shows great promise.

2. Flan 2022 Instruction Tuning Experiments

2.1. Method Ablations

  • Table 1 summarizes the mean contribution of each method to Held-In, Held-Out, and Chain-of-Thought tasks, measured by removing one method at a time: mixture weight balancing (“- Mixture Balancing”), Chain-of-Thought tasks (“- CoT”), mixed prompt settings (“- Few Shot Templates”), and input inversion (“- Input Inversion”).
  • As compared to T5-XL models trained on alternative instruction tuning collections (and their methods), Flan outperforms in almost every setting.

While previous collections are tuned specifically to zero-shot prompts, Flan-T5 XL is tuned for either zero- or few-shot prompts. This yields performance margins of +3–10% for most of the zero-shot settings, and margins of 8–17% for the few-shot settings.

Most impressively, Flan-T5 XL (3B parameters) outperforms OPT-IML-Max’s much larger 30B (10x) and 175B (58x) models.

2.2. Training with Mixed Prompt Settings

Figure 3 shows (1) adding as little as 5% few-shot training templates can dramatically improve zero-shot performance, and (2) adding 10%+ of zero-shot data improves few-shot performance too.
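
The mixing idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Flan codebase: the function names, the `few_shot_fraction` parameter, and the "Input:/Output:" prompt layout are all hypothetical choices made for this example.

```python
import random

def format_zero_shot(instruction, x):
    """Zero-shot prompt: the instruction followed by the query only."""
    return f"{instruction}\n\nInput: {x}\nOutput:"

def format_few_shot(instruction, exemplars, x):
    """Few-shot prompt: instruction, solved exemplars, then the query."""
    shots = "\n\n".join(f"Input: {ex_x}\nOutput: {ex_y}" for ex_x, ex_y in exemplars)
    return f"{instruction}\n\n{shots}\n\nInput: {x}\nOutput:"

def build_mixed_prompts(instruction, examples, few_shot_fraction=0.5, n_shots=2, seed=0):
    """Render each (x, y) example as a zero-shot or few-shot prompt.

    few_shot_fraction is a hypothetical knob: the probability that an
    example is rendered with a few-shot template instead of a zero-shot one.
    """
    rng = random.Random(seed)
    prompts = []
    for i, (x, y) in enumerate(examples):
        # Exemplars are drawn from the other examples, never the query itself.
        others = examples[:i] + examples[i + 1:]
        if others and rng.random() < few_shot_fraction:
            exemplars = rng.sample(others, min(n_shots, len(others)))
            prompt = format_few_shot(instruction, exemplars, x)
        else:
            prompt = format_zero_shot(instruction, x)
        prompts.append((prompt, y))
    return prompts
```

Setting `few_shot_fraction` to a small value such as 0.05 would correspond to the "as little as 5% few-shot templates" end of the sweep described above.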

2.3. Scaling Small Models to 1.8k+ Tasks

  • T5-LM adapted models (Small, Base, Large, XL, XXL) are fine-tuned on randomly selected task subsets (8, 25, 50, 100, 200, 400, 800, and all 1,873 tasks).

Figure 4 demonstrates that both Held-In and Held-Out tasks appear to benefit from adding hundreds of finetuning tasks.
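
A minimal sketch of how such task subsets could be drawn is shown below. The function name `sample_task_subsets` is hypothetical, and it assumes each subset is sampled independently; the review does not say whether the paper's subsets are nested or independent.

```python
import random

# Subset sizes listed in the experiment above.
SUBSET_SIZES = [8, 25, 50, 100, 200, 400, 800, 1873]

def sample_task_subsets(task_names, sizes=SUBSET_SIZES, seed=0):
    """Return {size: random subset of task names}, one per requested size.

    Assumption: subsets are drawn independently, without replacement.
    Sizes larger than the number of available tasks are skipped.
    """
    rng = random.Random(seed)
    return {k: rng.sample(task_names, k) for k in sizes if k <= len(task_names)}
```

Each model size would then be fine-tuned once per subset and evaluated on the Held-In and Held-Out suites.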

2.4. Task Enrichment with Input Inversion

  • A dataset is usually designed so that, given a question x, the model must produce the answer y. Input inversion instead gives the model the answer y and trains it to generate the question x. This is an easy way to enrich task variety.

Table 1 shows that input inversion is not beneficial for Held-In performance, but is strongly beneficial for Held-Out performance.
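
Input inversion can be sketched in a few lines. The helper names and the wording of the inverted instruction are assumptions for illustration only; the actual Flan templates differ per task.

```python
def invert_example(x, y, inverted_instruction="Write a question that has the following answer."):
    """Swap the roles of input and output for one (question, answer) pair.

    The default instruction text is a hypothetical placeholder.
    """
    source = f"{inverted_instruction}\n\nAnswer: {y}"
    target = x  # the model is now trained to generate the original question
    return source, target

def enrich_with_inversion(pairs):
    """Keep every original (x, y) pair and add its inverted counterpart."""
    enriched = []
    for x, y in pairs:
        enriched.append((x, y))
        enriched.append(invert_example(x, y))
    return enriched
```

For example, the pair ("What is the capital of France?", "Paris") yields one extra training example whose input contains "Paris" and whose target is the original question.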

2.5. Weighted Mixture

  • To converge on a balanced weighting, each task source (Flan 2021, T0-SF, Super-Natural Instructions, Chain-of-Thought, Dialog, and Program Synthesis) is omitted one at a time, and the sources are ranked by their contribution on the MMLU benchmark.

The results suggest that mixture weighting deserves careful attention when optimizing results.
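
The leave-one-out procedure can be sketched as below. This is a schematic under assumptions: `train_and_eval` is a hypothetical callable, supplied by the user, that trains on the given sources and returns a benchmark score such as MMLU accuracy. It is not part of any released Flan tooling.

```python
# Task sources named in the ablation above.
SOURCES = [
    "Flan 2021",
    "T0-SF",
    "Super-Natural Instructions",
    "Chain-of-Thought",
    "Dialog",
    "Program Synthesis",
]

def rank_sources_by_ablation(sources, train_and_eval):
    """Rank sources by how much the score drops when each one is omitted.

    train_and_eval(list_of_sources) -> score is a user-supplied
    (hypothetical) function. A larger drop means a larger contribution.
    """
    full_score = train_and_eval(list(sources))
    drops = {}
    for held_out in sources:
        remaining = [s for s in sources if s != held_out]
        drops[held_out] = full_score - train_and_eval(remaining)
    # Largest drop first: the most beneficial source leads the ranking.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```

The resulting ranking can then inform how much weight each source receives in the final mixture.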

2.6. Instruction Tuning Enhances Single-Task Finetuning

  • Flan 2022 instruction tuning is evaluated as an intermediary step before single target finetuning, to understand if Flan-T5 would serve as a better starting checkpoint for applied practitioners.
  • 3 settings: finetuning T5 directly on the target task as the conventional baseline (blue bars), using Flan-T5 without further finetuning (beige bars), and finetuning Flan-T5 further on the target task (red bars).

For both sets of Held-In and Held-Out tasks examined, finetuning Flan-T5 offers a Pareto improvement over finetuning T5 directly. In some instances, usually where finetuning data is limited for a task, Flan-T5 without further finetuning outperforms T5 with task finetuning.
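
To make the term concrete, a Pareto improvement here means being at least as good on every task and strictly better on at least one. A small sketch of that check follows; the function name and the dictionary layout (task name mapped to score) are assumptions for illustration.

```python
def is_pareto_improvement(candidate, baseline):
    """Return True if candidate is no worse than baseline on every task
    and strictly better on at least one.

    Both arguments map task name -> score (higher is better) and are
    assumed to cover the same tasks.
    """
    tasks = list(baseline)
    no_worse = all(candidate[t] >= baseline[t] for t in tasks)
    strictly_better = any(candidate[t] > baseline[t] for t in tasks)
    return no_worse and strictly_better
```

Applied to per-task accuracies of the red bars (Flan-T5 finetuned) against the blue bars (T5 finetuned), the check would return True if the claim holds across all tasks shown.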

2.7. Faster Convergence & Computational Benefits

As demonstrated in Figure 6, Flan-T5 converges much more quickly than T5 during single target finetuning, as well as peaking at higher accuracies.

Instruction-tuned models offer a promising way to significantly reduce the number of finetuning steps across a wide swathe of tasks, if they are adopted as a new standard starting point for single-task finetuning.
