Review — FLAN: Finetuned Language Models Are Zero-Shot Learners
- Finetuned LAnguage Net (FLAN) is proposed: a language model obtained through instruction tuning, i.e. fine-tuning on a collection of datasets described via instructions.
1.1. Conceptual Idea
- (a) Pretrain — Finetune (BERT, T5): We need many task-specific examples for tuning.
- (b) Prompting (GPT-3): We provide (prompt) examples during inference to improve the performance.
- (c) Instruction Tuning (FLAN): A simple method that combines appealing aspects of both the pretrain–finetune and prompting paradigms by using supervision via finetuning to improve the language model’s responses to inference-time text interactions.
62 text datasets that are publicly available on TensorFlow Datasets, including both language understanding and language generation tasks, are aggregated into a single mixture. Each dataset is categorized into one of 12 task clusters.
- In this work, a dataset D is considered unseen at evaluation time if no datasets from any task cluster that D belongs to were seen during instruction tuning.
- The output space for a given task is either one of several classes (classification) or free text (generation). For classification tasks, an options suffix is included: the token OPTIONS is appended to the end of the input, along with a list of the output classes for that task, as in the example in the figure at the top.
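The held-out rule above can be sketched in a few lines of Python. This is only an illustration: the `DATASET_CLUSTERS` mapping and the `training_datasets` helper are hypothetical names, and the handful of datasets listed stands in for the full 62-dataset mixture.

```python
from typing import Dict, List, Set

# Hypothetical dataset -> task-cluster mapping (illustrative subset only).
# A dataset may belong to more than one cluster, hence the sets.
DATASET_CLUSTERS: Dict[str, Set[str]] = {
    "anli": {"nli"},
    "rte": {"nli"},
    "boolq": {"reading_comprehension"},
    "sst2": {"sentiment"},
    "wmt16_en_de": {"translation"},
}


def training_datasets(eval_dataset: str,
                      clusters: Dict[str, Set[str]]) -> List[str]:
    """Datasets allowed in instruction tuning when `eval_dataset` must stay unseen.

    Every dataset that shares at least one task cluster with the
    evaluation dataset is dropped, not just the evaluation dataset itself.
    """
    held_out = clusters[eval_dataset]
    return sorted(name for name, cs in clusters.items()
                  if not (cs & held_out))


print(training_datasets("rte", DATASET_CLUSTERS))
```

Evaluating on "rte" removes "anli" as well, since both sit in the same cluster; this is what makes the evaluation stricter than simply holding out one dataset.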
1.3. Instruction Templates
- For each dataset, 10 unique templates are manually composed that use natural language instructions to describe the task for that dataset.
Therefore, a pretrained language model is instruction tuned on the mixture of all datasets, with examples in each dataset formatted via a randomly selected instruction template for that dataset.
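A minimal sketch of this formatting step, assuming a small hand-written template list: the template strings, the `format_example` helper and the exact layout of the OPTIONS suffix are my own illustrations, not the paper's actual templates.

```python
import random
from typing import Dict, List, Optional

# Hypothetical instruction templates for an NLI dataset.
# The paper composes 10 per dataset; three are enough for a sketch.
NLI_TEMPLATES: List[str] = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?",
    "Read the following and decide if the hypothesis follows.\n"
    "{premise}\nHypothesis: {hypothesis}",
]


def format_example(fields: Dict[str, str],
                   templates: List[str],
                   options: Optional[List[str]] = None,
                   rng: Optional[random.Random] = None) -> str:
    """Format one example with a randomly selected instruction template.

    For classification tasks, pass `options` to append an OPTIONS suffix
    listing the output classes (the layout here is an assumption).
    """
    rng = rng or random  # fall back to the module-level generator
    prompt = rng.choice(templates).format(**fields)
    if options:
        prompt += "\nOPTIONS:\n" + "\n".join(f"- {o}" for o in options)
    return prompt


example = {"premise": "A dog runs.", "hypothesis": "An animal moves."}
print(format_example(example, NLI_TEMPLATES, options=["yes", "no"]))
```

Because the template is drawn per example, the model sees the same task phrased in several ways during tuning.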
1.4. Some Training Details
- FLAN is the instruction-tuned version of LaMDA-PT. The instruction tuning pipeline mixes all datasets and randomly samples from each dataset as described.
- To balance the different sizes of datasets, the number of training examples per dataset is limited to 30k and the examples-proportional mixing scheme from T5 is followed with a mixing rate maximum of 3k.
- All models are fine-tuned for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer with a learning rate of 3e-5.
- The input and target sequence lengths used in finetuning are 1024 and 256, respectively.
- Packing is used to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token.
- This instruction tuning takes around 60 hours on a TPUv3 with 128 cores.
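To make the dataset balancing concrete, here is a rough sketch of examples-proportional mixing with a rate maximum, in the spirit of the T5 scheme. The `mixing_rates` function and the toy dataset sizes are hypothetical, and how the 30k cap and the 3k maximum interact is my reading of the description above, not code from the paper.

```python
from typing import Dict


def mixing_rates(dataset_sizes: Dict[str, int],
                 example_cap: int = 30_000,
                 rate_max: int = 3_000) -> Dict[str, float]:
    """Sampling probability per dataset under examples-proportional mixing.

    1. Each dataset is first limited to `example_cap` training examples.
    2. Its sampling weight is min(capped size, rate_max), so datasets
       larger than `rate_max` get no extra weight.
    3. Weights are normalized to sum to 1.
    """
    capped = {name: min(size, example_cap)
              for name, size in dataset_sizes.items()}
    weights = {name: min(size, rate_max) for name, size in capped.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Toy sizes (made up): weights become 1000, 3000, 3000.
rates = mixing_rates({"small": 1_000, "medium": 3_000, "large": 500_000})
print(rates)
```

With these toy sizes the weights are 1000, 3000 and 3000, so the 500k-example dataset is sampled no more often than the 3k-example one.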
2.1. Zero-Shot Performance
- The zero-shot performances of GPT-3 175B and GLaM 64B/64E (Du et al., 2021) are used for comparison.
Zero-shot FLAN outperforms zero-shot GLaM on 13 of 19 available datasets and one-shot GLaM on 11 of 19 datasets.
2.2. Ablation Studies
Average performance across the three held-out clusters improves as we add additional clusters and tasks to instruction tuning (with the exception of the sentiment analysis cluster), confirming the benefits of the proposed instruction tuning approach on zero-shot performance on novel tasks.
For the two models on the order of 100B parameters, instruction tuning substantially improves performance on held-out tasks.
For the 8B and smaller models, however, the result is thought-provoking: instruction tuning actually hurts performance on held-out tasks.
- In a no-template setup, only inputs and outputs were given to the model (e.g., for translation the input would be “The dog runs.” and the output would be “Le chien court.”).
- In a dataset-name setup, each input is prepended with the name of the task and dataset (e.g., for translation to French, the input would be “[Translation: WMT’14 to French] The dog runs.”).
- FLAN’s finetuning procedure used natural instructions (e.g., “Please translate this sentence to French: ‘The dog runs.’”).
Both ablation configurations performed substantially worse than FLAN, indicating that training with instructions is crucial for zero-shot performance on unseen tasks.
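The three input formats compared in this ablation can be written as small Python helpers. The function names are hypothetical; the strings follow the translation example given above, with plain ASCII quotes.

```python
def no_template(text: str) -> str:
    """No-template setup: the raw input is passed through unchanged."""
    return text


def dataset_name(text: str, task: str, dataset: str) -> str:
    """Dataset-name setup: prepend the task and dataset name in brackets."""
    return f"[{task}: {dataset}] {text}"


def natural_instruction(text: str, instruction: str) -> str:
    """FLAN-style setup: describe the task in natural language."""
    return f"{instruction} '{text}'"


sentence = "The dog runs."
print(no_template(sentence))
print(dataset_name(sentence, "Translation", "WMT'14 to French"))
print(natural_instruction(sentence,
                          "Please translate this sentence to French:"))
```

Only the third format tells the model what to do in words, which is the ingredient the ablation isolates.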
2.3. Combine With Few-Shot
- Given k few-shot exemplars (x_i, y_i) with i = 1, …, k and a new input x, the instruction format for the few-shot setting is “instruct(x_1) ⨁ y_1 ⨁ instruct(x_2) ⨁ y_2 ⨁ … ⨁ instruct(x_k) ⨁ y_k ⨁ instruct(x)”, where ⨁ denotes string concatenation with a delimiter token inserted in between.
Few-shot exemplars improve the performance on all task clusters, compared with zero-shot FLAN.
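The few-shot format above is just an interleaved concatenation, which a short sketch makes explicit. The `few_shot_prompt` helper, the `instruct` callable and the blank-line delimiter are assumptions for illustration; the text above does not say which delimiter token is used.

```python
from typing import Callable, List, Tuple


def few_shot_prompt(exemplars: List[Tuple[str, str]],
                    x: str,
                    instruct: Callable[[str], str],
                    delimiter: str = "\n\n") -> str:
    """Build instruct(x1) + y1 + ... + instruct(xk) + yk + instruct(x).

    Consecutive pieces are joined with `delimiter`. With an empty
    exemplar list this reduces to the zero-shot prompt instruct(x).
    """
    parts: List[str] = []
    for x_i, y_i in exemplars:
        parts.append(instruct(x_i))
        parts.append(y_i)
    parts.append(instruct(x))
    return delimiter.join(parts)


def translate_instruction(sentence: str) -> str:
    # Hypothetical instruction template for the translation example.
    return f"Please translate this sentence to French: '{sentence}'"


print(few_shot_prompt([("The dog runs.", "Le chien court.")],
                      "The cat sleeps.",
                      translate_instruction))
```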