Brief Review — SUPER-NATURALINSTRUCTIONS: Generalization via Declarative Instructions on 1600+ NLP Tasks

Tk-INSTRUCT, Trained on the Newly Introduced SUPER-NATURALINSTRUCTIONS (SUP-NATINST), Outperforms InstructGPT

Sik-Ho Tsang
4 min read · Sep 4, 2023

SUPER-NATURALINSTRUCTIONS: Generalization via Declarative Instructions on 1600+ NLP Tasks
Tk-INSTRUCT, by Numerous Organizations and Researchers
2022 EMNLP, Over 90 Citations (Sik-Ho Tsang @ Medium)

LM Tuning / Prompting
2020 [Human Feedback Model] 2021 [T5+LM, Prompt Tuning] 2022 [GPT-3.5, InstructGPT] [LoRA] [Chain-of-Thought Prompting] [T0] [FLAN] [UL2R, U-PaLM] [Flan-PaLM] 2023 [LIMA]
==== My Other Paper Readings Are Also Over Here ====

  • SUPER-NATURALINSTRUCTIONS (SUP-NATINST) is introduced, which is a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. The collection covers 76 distinct task types.
  • Tk-INSTRUCT is then built, a Transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). It is found that Tk-INSTRUCT outperforms existing instruction-following models such as InstructGPT by over 9%.

Outline

  1. SUPER-NATURALINSTRUCTIONS (SUP-NATINST)
  2. Tk-INSTRUCT
  3. Results

1. SUPER-NATURALINSTRUCTIONS (SUP-NATINST)

1.1. Format

An example task
  • SUPER-NATURALINSTRUCTIONS is a meta-dataset consisting of a variety of NLP tasks and instructions.
  • All task instructions follow the same uniform schema:
  1. Definition: This is a complete definition of how an input text (e.g., a sentence or a document) is expected to be mapped to an output text.
  2. Positive Examples: samples of inputs and their correct outputs, along with a short explanation for each.
  3. Negative Examples: samples of inputs and their incorrect/invalid outputs, along with a short explanation for each.
  • A unified format is used to organize the instances of all tasks. More precisely, each instance consists of a textual input and a list of acceptable textual outputs. The number of instances in each task is capped at 6.5K to avoid an imbalance of instances between tasks. A minimal sketch of this format is shown below.
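
As a concrete illustration, here is a minimal sketch of how a task in this uniform format could be represented in code. The class and field names are illustrative assumptions, not the exact keys used in the released task files.

```python
# Minimal sketch of the uniform task schema described above.
# Class and field names are illustrative, not the exact keys of the
# released SUP-NATINST task files.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    input: str         # sample input text
    output: str        # a correct (positive) or incorrect (negative) output
    explanation: str   # short explanation of why the output is (in)valid


@dataclass
class Instance:
    input: str           # textual input of one evaluation instance
    outputs: List[str]   # list of acceptable textual outputs


@dataclass
class Task:
    definition: str                    # how inputs map to outputs
    positive_examples: List[Example]   # ~2.8 per task on average
    negative_examples: List[Example]   # ~2.4 per task on average
    instances: List[Instance]          # capped at 6.5K per task

    def capped_instances(self, limit: int = 6500) -> List[Instance]:
        # Enforce the 6.5K cap to avoid imbalance between tasks.
        return self.instances[:limit]
```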

1.2. Data Collection

Statistics
  • The benchmark was collected through a large community effort on GitHub. There are 88 contributors in total.
  • The dataset includes 1,616 tasks and 5M instances.
  • On average, each instruction is paired with 2.8 positive and 2.4 negative examples.
  • The average definition length is 56.6 words.
Dataset Comparisons

It can be seen that SUP-NATINST is much larger and more diverse than prior instruction benchmarks.

2. Tk-INSTRUCT

  • Each task t is defined via its natural-language instruction I_t, and each task has a set of input/output instances (X_t, Y_t).
  • A model M is expected to produce the output y, given the input x and the task instruction I_t: M(I_t, x) = y, for (x, y) ∈ (X_t, Y_t).
  • In particular, to test generalization to unseen tasks, the model M is evaluated on tasks that are not observed during training. A minimal sketch of assembling such an instruction-plus-input prompt is shown below.
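
To make the mapping M(I_t, x) concrete, here is a minimal sketch of how a task definition, a few positive examples, and a new input could be flattened into a single text prompt. The template wording below is an assumption for illustration, not necessarily the authors' exact encoding.

```python
# Minimal prompt-assembly sketch for M(I_t, x); the template wording is an
# assumption, not necessarily the authors' exact encoding.
def build_prompt(definition, positive_examples, x, k=2):
    """Flatten the task definition, k positive examples, and input x
    into one text prompt for a seq2seq model."""
    parts = [f"Definition: {definition}", ""]
    for i, ex in enumerate(positive_examples[:k], start=1):
        parts.append(f"Positive Example {i}:")
        parts.append(f"Input: {ex['input']}")
        parts.append(f"Output: {ex['output']}")
        parts.append("")
    parts.append("Now complete the following example.")
    parts.append(f"Input: {x}")
    parts.append("Output:")
    return "\n".join(parts)


# Example usage with a toy sentiment task.
prompt = build_prompt(
    definition="Classify the sentiment of the given review as positive or negative.",
    positive_examples=[{"input": "Great movie, loved it!", "output": "positive"}],
    x="The plot was dull and the acting was worse.",
)
print(prompt)
```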

Tk-INSTRUCT is introduced, a model that is meta-trained on SUP-NATINST for solving tasks given their in-context instructions.

  • The Tk-INSTRUCT model is built on T5.
  • The multilingual variant mTk-INSTRUCT is built on mT5. A hedged loading/inference sketch is shown below.
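
As a usage sketch, the released checkpoints can be loaded with Hugging Face Transformers as ordinary T5-style seq2seq models. The checkpoint identifier below is an assumption; consult the authors' release for the exact names.

```python
# Hedged sketch of running inference with a released Tk-INSTRUCT checkpoint.
# The checkpoint id is an assumption; check the official release for exact names.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "allenai/tk-instruct-3b-def-pos"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A prompt in the "definition + positive examples + new input" style.
prompt = (
    "Definition: Classify the sentiment of the given review as positive or negative.\n"
    "Now complete the following example.\n"
    "Input: The plot was dull and the acting was worse.\n"
    "Output:"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```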

3. Results

3.1. ROUGE-L

ROUGE-L on Unseen Tasks

Instruction-tuning enables stronger generalization to unseen tasks.

  • Generally, instruction-tuned models perform better than their untuned LM counterparts (Tk-INSTRUCT vs. T5+LM, InstructGPT vs. GPT-3) and heuristic baselines. This indicates that models do learn to follow instructions by fine-tuning on instruction data, and that this ability generalizes to new instructions for unseen tasks.

Tk-INSTRUCT outperforms InstructGPT.
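
For reference, ROUGE-L is an LCS-based F-measure between a prediction and a reference; since each instance may list several acceptable outputs, the maximum over references can be taken. The paper relies on an existing ROUGE implementation, so the following is only an illustrative sketch.

```python
# Illustrative sketch of ROUGE-L (LCS-based F-measure); details may differ
# from the official ROUGE implementation used in the paper.
def lcs_length(a, b):
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(prediction, references):
    # Take the best score over the list of acceptable outputs for an instance.
    pred = prediction.split()
    best = 0.0
    for ref_text in references:
        ref = ref_text.split()
        lcs = lcs_length(pred, ref)
        if lcs == 0:
            continue
        precision, recall = lcs / len(pred), lcs / len(ref)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


print(rouge_l("the cat sat on the mat", ["a cat sat on a mat"]))  # ~0.67
```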

3.2. Human Evaluation

  • There is a sizable gap between the generalization of instruction-based models and the supervised training approach, leaving more room for improvement.
Human Evaluation
  • Crowdworkers are asked to indicate, for each instance, whether they prefer the model's predicted answer or the ground-truth output, with ties allowed.

The results of human evaluation align quite well with the automatic metrics and confirm the human-perceived quality of the proposed models.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.