Brief Review — Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-Tuning, by Stanford University
2021 ACL IJCNLP, Over 2600 Citations (Sik-Ho Tsang @ Medium)
- Prefix-tuning is proposed, which is a lightweight alternative to fine-tuning for natural language generation tasks.
- Prefix-tuning keeps the language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, called the prefix.
Outline
- Prefix-Tuning
- Results
1. Prefix-Tuning
- Top: Fine-tuning a full language model (LM) is expensive.
- Bottom: Intuitively, if we want the LM to generate a word (e.g., Obama), we can prepend its common collocations as context (e.g., Barack).
- If we keep the LM parameters frozen, the number of parameters that need to be tuned is very small compared to full fine-tuning.
Prefix-tuning prepends a prefix for an autoregressive LM to obtain z = [PREFIX; x; y], or prepends prefixes for both encoder and decoder to obtain z = [PREFIX; x; PREFIX′; y].
The language model parameters φ are fixed and the prefix parameters θ are the only trainable parameters.
- The prefix activations are always in the left context and will therefore affect any activations to the right.
- Directly optimizing the prefix parameters θ is unstable. A reparametrization is therefore introduced for training: the prefix is generated from a smaller matrix passed through an MLP. Once training is complete, these reparametrization parameters can be dropped, and only the prefix (Pθ) needs to be saved (see the code sketch below).
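Below is a minimal sketch of how this setup can be implemented, assuming PyTorch and a HuggingFace Transformers GPT-2 that still accepts the legacy tuple `past_key_values` format; the class name `PrefixTuning`, the prefix length, and the MLP width are illustrative choices, not taken from the authors' released code. The frozen LM receives the trainable prefix as per-layer key/value activations, and `export_prefix` shows how the reparametrization MLP can be discarded once training is done.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class PrefixTuning(nn.Module):
    """Frozen GPT-2 plus a trainable prefix injected as per-layer key/value activations."""

    def __init__(self, model_name="gpt2", prefix_len=10, reparam_hidden=512):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(model_name)
        for p in self.lm.parameters():           # freeze the LM parameters (phi)
            p.requires_grad = False
        cfg = self.lm.config
        self.prefix_len = prefix_len
        self.n_layer, self.n_head = cfg.n_layer, cfg.n_head
        self.head_dim = cfg.n_embd // cfg.n_head
        # Reparametrization: a small embedding matrix fed through an MLP produces
        # the per-layer key/value prefix (the trainable parameters theta).
        self.prefix_embedding = nn.Parameter(torch.randn(prefix_len, cfg.n_embd))
        self.reparam_mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, reparam_hidden),
            nn.Tanh(),
            nn.Linear(reparam_hidden, 2 * cfg.n_layer * cfg.n_embd),
        )

    def _prefix_past(self, batch_size):
        # (prefix_len, 2 * n_layer * n_embd) -> n_layer tuples of (key, value),
        # each of shape (batch, n_head, prefix_len, head_dim).
        kv = self.reparam_mlp(self.prefix_embedding)
        kv = kv.view(self.prefix_len, 2 * self.n_layer, self.n_head, self.head_dim)
        kv = kv.permute(1, 2, 0, 3).unsqueeze(1).expand(-1, batch_size, -1, -1, -1)
        return tuple((kv[2 * i], kv[2 * i + 1]) for i in range(self.n_layer))

    def forward(self, input_ids, labels=None):
        batch_size, seq_len = input_ids.shape
        past = self._prefix_past(batch_size)
        # The attention mask must also cover the prefix positions on the left.
        mask = torch.ones(batch_size, self.prefix_len + seq_len, device=input_ids.device)
        return self.lm(input_ids=input_ids, attention_mask=mask,
                       past_key_values=past, labels=labels)

    @torch.no_grad()
    def export_prefix(self):
        # After training, only the computed prefix activations need to be stored;
        # the reparametrization MLP can be discarded.
        return self.reparam_mlp(self.prefix_embedding).detach().clone()


if __name__ == "__main__":
    model = PrefixTuning()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

Running the script prints how few parameters are trainable relative to the full LM; note that this count still includes the reparametrization MLP, which is only needed during training and is dropped before the prefix is saved.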
2. Results
- Prefix-tuning is significantly better than ADAPTER (0.1%), attaining a 4.1 BLEU improvement per dataset on average.
- Left: 8 examples generated by both prefix-tuning and fine-tuning models trained on different data levels.
- Right: Prefix-tuning outperforms fine-tuning in low-data regimes by 2.9 BLEU on average, in addition to requiring far fewer parameters, but the gap narrows as the dataset size increases.