Brief Review — The Power of Scale for Parameter-Efficient Prompt Tuning
Prompt Tuning, T5+LM, by Google Research
2021 EMNLP, Over 1100 Citations (Sik-Ho Tsang @ Medium)
- Prompt Tuning, a simple yet effective mechanism, is proposed for learning “soft prompts” to condition frozen language models to perform specific downstream tasks.
- Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples.
1. Prompt Tuning
- Model tuning fine-tunes all of the model's parameters for each downstream task, which is expensive and requires storing a separate copy of the model per task — especially costly when the model is a large language model.
1.1. Prompt Tuning
- Suppose classification is cast as conditional generation Prθ(Y|X), where X is a series of input tokens, Y is the sequence of tokens representing the class label, and θ are the model parameters.
- Normally, prompting is done by prepending a series of tokens, P, to the input X, such that the model maximizes the likelihood of the correct Y, Prθ(Y|[P; X]), while θ stays fixed.
- Prompt tuning can be thought of as using a fixed prompt of special tokens, where only the embeddings of these prompt tokens, θp, can be updated. The new conditional generation is Prθ,θp(Y|[P; X]).
- Soft prompts are represented as a parameter Pe. The prompt is concatenated to the embedded input, forming a single matrix [Pe; Xe], which then flows through the encoder-decoder as normal.
- The model is trained to maximize the probability of Y, but only the prompt parameters Pe are updated; the pretrained model parameters θ remain frozen.
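The input construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the vocabulary size, embedding dimension, and prompt length below are made-up values, and `build_input` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, p = 32, 8, 5                           # assumed: vocab size, embed dim, prompt length
embedding_table = rng.normal(size=(V, d))    # frozen pretrained embeddings (part of theta)
prompt_embeds = rng.normal(size=(p, d))      # trainable soft prompt Pe (theta_p)

def build_input(token_ids):
    """Embed the tokens to Xe and prepend the soft prompt, giving [Pe; Xe]."""
    x_e = embedding_table[token_ids]                      # (n, d)
    return np.concatenate([prompt_embeds, x_e], axis=0)   # (p + n, d)

tokens = np.array([3, 17, 9])
inp = build_input(tokens)
print(inp.shape)  # (8, 8): 5 prompt vectors followed by 3 token embeddings
```

During training, gradients would flow back only into `prompt_embeds`; `embedding_table` and the rest of the frozen model never change.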
1.2. LM Adaptation
- T5 is pre-trained on a span corruption objective, in which the target output consists of all the masked spans, separated by sentinel tokens, plus a final sentinel.
- With LM Adaptation, T5's self-supervised training continues for a small number of additional steps (up to 100K), but using the autoregressive "LM" objective, so that the model learns to produce natural text without sentinels at the output.
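The difference between the two objectives is easiest to see on a concrete example. The strings below are an illustration (not T5 code): the sentence is made up, and the sentinel names follow T5's `<extra_id_N>` convention.

```python
text = "Thank you for inviting me to your party last week"

# Span corruption: masked spans are replaced by sentinels in the input;
# the target lists the dropped spans, each introduced by its sentinel,
# plus a final closing sentinel.
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week"
corrupted_target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

# LM objective used during adaptation: given a natural-text prefix,
# produce the natural-text continuation -- no sentinels anywhere.
lm_prefix = "Thank you for inviting"
lm_continuation = "me to your party last week"
```

A model that has only ever emitted sentinel-delimited spans is a poor fit for prompting, which is why the extra LM-objective steps help before prompt tuning.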
Prompt tuning becomes more competitive with model tuning as scale increases. At the XXL size (11 billion parameters), prompt tuning matches even the stronger multi-task model tuning baseline, despite having over 20,000 times fewer task-specific parameters.
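The "over 20,000 times fewer task-specific parameters" claim can be sanity-checked with back-of-the-envelope arithmetic, assuming a prompt of 100 tokens and T5-XXL's embedding dimension of 4096 (both sizes are assumptions for illustration):

```python
model_params = 11_000_000_000   # T5-XXL, ~11B parameters (full model tuning)
prompt_params = 100 * 4096      # prompt length x embedding dim = 409,600

# Only the prompt embeddings are task-specific under prompt tuning.
ratio = model_params / prompt_params
print(f"{ratio:,.0f}x fewer task-specific parameters")  # ~26,855x
```

So each additional task costs only a few hundred thousand prompt parameters instead of a full 11B-parameter model copy.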