Brief Review — UL2: Unifying Language Learning Paradigms
Mixture-of-Denoisers (MoD), Outperforms T5 and/or GPT-Like Models on 50 NLP Tasks, Outperforms LaMDA & PaLM.
UL2: Unifying Language Learning Paradigms,
UL2, by Google Research, Brain Team,
2023 ICLR (Sik-Ho Tsang @ Medium)
- Mixture-of-Denoisers (MoD) is proposed, a pre-training objective that combines diverse pre-training paradigms.
- Mode switching is suggested, wherein downstream fine-tuning is associated with specific pre-training schemes.
- By scaling up to 20B parameters, UL2 20B outperforms LaMDA 137B.
Outline
- UL2
- Results
1. UL2
1.1. Unified Perspective
- Language models: use all previous time-steps as inputs to the model to predict the next token, which is the target.
- Span corruption: the model leverages all uncorrupted tokens from the past and future as inputs for predicting the corrupted span (targets).
- Prefix-LMs: LMs that use past tokens as inputs, but consume the prefix (the inputs) bidirectionally.
Under this unified perspective, all of these paradigms approximately reduce to variants of a single pre-training objective (see the toy example below).
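In this toy illustration (the token sequence and sentinel format are made up for the example), the three paradigms differ mainly in how a sequence is split into inputs and targets:

```python
# Toy sequence illustrating how each paradigm splits a sequence into
# inputs and targets (the sentinel format is only for illustration).
tokens = ["A", "B", "C", "D", "E", "F"]

# Language model: all previous tokens are inputs, the next token is the target.
lm_inputs, lm_targets = tokens[:-1], tokens[1:]

# Span corruption: a span is masked; the remaining past and future tokens are
# the inputs, and the masked span is the target.
sc_inputs = ["A", "B", "<extra_id_0>", "E", "F"]
sc_targets = ["<extra_id_0>", "C", "D"]

# Prefix-LM: a bidirectionally consumed prefix is the input, the rest is the target.
pl_inputs, pl_targets = tokens[:3], tokens[3:]
```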
1.2. Denoising Task
The inputs and targets of the denoising tasks are generated by a SPANCORRUPT function that is parameterized by three values (μ, r, n), where μ is the mean span length, r is the corruption rate, and n is the number of corrupted spans. Note that n may be a function of the input length L and the mean span length μ, e.g. L/μ, but in some cases a fixed value of n can be used.
- Given an input text, SPANCORRUPT corrupts spans whose lengths are drawn from a (normal or uniform) distribution with mean μ.
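A minimal Python sketch of such a corruption function is given below; the sampling details, sentinel format, and function name are assumptions for illustration, not the authors' exact SPANCORRUPT implementation.

```python
import random

def span_corrupt(tokens, mu=3, r=0.15, n=None):
    """Corrupt spans of `tokens` with mean span length mu and corruption rate r.

    Returns (inputs, targets): corrupted spans in the inputs are replaced by
    sentinel markers, and the targets list each sentinel followed by the
    tokens it replaced. Illustrative sketch only.
    """
    L = len(tokens)
    if n is None:
        n = max(1, int(round(L * r / mu)))  # n as a function of L and mu

    # Sample span lengths roughly around the mean mu (uniform here;
    # a normal distribution could be used instead).
    span_lengths = [random.randint(1, 2 * mu - 1) for _ in range(n)]

    # Pick start positions for the spans (overlapping spans simply merge).
    corrupted = set()
    for length in span_lengths:
        start = random.randint(0, max(0, L - length))
        corrupted.update(range(start, min(L, start + length)))

    inputs, targets = [], []
    sentinel = 0
    i = 0
    while i < L:
        if i in corrupted:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < L and i in corrupted:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

# Example: corrupt ~15% of a toy sentence with mean span length 3.
print(span_corrupt("the quick brown fox jumps over the lazy dog".split()))
```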
1.3. Mixture of Denoisers (MoD)
1.3.1. R-Denoiser
The regular denoising is the standard span corruption introduced in T5 that uses a range of 2 to 5 tokens as the span length, which masks about 15% of input tokens.
- These spans are short and potentially useful to acquire knowledge instead of learning to generate fluent text.
1.3.2. S-Denoiser
The input sequence is simply partitioned into two sub-sequences of tokens as context and target.
- Similar to the Prefix-LM setup, the context (prefix) retains a bidirectional receptive field.
1.3.3. X-Denoiser
An extreme version of denoising where the model must recover a large part of the input, given a small to moderate part of it.
- This simulates a situation where a model needs to generate a long target from a memory with relatively limited information.
1.3.4. MoD Configuration
The final objective is a mixture of 7 denoisers that are configured as above.
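As a rough sketch (assuming the seven denoiser settings reported in the paper and uniform sampling across them), the MoD configuration could be written down as follows; treat the exact values as illustrative rather than a definitive reproduction.

```python
import random

# Sketch of the Mixture-of-Denoisers configuration.
# Each entry: (paradigm token, mean span length mu, corruption rate r).
# For the S-denoiser, mu is unused: the sequence is simply split into a
# context prefix and a target suffix (here, roughly 25% of tokens as target).
MOD_CONFIG = [
    ("[R]", 3, 0.15),
    ("[R]", 8, 0.15),
    ("[S]", None, 0.25),
    ("[X]", 3, 0.50),
    ("[X]", 8, 0.50),
    ("[X]", 64, 0.15),
    ("[X]", 64, 0.50),
]

def sample_denoiser(rng: random.Random):
    """Pick one denoiser per pre-training example (uniform mixing assumed)."""
    return rng.choice(MOD_CONFIG)

mode_token, mu, r = sample_denoiser(random.Random(0))
```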
1.4. Mode Switching
During pre-training, the model is fed with an extra paradigm token, i.e., {[R], [S], [X]} that helps the model switch gears and operate on a mode that is more suitable for the given task.
- For fine-tuning and downstream few-shot learning, to trigger the model to learn better solutions, a paradigm token is also added according to the setup and requirements of the downstream task. Mode switching in fact binds downstream behavior to one of the modes used during upstream training.
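A minimal sketch of mode switching, under the assumption that the paradigm token is simply prepended to the input text (the exact token format and placement are assumptions):

```python
def add_mode_token(input_text: str, mode: str) -> str:
    """Prepend a paradigm token so the model operates in the matching mode.

    Used during pre-training (matching the sampled denoiser) and again when
    fine-tuning or prompting a downstream task.
    """
    assert mode in ("[R]", "[S]", "[X]")
    return f"{mode} {input_text}"

# e.g., prompting in the S (sequential / Prefix-LM-like) mode for generation:
prompt = add_mode_token("summarize: The quick brown fox ...", "[S]")
```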
1.5. Model Architecture
- UL2 adopts a pretty standard vanilla T5 Transformer.
Both UL2 decoder-only and UL2 encoder-decoder models are constructed.
2. Results
2.1. Metric
- To enable comparison between models from this perspective, an aggregate performance score is needed. However, the metrics of the included tasks differ widely in nature (for example, F1 versus perplexity).
To address this, the normalized relative gain with respect to a baseline is used as an overall metric.
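As a hedged sketch of such an aggregate score (the paper's exact normalization may differ), the per-task relative gain over the baseline can be averaged across tasks:

```python
def normalized_relative_gain(model_scores, baseline_scores):
    """Average per-task relative gain of a model over a baseline, in percent.

    Both arguments map task name -> score, where higher is assumed better
    (metrics such as perplexity would need to be inverted or negated first).
    Sketch only; the paper's exact normalization may differ.
    """
    gains = [
        (model_scores[t] - baseline_scores[t]) / abs(baseline_scores[t])
        for t in baseline_scores
    ]
    return 100.0 * sum(gains) / len(gains)

# Example: +5% on task A and -2% on task B gives an overall gain of +1.5%.
print(normalized_relative_gain({"A": 84.0, "B": 49.0}, {"A": 80.0, "B": 50.0}))
```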
2.2. Ablation Studies
- For pre-training objectives, UL2 compares with the following pre-training baselines: (1) Causal Language Model, (2) Prefix Language model, (3) Span Corruption (SC), (4) Span Corruption + LM (SCLM), and (5) UniLM.
- All models are comparable in terms of computational costs, i.e., FLOPs (EncDec models are 300M and Dec models are 150M parameters).
The UL2 objective obtains the best or comparable results.
- When using T5 as the reference baseline, with the exception of the UL2 decoder, none of the pre-trained decoder models outperforms T5. Additionally, there is a 10% to 30% degradation in overall relative performance. The best decoder baseline here is the Prefix-LM decoder model, which is about 10% worse than the T5 baseline.
Using the UL2 objective pushes the UL2 decoder to outperform the T5 encoder-decoder setup by +14.6%. That said, the UL2 decoder does not outperform the UL2 encoder-decoder.
2.3. Scaling to 20B Parameters
2.3.1. Zero-Shot on SuperGLUE
- A 20B UL2 model is trained.
UL2 20B outperforms GPT-3 and other compute-matched models on zero-shot NLU.
2.3.2. One-Shot Summarization
The performance of UL2 20B is about 3× that of the LM-adapted T5 XXL model. Moreover, UL2 20B outperforms LaMDA 137B and performs better than PaLM 8B, which is approximately compute-matched with UL2.
- The best results, however, still come from the much larger 540B and 62B PaLM models.
2.3.3. Chain-of-Thought Prompting
UL2 20B is capable of performing chain-of-thought reasoning (Wei et al., 2022b). Whereas most prior successes with chain-of-thought have been shown on large, non-public models, UL2 20B is comparatively smaller and publicly available.
- Hope I can read about chain-of-thought reasoning (Wei et al., 2022b) later.