Review — XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet, Reducing the Pretrain-Finetune Discrepancy
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet, by Carnegie Mellon University and Google AI Brain Team
2019 NeurIPS, Over 4900 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, BERT, Transformer
- BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
- XLNet, a generalized autoregressive pretraining method, is proposed:
- It enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
- It overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, it integrates ideas from Transformer-XL.
Outline
- Motivations
- XLNet: Permutation Language Modeling
- XLNet: Partial Prediction
- XLNet: Incorporating Ideas from Transformer-XL
- Experimental Results
1. Motivations
1.1. Preliminaries
- AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model, such as GPT-1, GPT-2, GPT-3.
- Given a text sequence x = (x1, …, xT), AR language modeling factorizes the likelihood either into a forward product or into a backward product (both are written out after this list).
- A parametric model (e.g. a neural network) is trained to model each conditional distribution.
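For reference, the two factorizations referred to above, written out in the paper's notation:

```latex
% Forward and backward autoregressive factorizations of the likelihood
\[ p(\mathbf{x}) \;=\; \prod_{t=1}^{T} p(x_t \mid \mathbf{x}_{<t})
   \qquad \text{(forward)}, \qquad
   p(\mathbf{x}) \;=\; \prod_{t=T}^{1} p(x_t \mid \mathbf{x}_{>t})
   \qquad \text{(backward)}. \]
```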
1.2. Autoregressive (AR) Model Problem
- An AR language model is only trained to encode a uni-directional context (either forward or backward). AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization (the objective is written out after this list), where hθ(x1:t-1) is a context representation produced by neural models such as RNNs or Transformers, and e(x) denotes the embedding of x.
- It is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information.
- This results in a gap between AR language modeling and effective pretraining.
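The forward AR objective referenced above, written out in the paper's notation (hθ and e(x) as defined in the list):

```latex
% Forward AR pretraining objective: each conditional is a Softmax over the vocabulary
\[ \max_{\theta}\; \log p_{\theta}(\mathbf{x})
   \;=\; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
   \;=\; \sum_{t=1}^{T} \log
   \frac{\exp\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x_t)\big)}
        {\sum_{x'} \exp\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x')\big)} \]
```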
1.3. Autoencoding (AE) Model Problem
- AE models, such as BERT, randomly mask some tokens of x to construct a corrupted version x̂; the training objective is to reconstruct the masked tokens x̄ from x̂ (written out after this list), where mt = 1 indicates that xt is masked, and Hθ is a Transformer.
- The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy.
- Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling.
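BERT's reconstruction objective referenced above, written out in the paper's notation (the ≈ comes from the independence assumption over masked tokens mentioned in the list):

```latex
% BERT's masked-token reconstruction objective; m_t = 1 marks a masked token
\[ \max_{\theta}\; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
   \;\approx\; \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
   \;=\; \sum_{t=1}^{T} m_t \log
   \frac{\exp\big(H_{\theta}(\hat{\mathbf{x}})_t^{\top} e(x_t)\big)}
        {\sum_{x'} \exp\big(H_{\theta}(\hat{\mathbf{x}})_t^{\top} e(x')\big)} \]
```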
A natural question to ask is whether there exists a pretraining objective that brings the advantages of both while avoiding their weaknesses.
2. XLNet: Permutation Language Modeling
2.1. Permutations
- For a sequence x of length T, there are T! different orders to perform a valid autoregressive factorization. Intuitively, if model parameters are shared across all factorization orders, in expectation, the model will learn to gather information from all positions on both sides.
- Let ZT be the set of all possible permutations of the length-T index sequence [1, 2, …, T]. The Permutation Language Modeling objective maximizes, in expectation over factorization orders z sampled from ZT, the log-likelihood of each token given the tokens that precede it in the sampled order (written out after this list).
- Note that the proposed objective only permutes the factorization order, NOT the sequence order.
- The paper's illustration (not reproduced here) shows an example of predicting the token x3 given the same input sequence x but under different factorization orders.
- The standard Softmax formulation (written out after this list) cannot be used, because the representation hθ(xz<t) does not depend on the target position zt; the model would predict the same distribution no matter which position it is asked to predict.
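The permutation language modeling objective and the standard (position-unaware) Softmax discussed in this subsection, written out in the paper's notation:

```latex
% Permutation language modeling objective: expectation over factorization orders z ~ Z_T
\[ \max_{\theta}\;
   \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
   \Big[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \Big] \]

% Standard Softmax: h_theta(x_{z<t}) is the same whichever target position is predicted
\[ p_{\theta}\big(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\big)
   \;=\; \frac{\exp\big(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\big)}
              {\sum_{x'} \exp\big(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\big)} \]
```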
2.2. Content and Query Representations
- To avoid this problem, XLNet proposes to re-parameterize the next-token distribution to be target position aware (written out after this list), where gθ(xz<t, zt) denotes a new type of representation which additionally takes the target position zt as input.
- Thus, two sets of hidden representations are used instead of one:
- The content representation hθ(xz≤t), or abbreviated as hzt, which serves a similar role to the standard hidden states in Transformer.
- The query representation gθ(xz<t, zt), or abbreviated as gzt, which only has access to the contextual information xz<t and the position zt, but not the content xzt.
- The first-layer query stream is initialized with a trainable vector, i.e. gi(0) = w, while the content stream is set to the corresponding word embedding, i.e. hi(0) = e(xi).
- For each self-attention layer m = 1, …, M, the two streams of representations are schematically updated with a shared set of parameters (the update rules are written out after this list), where Q, K, V denote the query, key, and value in an attention operation.
- During finetuning, we can simply drop the query stream and use the content stream as a normal Transformer(-XL).
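The target-position-aware distribution and the two-stream update rules referenced in this subsection, written out in the paper's notation:

```latex
% Target-position-aware re-parameterization using the query representation g_theta
\[ p_{\theta}\big(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\big)
   \;=\; \frac{\exp\big(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\big)}
              {\sum_{x'} \exp\big(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\big)} \]

% Two-stream self-attention updates for layer m = 1, ..., M (shared parameters theta)
\[ g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = g_{z_t}^{(m-1)},\; KV = \mathbf{h}_{\mathbf{z}_{<t}}^{(m-1)};\, \theta\big)
   \quad \text{(query stream: uses } z_t \text{ but not } x_{z_t}) \]
\[ h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = h_{z_t}^{(m-1)},\; KV = \mathbf{h}_{\mathbf{z}_{\le t}}^{(m-1)};\, \theta\big)
   \quad \text{(content stream: uses both } z_t \text{ and } x_{z_t}) \]
```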
3. XLNet: Partial Prediction
- To reduce the optimization difficulty, only the last tokens in a factorization order are predicted.
- Formally, z is split into a non-target subsequence z≤c and a target subsequence z>c, where c is the cutting point. The objective is to maximize the log-likelihood of the target subsequence conditioned on the non-target subsequence, i.e. Ez~ZT [log pθ(xz>c | xz≤c)].
- A hyperparameter K is introduced to adjust how many tokens are predicted: c is chosen such that |z| / (|z| − c) ≈ K, so roughly 1/K of the tokens in the sequence are selected as prediction targets (see the sketch below).
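As a concrete illustration, here is a minimal sketch of my own (not the authors' implementation; the rounding rule and variable names are assumptions) of how the cutting point c could be derived from K:

```python
import random

def partial_prediction_split(seq_len, K, seed=0):
    """Sample a factorization order z over token positions and keep only the
    last ~1/K of it as prediction targets, so that |z| / (|z| - c) is roughly K."""
    rng = random.Random(seed)
    z = list(range(seq_len))
    rng.shuffle(z)                      # factorization order; the token sequence itself is untouched
    num_targets = max(1, round(seq_len / K))
    c = seq_len - num_targets           # cutting point: z[:c] is context-only, z[c:] is predicted
    return z[:c], z[c:]

non_target, target = partial_prediction_split(seq_len=8, K=4)
print("non-target subsequence z<=c:", non_target)
print("target subsequence     z>c :", target)
```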
4. XLNet: Incorporating Ideas from Transformer-XL
- Two important techniques in Transformer-XL are integrated, namely the relative positional encoding scheme and the segment recurrence mechanism.
- The “mem” blocks in the paper's figures correspond to the segment recurrence mechanism of Transformer-XL: hidden states of the previous segment are cached and reused, so that long sequences can be handled. (Please feel free to read Transformer-XL if interested.)
- For example, when the factorization order is [3, 2, 4, 1], the content hidden state of a position is estimated using only the hidden states of the positions that come earlier in this order (plus itself and the cached memory), as illustrated in the sketch below.
- The query representation is updated similarly, except that a position cannot access its own content.
- (This section has too many details, so I don’t go deep here. Please feel free to read the paper directly.)
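To make the [3, 2, 4, 1] example concrete, here is a minimal sketch (my own illustration, not the released XLNet code) that builds the content-stream and query-stream attention masks implied by that factorization order:

```python
def permutation_masks(order):
    """Build content/query attention masks for a factorization order.

    `order` lists 1-based token positions in prediction order, e.g. [3, 2, 4, 1].
    Content stream: a position may attend to positions at or before it in the order (x_{z<=t}).
    Query stream:   a position may attend only to positions strictly before it (x_{z<t}),
                    i.e. it must not see its own content.
    Row i / column j of each mask correspond to token positions i+1 and j+1.
    """
    T = len(order)
    rank = {pos: t for t, pos in enumerate(order)}   # step at which each position is predicted
    content = [[rank[j + 1] <= rank[i + 1] for j in range(T)] for i in range(T)]
    query = [[rank[j + 1] < rank[i + 1] for j in range(T)] for i in range(T)]
    return content, query

content_mask, query_mask = permutation_masks([3, 2, 4, 1])
for name, mask in [("content stream", content_mask), ("query stream", query_mask)]:
    print(name)
    for row in mask:
        print(["x" if allowed else "." for allowed in row])
```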
5. Experimental Results
5.1. Fair Comparison with BERT
- All models are trained using the same data and hyperparameters as in BERT. The best of three BERT variants is used for comparison, i.e., the original BERT, BERT with whole word masking, and BERT without next sentence prediction.
Trained on the same data with an almost identical training recipe, XLNet outperforms BERT by a sizable margin on all the considered datasets.
5.2. SOTA Comparisons Using Multiple Datasets
- Numerous datasets, such as SQuAD, IMDB, and GLUE, are evaluated.
Reference
[2019 NeurIPS] [XLNet]
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Language Model
2007 … 2019 [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2] [DistilBERT] [MT-DNN] [Sparse Transformer] [SuperGLUE] [FAIRSEQ] [XLNet] 2020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT]