# Review — XLNet: Generalized Autoregressive Pretraining for Language Understanding

## XLNet, Reduce Pretrain-Finetune Discrepancy

XLNet: Generalized Autoregressive Pretraining for Language Understanding, by Carnegie Mellon University, and Google AI Brain Team

XLNet2019 NeurIPS, Over 4900 Citations(Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Language Model, BERT, Transformer

- BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
**XLNet**, a**generalized autoregressive pretraining**method, is proposed:

- It enables
**learning bidirectional contexts**by maximizing the expected likelihood over all permutations of the factorization order. - It
**overcomes the limitations of****BERT****integrates ideas from****Transformer-XL**.

# Outline

**Motivations****XLNet: Permutation Language Modeling****XLNet: Partial Prediction****XLNet: Incorporating Ideas from****Transformer-XL****Experimental Results**

# 1. Motivations

## 1.1. Preliminaries

- AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model, such as GPT-1, GPT-2, GPT-3.
**Given a text sequence**, AR language modeling*x*=(*x*1, …,*xT*)**factorizes the likelihood into a forward product**:

- or a
**backward product**:

- A parametric model (e.g. a neural network) is trained to model each conditional distribution.

## 1.2. Autoregressive (AR) Model Problem

- An
**AR language model**is only trained to**encode a uni-directional context**(either forward or backward). AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization:

- where
is a*hθ*(*x*1:*t*-1)**context representation**produced by neural models, such as RNNs or Transformers, and*e*(*x*)**embedding**of*x*. - It is
**not effective at modeling deep bidirectional contexts**. On the contrary,**downstream language understanding tasks**often**require bidirectional context information**. - This results in a
**gap between AR language modeling and effective pretraining**.

## 1.3. Autoencoding (AE) Model Problem

- AE models, such as BERT, it randomly masks
*x*constructs a corrupted version ^*x*their training objective is to**reconstruct bar(**:*x*) from ^*x*

- where
*mt*=1 indicates*xt*is masked, andis a Transformer.*Hθ* - The artificial symbols like
**[MASK] used by****BERT****during pretraining**are**absent from real data at finetuning time**, resulting in a**pretrain-finetune discrepancy**. - Moreover, since the predicted tokens are masked in the input,
**BERT****is not able to model the joint probability using the product rule as in AR**language modeling.

A natural question to ask is whether there exists a pretraining objective that brings the advantages of both while avoiding their weaknesses.

# 2. XLNet: Permutation Language Modeling

## 2.1. Permutations

- For
**a sequence**, there are*x*of length*T*to perform a valid autoregressive factorization. Intuitively, if model parameters are shared across all factorization orders, in expectation, the model will learn to gather information from all positions on both sides.*T*! different orders - Let
be*ZT***the set of all possible permutations**of the length-*T*index sequence [1, 2, …,*T*].The Permutation Language Modeling**objective**is:

- Note that the proposed objective
**only permutes the factorization order, NOT the sequence order.**

- The above figure shows an example of
**predicting the token**but under*x*3 given the same input sequence*x***different factorization orders**. - The
**standard Softmax**formulation, as below cannot be used:

- Because
*hθ does not contain any position information.*

## 2.2. Content and Query Representations

- To avoid this problem, XLNet proposes to re-parameterize the next-token distribution to be
**target position aware**:

- where
denotes*gθ*(*xz*<*t*,*zt*)**a new type of representations**which additionally**take the target position**.*zt*as input - Thus, two sets of hidden representations are used instead of one:

**The content representation**, or abbreviated as*hθ*(*xz*≤*t*), which serves a similar role to the standard hidden states in Transformer.*hzt***The query representation**, or abbreviated as*gθ*(*xz*<*t*,*zt*), which*gzt***only has access to the contextual information**, but not the content*xz*<*t*and the position*zt**xzt*.

- The
**first layer query stream**is initialized with a**trainable vector**, i.e., while the*g*(0)*i*=*w***content stream**is set to the corresponding**word embedding**, i.e..*h*(0)*i*=*e*(*xi*) - For
**each self-attention layer**, the two streams of representations are schematically updated with*m*=1, …,*M***a shared set of parameters**as follows:

- where
*Q*,*K*,*V*denote the query, key, and value in an attention operation. **During finetuning**, we can**simply drop the query stream**and use the content stream as a normal Transformer(-XL).

# 3. XLNet: Partial Prediction

- To reduce the optimization difficulty,
**only the last tokens in a factorization order are predicted**. - Formally,
*z*is split**non-target subsequence**and a*z*≤*c***target subsequence**, where*z*>*c*is the*c***cutting point**. The objective is to maximize the log-likelihood of the target subsequence conditioned on the non-target subsequence:

- A hyperparameter
*K*is introduced to adjust the subsequence length:

# 4. XLNet: Incorporating Ideas from **Transformer-XL**

- Two important techniques in
**Transformer-XL**are integrated, namely the**relative positional encoding scheme**and the**segment recurrence mechanism.**

- The
**“mem”**part is**the segment recurrence mechanism concept of****Transformer-XL**. It**caches the previous states**to handle long sequence. (Please feel free to read Transformer-XL if interested.) - For example,
**when the order is [3, 2, 4, 1]**,**only those hidden states that have been estimated will contribute**the another hidden state estimation.

- Similar for query representation.
- (This section has too many details, so I don’t go deep here. Please feel free to read the paper directly.)

# 5. Experimental Results

## 5.1. Fair Comparison with BERT

- All models are trained using the same data and hyperparameters as in BERT. The best of 3 BERT variants are used for comparison; i.e., the original BERT, BERT with whole word masking, and BERT without next sentence prediction.

Trained on the same data with an almost identical training recipe,

XLNet outperformsBERTby a sizable marginon all the considered datasets.

## 5.2. SOTA Comparisons Using Multiple Datasets

- Numerous datasets, such as SQuaD, IMDB, GLUE, are evaluated.

## Reference

[2019 NeurIPS] [XLNet]

XLNet: Generalized Autoregressive Pretraining for Language Understanding

## Language Model

**2007 …** **2019** [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2] [DistilBERT] [MT-DNN] [Sparse Transformer] [SuperGLUE] [FAIRSEQ] [XLNet] **2020 **[ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT]