Review — XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet, Reducing the Pretrain-Finetune Discrepancy
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet, by Carnegie Mellon University and Google AI Brain Team
2019 NeurIPS, Over 4900 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, BERT, Transformer
- BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
- XLNet, a generalized autoregressive pretraining method, is proposed:
- It enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
- It overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, it integrates ideas from Transformer-XL.
Outline
- Motivations
- XLNet: Permutation Language Modeling
- XLNet: Partial Prediction
- XLNet: Incorporating Ideas from Transformer-XL
- Experimental Results
1. Motivations
1.1. Preliminaries
- AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model, such as GPT-1, GPT-2, GPT-3.
- Given a text sequence x = (x1, …, xT), AR language modeling factorizes the likelihood either into a forward product or into a backward product (both are written out after this list).
- A parametric model (e.g. a neural network) is trained to model each conditional distribution.
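For reference, the two factorizations referred to above, written out in the paper's notation:

```latex
% Forward and backward autoregressive factorizations of the likelihood
\[ p(\mathbf{x}) \;=\; \prod_{t=1}^{T} p(x_t \mid \mathbf{x}_{<t})
   \qquad \text{(forward)}, \qquad
   p(\mathbf{x}) \;=\; \prod_{t=T}^{1} p(x_t \mid \mathbf{x}_{>t})
   \qquad \text{(backward)}. \]
```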
1.2. Autoregressive (AR) Model Problem
- An AR language model is only trained to encode a uni-directional context (either forward or backward). AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization (the objective is written out after this list), where hθ(x1:t-1) is a context representation produced by neural models such as RNNs or Transformers, and e(x) denotes the embedding of x.
- It is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information.
- This results in a gap between AR language modeling and effective pretraining.
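The forward AR objective referenced above, written out in the paper's notation (hθ and e(x) as defined in the list):

```latex
% Forward AR pretraining objective: each conditional is a Softmax over the vocabulary
\[ \max_{\theta}\; \log p_{\theta}(\mathbf{x})
   \;=\; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
   \;=\; \sum_{t=1}^{T} \log
   \frac{\exp\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x_t)\big)}
        {\sum_{x'} \exp\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x')\big)} \]
```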
1.3. Autoencoding (AE) Model Problem
- AE models, such as BERT, randomly mask some tokens of x to construct a corrupted version x̂; the training objective is to reconstruct the masked tokens x̄ from x̂ (written out after this list), where mt = 1 indicates that xt is masked, and Hθ is a Transformer.
- The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy.
- Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling.
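BERT's reconstruction objective referenced above, written out in the paper's notation (the ≈ comes from the independence assumption over masked tokens mentioned in the list):

```latex
% BERT's masked-token reconstruction objective; m_t = 1 marks a masked token
\[ \max_{\theta}\; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
   \;\approx\; \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
   \;=\; \sum_{t=1}^{T} m_t \log
   \frac{\exp\big(H_{\theta}(\hat{\mathbf{x}})_t^{\top} e(x_t)\big)}
        {\sum_{x'} \exp\big(H_{\theta}(\hat{\mathbf{x}})_t^{\top} e(x')\big)} \]
```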
A natural question to ask is whether there exists a pretraining objective that brings the advantages of both while avoiding their weaknesses.
2. XLNet: Permutation Language Modeling
2.1. Permutations
- For a sequence x of length T, there are T! different orders to perform a valid autoregressive factorization. Intuitively, if model parameters are shared across all factorization orders, in expectation, the model will learn to gather information from all positions on both sides.
- Let ZT be the set of all possible permutations of the length-T index sequence [1, 2, …, T]. The Permutation Language Modeling objective maximizes, in expectation over factorization orders z sampled from ZT, the log-likelihood of each token given the tokens that precede it in the sampled order (written out after this list).
- Note that the proposed objective only permutes the factorization order, NOT the sequence order.
- The paper's illustration (not reproduced here) shows an example of predicting the token x3 given the same input sequence x but under different factorization orders.
- The standard Softmax formulation (written out after this list) cannot be used, because the representation hθ(xz<t) does not depend on the target position zt; the model would predict the same distribution no matter which position it is asked to predict.
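The permutation language modeling objective and the standard (position-unaware) Softmax discussed in this subsection, written out in the paper's notation:

```latex
% Permutation language modeling objective: expectation over factorization orders z ~ Z_T
\[ \max_{\theta}\;
   \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
   \Big[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \Big] \]

% Standard Softmax: h_theta(x_{z<t}) is the same whichever target position is predicted
\[ p_{\theta}\big(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\big)
   \;=\; \frac{\exp\big(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\big)}
              {\sum_{x'} \exp\big(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\big)} \]
```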
2.2. Content and Query Representations
- To avoid this problem, XLNet proposes to re-parameterize the next-token distribution to be target position aware (written out after this list), where gθ(xz<t, zt) denotes a new type of representation which additionally takes the target position zt as input.
- Thus, two sets of hidden representations are used instead of one:
- The content representation hθ(xz≤t), or abbreviated as hzt, which serves a similar role to the standard hidden states in Transformer.
- The query representation gθ(xz<t, zt), or abbreviated as gzt, which only has access to the contextual information xz<t and the position zt, but not the content xzt.
- The first-layer query stream is initialized with a trainable vector, i.e. gi(0) = w, while the content stream is set to the corresponding word embedding, i.e. hi(0) = e(xi).
- For each self-attention layer m = 1, …, M, the two streams of representations are schematically updated with a shared set of parameters (the update rules are written out after this list), where Q, K, V denote the query, key, and value in an attention operation.
- During finetuning, we can simply drop the query stream and use the content stream as a normal Transformer(-XL).
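The target-position-aware distribution and the two-stream update rules referenced in this subsection, written out in the paper's notation:

```latex
% Target-position-aware re-parameterization using the query representation g_theta
\[ p_{\theta}\big(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\big)
   \;=\; \frac{\exp\big(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\big)}
              {\sum_{x'} \exp\big(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\big)} \]

% Two-stream self-attention updates for layer m = 1, ..., M (shared parameters theta)
\[ g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = g_{z_t}^{(m-1)},\; KV = \mathbf{h}_{\mathbf{z}_{<t}}^{(m-1)};\, \theta\big)
   \quad \text{(query stream: uses } z_t \text{ but not } x_{z_t}) \]
\[ h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = h_{z_t}^{(m-1)},\; KV = \mathbf{h}_{\mathbf{z}_{\le t}}^{(m-1)};\, \theta\big)
   \quad \text{(content stream: uses both } z_t \text{ and } x_{z_t}) \]
```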
3. XLNet: Partial Prediction
- To reduce the optimization difficulty, only the last tokens in a factorization order are predicted.
- Formally, z is split into a non-target subsequence z≤c and a target subsequence z>c, where c is the cutting point. The objective is to maximize the log-likelihood of the target subsequence conditioned on the non-target subsequence, i.e. Ez~ZT [log pθ(xz>c | xz≤c)].
- A hyperparameter K is introduced to adjust how many tokens are predicted: c is chosen such that |z| / (|z| − c) ≈ K, so roughly 1/K of the tokens in the sequence are selected as prediction targets (see the sketch below).
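As a concrete illustration, here is a minimal sketch of my own (not the authors' implementation; the rounding rule and variable names are assumptions) of how the cutting point c could be derived from K:

```python
import random

def partial_prediction_split(seq_len, K, seed=0):
    """Sample a factorization order z over token positions and keep only the
    last ~1/K of it as prediction targets, so that |z| / (|z| - c) is roughly K."""
    rng = random.Random(seed)
    z = list(range(seq_len))
    rng.shuffle(z)                      # factorization order; the token sequence itself is untouched
    num_targets = max(1, round(seq_len / K))
    c = seq_len - num_targets           # cutting point: z[:c] is context-only, z[c:] is predicted
    return z[:c], z[c:]

non_target, target = partial_prediction_split(seq_len=8, K=4)
print("non-target subsequence z<=c:", non_target)
print("target subsequence     z>c :", target)
```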
4. XLNet: Incorporating Ideas from Transformer-XL
- Two important techniques in Transformer-XL are integrated, namely the relative positional encoding scheme and the segment recurrence mechanism.
- The “mem” blocks in the paper's figures correspond to the segment recurrence mechanism of Transformer-XL: hidden states of the previous segment are cached and reused, so that long sequences can be handled. (Please feel free to read Transformer-XL if interested.)
- For example, when the factorization order is [3, 2, 4, 1], the content hidden state of a position is estimated using only the hidden states of the positions that come earlier in this order (plus itself and the cached memory), as illustrated in the sketch below.
- The query representation is updated similarly, except that a position cannot access its own content.
- (This section has too many details, so I don’t go deep here. Please feel free to read the paper directly.)
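To make the [3, 2, 4, 1] example concrete, here is a minimal sketch (my own illustration, not the released XLNet code) that builds the content-stream and query-stream attention masks implied by that factorization order:

```python
def permutation_masks(order):
    """Build content/query attention masks for a factorization order.

    `order` lists 1-based token positions in prediction order, e.g. [3, 2, 4, 1].
    Content stream: a position may attend to positions at or before it in the order (x_{z<=t}).
    Query stream:   a position may attend only to positions strictly before it (x_{z<t}),
                    i.e. it must not see its own content.
    Row i / column j of each mask correspond to token positions i+1 and j+1.
    """
    T = len(order)
    rank = {pos: t for t, pos in enumerate(order)}   # step at which each position is predicted
    content = [[rank[j + 1] <= rank[i + 1] for j in range(T)] for i in range(T)]
    query = [[rank[j + 1] < rank[i + 1] for j in range(T)] for i in range(T)]
    return content, query

content_mask, query_mask = permutation_masks([3, 2, 4, 1])
for name, mask in [("content stream", content_mask), ("query stream", query_mask)]:
    print(name)
    for row in mask:
        print(["x" if allowed else "." for allowed in row])
```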
5. Experimental Results
5.1. Fair Comparison with BERT
- All models are trained using the same data and hyperparameters as in BERT. The best of three BERT variants is used for comparison, i.e., the original BERT, BERT with whole word masking, and BERT without next sentence prediction.
Trained on the same data with an almost identical training recipe, XLNet outperforms BERT by a sizable margin on all the considered datasets.
5.2. SOTA Comparisons Using Multiple Datasets
- Numerous datasets, such as SQuAD, IMDB, and GLUE, are evaluated.
Reference
[2019 NeurIPS] [XLNet]
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Language Model
2007 … 2019 [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2] [DistilBERT] [MT-DNN] [Sparse Transformer] [SuperGLUE] [FAIRSEQ] [XLNet] 2020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT]