Review — UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

UniLMv2: Jointly Training Autoencoding (AE) and Partially Autoregressive (PAR) Objectives

Given input x1, …, x6, the tokens x2, x4, x5 are masked by the special tokens [M] and [P]. For each example, UniLMv2 jointly trains two types of LMs, namely autoencoding (AE) and partially autoregressive (PAR) masked LMs.
  • A unified language model is pretrained for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM).
  • Conventional masks are used to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks are used to learn intra-relations between masked spans via partially autoregressive modeling.


  1. Unified Language Model Pre-Training v2 (UniLMv2)
  2. UniLMv2 Illustrative Example
  3. Results

1. Unified Language Model Pre-Training v2 (UniLMv2)

Overview of Pseudo-Masked Language Model (PMLM) pre-training.
Comparisons with other pretraining objectives. (Given input x = x1, …, x6, the tokens x2; x4; x5 are masked.)

1.1. Autoencoding Modeling

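  • This is the same masked language modeling objective as in BERT.
  • Given the original input x = x1, …, x|x| and the mask positions M = {m1, …, m|M|}, the probability of the masked tokens is computed as the product of the per-token probabilities p(xm | x\M) over all m in M, where \ denotes set minus, i.e., x\M excludes the masked tokens.
  • The autoencoding pre-training loss is defined as the negative log-likelihood of the masked tokens, summed over the training corpus.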
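Written out in LaTeX, the probability and loss described in the bullets above take roughly the following form. This is a sketch that follows the notation of the bullets; the symbol \mathcal{D} for the training corpus and the name \mathcal{L}_{\mathrm{AE}} are assumed here for illustration.

```latex
% Probability of the masked tokens given the unmasked context
\[
p(x_M \mid x_{\setminus M}) = \prod_{m \in M} p(x_m \mid x_{\setminus M})
\]

% Autoencoding loss: negative log-likelihood over the corpus D (assumed symbol)
\[
\mathcal{L}_{\mathrm{AE}} = - \sum_{x \in \mathcal{D}} \log \prod_{m \in M} p(x_m \mid x_{\setminus M})
\]
```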

1.2. Partially Autoregressive Modeling

  • In each factorization step, the model can predict one or multiple tokens.
  • Let M = <M1, …, M|M|> denote the factorization order, where Mi = {mi1, …, mi|Mi|} is the set of mask positions in the i-th factorization step.
  • If all factorization steps only contain one masked token (i.e., |Mi| = 1), the modeling becomes autoregressive.
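  • The probability of the masked tokens is decomposed into a product over the factorization steps: step i predicts the tokens at positions Mi, conditioned on everything except the positions that are still masked at that step (Mi and all later steps).
  • The partially autoregressive pre-training loss is the negative log-likelihood under this decomposition, taken in expectation (EM) over the factorization distribution.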
  • During pre-training, UniLMv2 randomly samples one factorization order M for each input text.
  • As described in the paper's algorithm, UniLMv2 randomly samples 15% of the original tokens as masked tokens. Among them, 40% of the time an n-gram block (n=6) is masked, and 60% of the time a single token is masked.
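The sampling described in the last two bullets can be sketched in a few lines of Python. This is a minimal illustration, not the paper's algorithm: the function name `sample_factorization_order`, its parameters, the fixed block length of 6, the non-overlap rule, and the clipping of the last span to the remaining budget are all assumptions made here. Each sampled span becomes one factorization step, and the order in which spans are drawn is taken as the factorization order.

```python
import random


def sample_factorization_order(num_tokens, mask_ratio=0.15, block_prob=0.4,
                               ngram_len=6, rng=None):
    """Sample masked spans for a sequence of num_tokens (>= 1) tokens.

    Returns a list of spans; each span is a list of 0-indexed positions and
    corresponds to one factorization step. All parameter names are
    hypothetical and chosen for this sketch only.
    """
    rng = rng or random.Random()
    budget = max(1, round(num_tokens * mask_ratio))  # about 15% of the tokens
    masked = set()
    steps = []
    attempts = 0
    # The attempt cap only guards against pathological inputs.
    while len(masked) < budget and attempts < 100 * num_tokens:
        attempts += 1
        # 40% of the time mask an n-gram block, otherwise a single token.
        length = ngram_len if rng.random() < block_prob else 1
        # Assumption: never exceed the masking budget.
        length = min(length, budget - len(masked))
        start = rng.randrange(num_tokens - length + 1)
        span = list(range(start, start + length))
        # Assumption: spans must not overlap previously masked positions.
        if any(pos in masked for pos in span):
            continue
        masked.update(span)
        steps.append(span)
    return steps


if __name__ == "__main__":
    for step, span in enumerate(sample_factorization_order(40), start=1):
        print(f"step {step}: positions {span}")
```

A span of length 1 gives a purely autoregressive step, while a longer span gives a step in which several tokens are predicted together.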
Comparisons between autoencoding (AE), autoregressive (AR), and partially autoregressive (PAR) masked language models.
  • The above figure shows the differences between AE, AR, and the proposed PAR.
  • Both the special tokens [M] and [P] emit predicted tokens. The training objective is to maximize the likelihood of the correct tokens under both types of LMs, i.e., the AE and PAR objectives are optimized jointly.
  • The model has the same size as BERT-BASE, i.e., a 12-layer Transformer with 12 attention heads, for ease of comparison.
  • Relative position bias, as in Shaw NAACL’18, is added to attention scores.
  • 160GB of text corpora is used, drawn from English Wikipedia, BookCorpus, OpenWebText, CC-News, and Stories. WordPiece tokenization is used, with a vocabulary size of 30,522.
  • The batch size was set to 7680. Pre-training ran for 0.5 million steps, which took about 20 days on 64 Nvidia V100-32GB GPU cards.
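The relative position bias mentioned in the setup above can be illustrated with a small sketch. It shows only the general idea of adding a learned scalar, indexed by the clipped offset between two positions, to each attention score. The function names, the clipping distance, and the table values are hypothetical, and the exact parameterization used in UniLMv2 may differ.

```python
def relative_bias(position_ids, bias_table, max_distance):
    """Build a matrix of additive attention biases.

    bias_table holds one scalar per clipped offset in
    [-max_distance, max_distance]; in a real model these would be learned.
    """
    assert len(bias_table) == 2 * max_distance + 1
    matrix = []
    for pos_q in position_ids:
        row = []
        for pos_k in position_ids:
            # Clip the signed offset so distant pairs share one bias value.
            offset = max(-max_distance, min(max_distance, pos_k - pos_q))
            row.append(bias_table[offset + max_distance])
        matrix.append(row)
    return matrix


def add_bias(scores, bias):
    """Add the bias matrix to raw attention scores, element by element."""
    return [[s + b for s, b in zip(s_row, b_row)]
            for s_row, b_row in zip(scores, bias)]


if __name__ == "__main__":
    # Made-up values: three tokens at positions 1..3, offsets clipped to +/-2.
    table = [-0.2, -0.1, 0.0, 0.1, 0.2]
    bias = relative_bias([1, 2, 3], table, max_distance=2)
    scores = [[0.0] * 3 for _ in range(3)]
    for row in add_bias(scores, bias):
        print(row)
```

Because the bias depends on position ids rather than on the index in the input sequence, two tokens that share a position id receive the same bias with respect to every other token.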

2. UniLMv2 Illustrative Example

Example of the factorization steps 4, 5 → 2. The masks [P] and [M] are assigned with the same position embeddings as the corresponding tokens. Different context is used to compute the hidden states for the pseudo masks of x4; x5 and x2.
  • As shown above, the example’s factorization order is 4, 5 → 2.
  • When we compute p(x4, x5|x\{2,4,5}), only x1, x3, x6 and the pseudo masks of x4, x5 are conditioned on. The original tokens of x4, x5 are masked to avoid information leakage, while their pseudo tokens [P] are used as placeholders for MLM predictions.
  • In the second step, the tokens x1, x3, x4, x5, x6 and the pseudo mask of x2 are conditioned on to compute p(x2|x\{2}). Unlike in the first step, the original tokens of x4, x5 are used for the prediction.
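The two steps above are one instance of the partially autoregressive decomposition from Section 1.2. As a sketch in LaTeX, with M_{\ge i} (the union of the mask positions from step i onward), \mathcal{D} (the training corpus), and the name \mathcal{L}_{\mathrm{PAR}} being notation assumed here:

```latex
% General decomposition over the factorization steps M = <M_1, ..., M_{|M|}>
\[
p(x_M \mid x_{\setminus M})
  = \prod_{i=1}^{|M|} p\bigl(x_{M_i} \mid x_{\setminus M_{\ge i}}\bigr),
\qquad M_{\ge i} = \bigcup_{j \ge i} M_j
\]

% Instance for the order 4, 5 -> 2, i.e. M_1 = {4, 5}, M_2 = {2}
\[
p(x_2, x_4, x_5 \mid x_{\setminus \{2,4,5\}})
  = p(x_4, x_5 \mid x_{\setminus \{2,4,5\}}) \; p(x_2 \mid x_{\setminus \{2\}})
\]

% PAR loss: expectation over sampled factorization orders
\[
\mathcal{L}_{\mathrm{PAR}}
  = - \sum_{x \in \mathcal{D}} \mathbb{E}_{M} \log p(x_M \mid x_{\setminus M})
\]
```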
Self-attention mask for the factorization order 4, 5 → 2.
  • The above figure shows the self-attention mask matrix used for the example. Both conventional masks [M] and given context (x1, x3, x6) can be attended by all the tokens.
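A mask matrix of this kind can be built programmatically. The sketch below is one reading of the rules stated in this section, not the authors' code. The layout of the appended tokens, the labels such as `[P]@4`, and the rules for which tokens may attend to pseudo masks and to the original masked tokens are assumptions. What comes from the text is that context and [M] are visible to every token, that [P] tokens in step 1 do not see the original x4, x5, and that step 2 does see them.

```python
# Each entry: (label, position id, kind, factorization step).
# Pseudo masks [P] and the original tokens reuse the position id of the
# token they stand for. The ordering of the appended tokens is assumed.
TOKENS = [
    ("x1", 1, "ctx", None),
    ("[M]@2", 2, "M", None),
    ("x3", 3, "ctx", None),
    ("[M]@4", 4, "M", None),
    ("[M]@5", 5, "M", None),
    ("x6", 6, "ctx", None),
    ("[P]@4", 4, "P", 1),
    ("[P]@5", 5, "P", 1),
    ("x4", 4, "orig", 1),
    ("x5", 5, "orig", 1),
    ("[P]@2", 2, "P", 2),
    ("x2", 2, "orig", 2),
]


def can_attend(query, key):
    """Return True if the query token may attend to the key token."""
    _, _, q_kind, q_step = query
    _, _, k_kind, k_step = key
    # Context and conventional masks are visible to every token.
    if k_kind in ("ctx", "M"):
        return True
    # Context and [M] never see pseudo masks or original masked tokens,
    # so the autoencoding predictions cannot leak the answers.
    if q_kind in ("ctx", "M"):
        return False
    # Assumption: a pseudo mask is visible only to pseudo masks of its step.
    if k_kind == "P":
        return q_kind == "P" and q_step == k_step
    # Key is an original masked token.
    if q_kind == "P":
        # A pseudo mask sees only tokens predicted in earlier steps.
        return k_step < q_step
    # Assumption: an original token sees originals up to its own step.
    return k_step <= q_step


mask = [[can_attend(q, k) for k in TOKENS] for q in TOKENS]

if __name__ == "__main__":
    for (label, _, _, _), row in zip(TOKENS, mask):
        print(f"{label:>6} " + " ".join("1" if allowed else "." for allowed in row))
```

Running the script prints one row per query token, with `1` marking the keys that token may attend to.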

3. Results

3.1. SQuAD & GLUE

Left: Results of BASE-size pre-trained models on the SQuAD v1.1/v2.0 development sets. Right: Results of BASE-size models on the development set of the GLUE benchmark.

3.2. Abstractive Summarization

Abstractive summarization results on CNN/DailyMail and XSum.

3.3. Question Generation

Results on question generation.

3.4. Ablation Study

Comparisons between the pre-training objectives.


