# Review — GLM: General Language Model Pretraining with Autoregressive Blank Infilling

## GLM Fills in the Blanks Autoregressively; GLM-130B Was Later Built on GLM and Accepted at ICLR 2023

---

GLM: General Language Model Pretraining with Autoregressive Blank Infilling, GLM, by Tsinghua University, Beijing Academy of Artificial Intelligence (BAAI), MIT CSAIL, and Shanghai Qi Zhi Institute, 2022 ACL, Over 15 Citations (Sik-Ho Tsang @ Medium)


- **General Language Model (GLM)** is proposed based on **autoregressive blank infilling**.
- GLM improves blank-infilling pretraining by **adding 2D positional encodings** and **allowing an arbitrary order to predict spans**.
- Meanwhile, GLM can be **pretrained for different types of tasks** by **varying the number and lengths of blanks**.
- Later on, **GLM-130B** is built based on GLM and accepted at **ICLR 2023**. (Hope I can review it later.)

# Outline

1. **GLM Pretraining**
2. **GLM Model Architecture**
3. **Results**

# 1. GLM Pretraining

**GLM formulates NLU tasks as cloze questions** that contain task descriptions, which can be **answered by autoregressive generation**.

## 1.1. Autoregressive Blank Infilling

- Given an **input text** *x* = [*x*1, …, *xn*], multiple text spans {*s*1, …, *sm*} are sampled, where **each span** *si* corresponds to **a series of consecutive tokens** [*s*{i,1}, …, *s*{i,li}] in *x*.
- **Each span is replaced with a single [MASK] token**, forming a **corrupted text** *x*corrupt.

When predicting the missing tokens in a span, the model has access to the corrupted text *x*corrupt and the previously predicted spans.

- The **order** of the spans is **randomly permuted**, similar to the **permutation language model (XLNet)**. Formally, let *Zm* be the set of **all possible permutations** of the length-*m* index sequence [1, 2, …, *m*], and *s*{z<*i*} be [*s*{z1}, …, *s*{z(*i*−1)}]; the **pretraining objective** is defined as:
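The equation image is missing here; reconstructed from the paper's definitions, the objective is:

```latex
\max_{\theta} \;
\mathbb{E}_{z \sim Z_m}
\left[
  \sum_{i=1}^{m} \log p_{\theta}\!\left(s_{z_i} \mid x_{\text{corrupt}},\, s_{z_{<i}}\right)
\right]
```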

- The **tokens in each blank** are always **generated** following a **left-to-right order**, i.e. **the probability of generating the span** *si* is factorized as:
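The factorization image is missing here; reconstructed from the paper, the left-to-right factorization of a span is:

```latex
p_{\theta}\!\left(s_i \mid x_{\text{corrupt}},\, s_{z_{<i}}\right)
= \prod_{j=1}^{l_i} p\!\left(s_{i,j} \mid x_{\text{corrupt}},\, s_{z_{<i}},\, s_{i,<j}\right)
```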

- The **input** *x* is divided into two parts: **Part A** is the **corrupted text** *x*corrupt, and **Part B** consists of the **masked spans**.

**Part A tokens** can attend to each other, but **cannot attend to any tokens in B**. **Part B tokens** can attend to Part A and to antecedents in B, but **cannot attend to any subsequent tokens in B**.
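This masking rule can be sketched as a small helper (a hypothetical illustration, not the authors' code), assuming the sequence is Part A followed by Part B:

```python
def glm_attention_mask(len_a, len_b):
    """Boolean mask: mask[i][j] is True iff query token i may attend to key token j.

    The sequence is Part A (bidirectional) followed by Part B (causal)."""
    n = len_a + len_b
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < len_a:
                # Every token (Part A or Part B) can attend to Part A.
                mask[i][j] = True
            elif i >= len_a and j <= i:
                # Part B tokens attend only to earlier (and current) Part B tokens.
                mask[i][j] = True
    return mask
```

For example, with `len_a=3` and `len_b=2`, token 0 (Part A) cannot see token 3 (Part B), while token 4 (Part B) can see token 3.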

- To enable autoregressive generation, **each span is padded with the special tokens [START] and [END]**, for input and output respectively.

In this way, GLM automatically learns a **bidirectional encoder** (for Part A) and a **unidirectional decoder** (for Part B) in a unified model.

**Span lengths** are randomly sampled from a **Poisson distribution with *λ* = 3**. New spans are **repeatedly sampled until at least 15% of the original tokens are masked**. Empirically, the 15% ratio is critical for good performance on downstream NLU tasks.
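This sampling loop can be sketched as follows (a minimal illustration with a stdlib Poisson sampler; the paper additionally places the spans in the text, which is omitted here):

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's inversion-by-multiplication Poisson sampler; fine for small lambda.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_span_lengths(n_tokens, mask_ratio=0.15, lam=3, seed=0):
    # Repeatedly sample span lengths ~ Poisson(lam) until at least
    # mask_ratio of the original tokens would be masked.
    rng = random.Random(seed)
    lengths, covered = [], 0
    while covered < mask_ratio * n_tokens:
        length = max(1, sample_poisson(lam, rng))  # avoid zero-length spans
        lengths.append(length)
        covered += length
    return lengths
```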

## 1.2. Multi-Task Pretraining

- Authors are interested in **pretraining a single model** that can **handle both NLU and text generation**.
- A multi-task pretraining setup is studied, in which a **second objective** of **generating longer text** is **jointly optimized with the blank-infilling objective**. The following **two objectives** are considered:

- **Document-level**: A **single span is sampled**, whose **length is sampled from a uniform distribution over 50%–100% of the original length**. This objective aims at **long text generation**.
- **Sentence-level**: It is restricted that **the masked spans must be full sentences**. **Multiple spans (sentences) are sampled to cover 15%** of the original tokens. This objective aims at **seq2seq tasks** whose **predictions are often complete sentences or paragraphs**.

- Their only difference is the number of spans and the span lengths.

# 2. GLM Model Architecture

## 2.1. Model Architecture

- GLM uses a single **Transformer** with **several modifications** to the architecture: (1) **the order of layer normalization and the residual connection is rearranged**, which has been shown to be critical for large-scale language models to avoid numerical errors (as in **Megatron-LM**); (2) a **single linear layer** is used for the output token prediction; (3) **ReLUs are replaced with GELUs**.

## 2.2. 2D Positional Encoding

- Each token is encoded with **two positional ids**.
- The **first** positional id represents **the position in the corrupted text** *x*corrupt. For the masked spans, it is the position of the corresponding [MASK] token.
- The **second** positional id represents the **intra-span position**. For tokens in Part A, the second positional id is 0; for tokens in Part B, it ranges from 1 to the length of the span.
- The two positional ids are **projected into two vectors** via **learnable embedding tables**, which are **both added to the input token embeddings**.
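A toy sketch of the two id sequences (a hypothetical helper; the [START]/[END] tokens are omitted for brevity):

```python
def glm_2d_positions(len_a, spans):
    """len_a: length of the corrupted text (Part A).
    spans: list of (mask_position_in_part_a, span_length) for Part B, in order.
    Returns the two positional-id lists for the sequence Part A + Part B."""
    pos1 = list(range(len_a))   # first id: position in the corrupted text
    pos2 = [0] * len_a          # second id: 0 for every Part A token
    for mask_pos, length in spans:
        pos1 += [mask_pos] * length         # span tokens reuse the [MASK] position
        pos2 += list(range(1, length + 1))  # intra-span positions 1..length
    return pos1, pos2
```

For a 5-token corrupted text with one 3-token span whose [MASK] sits at position 2, this yields `pos1 = [0, 1, 2, 3, 4, 2, 2, 2]` and `pos2 = [0, 0, 0, 0, 0, 1, 2, 3]`.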

## 2.3. Finetuning GLM

- Typically, for downstream NLU tasks, a **linear classifier** is added on top of the model, leading to **inconsistency between pretraining and finetuning**.
- Here, NLU classification tasks are reformulated as **generation tasks of blank infilling**, as above.
- Specifically, given a **labeled example** (*x*, *y*), the **input text** *x* is converted to a cloze question *c*(*x*) via a pattern containing a single mask token.
- The conditional probability of **predicting *y* given *x*** is:
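The formula image is missing here; reconstructed from the paper, where *v*(*y*) maps a label to an answer word and *Y* is the label set:

```latex
p(y \mid x) =
\frac{p\big(v(y) \mid c(x)\big)}
     {\sum_{y' \in \mathcal{Y}} p\big(v(y') \mid c(x)\big)}
```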

- As an example in the figure, the labels “positive” and “negative” are mapped to the words “good” and “bad”. In this case,
**GLM is fine-tuned with a cross-entropy loss**.
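The normalization over candidate answer words can be sketched as follows (hypothetical log-probability scores; the model call that produces them is omitted):

```python
import math

def cloze_label_probs(answer_logprobs):
    # answer_logprobs: {label: log p(v(label) | c(x))}, i.e. the model's score
    # for each verbalizer word filling the blank. Softmax over the labels.
    m = max(answer_logprobs.values())
    exps = {y: math.exp(lp - m) for y, lp in answer_logprobs.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}
```

For instance, `cloze_label_probs({"positive": -1.2, "negative": -3.0})` assigns the higher probability to "positive".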

# 3. Results

## 3.1. SuperGLUE

- The pretrained GLM models are
**fine-tuned on each task**.

GLM consistently outperforms **BERT** on most tasks with either the base or the large architecture. On average, GLM_Base scores 4.6% higher than BERT_Base, and GLM_Large scores 5.0% higher than BERT_Large.

- In the setting of RoBERTa_Large, **GLM_RoBERTa can still achieve improvements** over the baselines, but with a smaller margin.
- Specifically, **GLM_RoBERTa outperforms T5_Large while being only half its size**.
- With **multi-task pretraining**, within one training batch, **short spans and longer spans** (document-level or sentence-level) are sampled with equal chances.

GLM_Doc and GLM_Sent perform slightly worse than GLM_Large, but still outperform BERT_Large and UniLM_Large. Among multi-task models, GLM_Sent outperforms GLM_Doc by 1.1% on average.

## 3.2. Sequence-to-Sequence

GLM_RoBERTa can achieve performance matching the Seq2Seq **BART** model, and outperform **T5** and **UniLMv2**.

**Tables 3 & 4**: GLM_Large can achieve performance matching the other pretraining models on the two generation tasks. GLM_Sent can perform better than GLM_Large, while GLM_Doc performs slightly worse than GLM_Large.

## 3.3. Text Infilling

**Table 5**: GLM outperforms previous methods by large margins (1.3 to 3.9 BLEU) and achieves the state-of-the-art result on this dataset.

## 3.4. Language Modeling

**Figure 4**: All the models are evaluated in the**zero-shot setting**.

Increasing the model's parameters to 410M (1.25× of GPT_Large) leads to a performance close to that of GPT_Large.

- (There are also ablation experiments, please feel free to read the paper directly if you’re interested.)