# Review — Pre-LN Transformer: On Layer Normalization in the Transformer Architecture

## Pre-LN Transformer, Warm-Up Stage is Skipped

---

On Layer Normalization in the Transformer Architecture (Pre-LN Transformer), by Microsoft Research Asia, University of Chinese Academy of Sciences, Peking University, Microsoft Research, and Nankai University, 2020 ICML, Over 100 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Language Model, Machine Translation, Transformer, Layer Normalization, BERT

- In the **original Post-LN Transformer**, which places the layer normalization between the residual blocks, the **expected gradients** of the parameters near the output layer are **large** at initialization. Using a large learning rate on those gradients makes the **training unstable**; the warm-up stage helps to avoid this.
- In the **Pre-LN Transformer**, where the **layer normalization is put inside the residual blocks**, the **gradients** are **well-behaved at initialization**, so the **warm-up stage can be removed**.

# Outline

1. **Original Post-LN Transformer**
2. **Proposed Pre-LN Transformer**
3. **Experimental Results**

# 1. Original Post-LN Transformer

## 1.1. Transformer Architecture

- (Please skip this part if you know the Transformer well, or feel free to read about the Transformer if interested.)
- **Self-attention sub-layer using Multi-Head Attention**: the multi-head variant of the self-attention sub-layer is popularly used, as it **allows the model to jointly attend to information from different representation sub-spaces**, and is defined as:

- Each Transformer layer also contains a **fully connected network**, which is applied to each position separately and identically.

- Besides the two sub-layers described above, each sub-layer is wrapped with a **residual connection** and **layer normalization**:

- where the scale vector γ and the bias vector β are learnable parameters, and:
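For reference, the standard layer normalization that these parameters belong to can be written as follows, where μ and σ are the mean and standard deviation computed over the *d* features:

```latex
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta,
\qquad
\mu = \frac{1}{d}\sum_{k=1}^{d} x_k,
\qquad
\sigma = \sqrt{\frac{1}{d}\sum_{k=1}^{d}\left(x_k - \mu\right)^2}
```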

Different orders of the sub-layers, residual connection and layer normalization in a Transformer layer lead to variants of Transformer architectures.
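The two orderings can be sketched in code. This is a minimal PyTorch sketch, not the paper's implementation; the sizes (`d_model=64`, `n_heads=4`, `d_ff=256`) are illustrative choices:

```python
import torch
import torch.nn as nn

class PostLNLayer(nn.Module):
    """Original Transformer ordering: sub-layer -> residual add -> LayerNorm."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])   # LN *after* the residual add
        x = self.ln2(x + self.ffn(x))
        return x

class PreLNLayer(nn.Module):
    """Pre-LN ordering: LayerNorm moved inside the residual branch."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]             # LN *before* the sub-layer
        x = x + self.ffn(self.ln2(x))
        return x
```

The only difference is where the `LayerNorm` sits relative to the residual addition; the parameter count is identical.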

## 1.2. The Learning Rate Warm-Up Stage

A learning rate warm-up stage is critical for the Post-LN Transformer.

- The learning rate of the *t*-th iteration is denoted as lr(*t*) and the maximum learning rate during training is denoted as *lr*max. Given a predefined time frame *T*warmup, the learning rate scheduler for the first *T*warmup iterations is lr(*t*) = *t*/*T*warmup · *lr*max.

- After this warm-up stage, the learning rate will be set by classical learning rate schedulers.
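The schedule above can be written as a small helper; the default `lr_max` and `T_warmup` values here are illustrative, not prescriptions from the paper:

```python
def warmup_lr(t: int, lr_max: float = 5e-4, T_warmup: int = 4000) -> float:
    """Linear warm-up: lr(t) = t / T_warmup * lr_max for t <= T_warmup."""
    if t <= T_warmup:
        return t / T_warmup * lr_max
    # After warm-up, a classical scheduler (e.g. inverse-sqrt decay) takes over;
    # returning lr_max here is just a placeholder for that hand-off.
    return lr_max
```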

## 1.3. Experimental Study

- Experiments are conducted to show that this learning rate warm-up stage is essential for training Post-LN Transformer models.

First, we can see that for both optimizers, the learning rate warm-up stage is essential.

**Without the warm-up stage**, the**BLEU**score of the model trained with Adam optimizer can only achieve**8.45**. In contrast, the model trained**using the warm-up stage**can achieve around**34**in terms of**BLEU**score.

Second, we can see that the optimization process is sensitive to the value of *T*warmup.

- For example, when setting *T*warmup = 500, the models learned with Adam achieve **only 31.16 and 2.77** in terms of BLEU score for ***lr*max = 5e−4 and 1e−3**, respectively.

## 1.4. Disadvantages

- Such a warm-up stage has several disadvantages.

First, its configuration significantly affects the final performance.

The practitioners need careful hyper-parameter tuning, which is **computationally expensive** for large-scale NLP tasks.

Second, the warm-up stage could **slow down the optimization**.

# 2. Proposed Pre-LN Transformer

- (Multiple lemmas are stated and discussed in the paper.)

## 2.1. Gradient of Weight Parameter

Intuitively, if the random variable *Z* is (ε, δ)-bounded, then with high probability its realization will not get too far away from its expectation.

- For the **Post-LN Transformer with *L* layers**, the **gradient** of the parameters of the last layer satisfies:
- For the **Pre-LN Transformer with *L* layers**, the gradient is:
- We can see that for the **Post-LN Transformer**, the scale of the gradients to the last FFN layer is of order **O(*d*√(ln *d*))**, which is independent of *L*.
- For the **Pre-LN Transformer**, the scale of the gradients is much smaller: **O(*d*√(ln *d*/*L*))**.

## 2.2. Scale of Hidden State

- The scale of the hidden states in different layers is estimated. Expectations are taken over the input and the randomness of initialization.

If *X* ∈ R^*d* is a Gaussian vector, *X* ∼ N(0, σ²I_d), then:
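The lemma's equation is not reproduced above; the standard result for this setting (stated here as an assumption matching the paper's lemma) is E‖ReLU(*X*)‖² = σ²*d*/2, since by symmetry ReLU keeps half of each coordinate's second moment. A quick Monte-Carlo check:

```python
# Monte-Carlo check of E||ReLU(X)||^2 = sigma^2 * d / 2 for X ~ N(0, sigma^2 I_d).
# The specific sigma, d, and sample count are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
sigma, d, n = 2.0, 256, 100_000

X = rng.normal(0.0, sigma, size=(n, d))
estimate = np.mean(np.sum(np.maximum(X, 0.0) ** 2, axis=1))
expected = sigma ** 2 * d / 2

print(estimate, expected)  # the two values should be close
```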

- At initialization, for the
**Post-LN****Transformer**:

- For the
**Pre-LN****Transformer**:

## 2.3. Advantage

- As shown above, the scale of the expected gradients grows along with the layer index for the Post-LN Transformer. On the contrary, the scale almost keeps the same for different layers in the Pre-LN Transformer.

The main idea is that the layer normalization will normalize the gradients.

- In the Post-LN Transformer, the scale of the inputs to the layer normalization is independent of
*L*, and thus the gradients of parameters in the last layer are independent of*L*.

While in the Pre-LN Transformer, the scale of the input to the final layer normalization is linear in *L*, and thus the gradients of all parameters will be normalized by √*L*.
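This effect can be observed empirically with a toy experiment. The sketch below is a deliberate simplification, not the paper's setup: it uses FFN-only residual stacks in PyTorch instead of full Transformer layers, with illustrative sizes, and measures the gradient norm of the last block's parameters at initialization:

```python
# Toy comparison of last-layer gradient norms at initialization.
# FFN-only residual blocks stand in for Transformer layers; a fixed
# random readout direction defines a scalar loss.
import torch
import torch.nn as nn

def last_layer_grad_norm(pre_ln: bool, L: int, d: int = 64, seed: int = 0) -> float:
    torch.manual_seed(seed)
    blocks = nn.ModuleList([
        nn.ModuleDict({
            "ln": nn.LayerNorm(d),
            "ffn": nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(),
                                 nn.Linear(4 * d, d)),
        })
        for _ in range(L)
    ])
    final_ln = nn.LayerNorm(d)        # the Pre-LN variant ends with a LayerNorm
    x = torch.randn(8, d)
    readout = torch.randn(d)          # fixed random direction for the loss
    for blk in blocks:
        if pre_ln:
            x = x + blk["ffn"](blk["ln"](x))   # LN inside the residual branch
        else:
            x = blk["ln"](x + blk["ffn"](x))   # LN after the residual add
    if pre_ln:
        x = final_ln(x)
    (x * readout).sum().backward()
    # Gradient norm of the last FFN's output projection.
    return blocks[-1]["ffn"][2].weight.grad.norm().item()

for L in (2, 8, 32):
    post = last_layer_grad_norm(False, L)
    pre = last_layer_grad_norm(True, L)
    print(f"L={L:2d}  Post-LN grad: {post:.3f}  Pre-LN grad: {pre:.3f}")
```

As depth grows, the Pre-LN gradient norm should shrink (the final layer normalization divides by an input scale that grows with *L*), while the Post-LN one stays roughly constant, mirroring the analysis in Section 2.1.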

# 3. Experimental Results

## 3.1. Machine Translation

First, the learning rate warm-up stage is not critical anymore for training the Pre-LN Transformer.

Second, the Pre-LN Transformer converges faster than the Post-LN Transformer using the same *lr*max.

- On the IWSLT14 De-En task, the 9-th checkpoint of the Pre-LN Transformer achieves nearly the same performance (validation loss / BLEU score) as the 15-th checkpoint of the Post-LN Transformer.

Third, compared with RAdam, the change of the position of layer normalization “dominates” the change of the optimizer.

## 3.2. Unsupervised Pretraining Using BERT

- Similar to the machine translation tasks,
**the learning rate warm-up stage can be removed**for the Pre-LN model. **(a):**The Pre-LN model can be trained**faster**. The Pre-LN Transformer is**easier to optimize using larger learning rates**.**(b) & (c)**: The Pre-LN model also**converges faster on the downstream tasks**, MRPC and RTE.

Pre-LN Transformer does not rely on the learning rate warm-up stage and can be trained much faster than the Post-LN Transformer.

## Reference

[2020 ICML] [Pre-LN Transformer]

On Layer Normalization in the Transformer Architecture

## Language/Sequence Model

**2007 … 2019** [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2] [DistilBERT] [MT-DNN] [Sparse Transformer] [SuperGLUE] **2020** [ALBERT] [GPT-3] [T5] [Pre-LN Transformer]

## Machine Translation

**2014 … 2018** [Shaw NAACL’18] **2019** [AdaNorm] [GPT-2] [Pre-Norm Transformer] **2020** [Batch Augment, BA] [GPT-3] [T5] [Pre-LN Transformer] **2021** [ResMLP]