Review — Pre-LN Transformer: On Layer Normalization in the Transformer Architecture

Pre-LN Transformer, Warm-Up Stage is Skipped

On Layer Normalization in the Transformer Architecture
Pre-LN Transformer, by Microsoft Research Asia, University of Chinese Academy of Sciences, Peking University, Microsoft Research, and Nankai University
2020 ICML, Over 100 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, Machine Translation, Transformer, Layer Normalization, BERT

  • In the originally designed Post-LN Transformer, which places layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large at initialization. Using a large learning rate on those gradients makes training unstable, and a warm-up stage helps to avoid this.
  • In the Pre-LN Transformer, by contrast, layer normalization is placed inside the residual blocks and the gradients are well-behaved at initialization, so the warm-up stage can be removed.

Outline

  1. Original Post-LN Transformer
  2. Proposed Pre-LN Transformer
  3. Experimental Results

1. Original Post-LN Transformer

Post-LN Transformer Layer
  • (Please skip this part if you know Transformer well, or feel free to read Transformer if interested.)
  • Self-attention sub-layer using Multi-Head Attention. The multi-head variant of the self-attention sub-layer is widely used because it allows the model to jointly attend to information from different representation sub-spaces. It is defined as Multi-head(Q, K, V) = Concat(head_1, …, head_H) W^O, where head_k = Attention(Q W_k^Q, K W_k^K, V W_k^V).
  • Each Transformer layer also contains a fully connected network, which is applied to each position separately and identically.
  • Besides the two sub-layers described above, the residual connection and layer normalization are also used: LayerNorm(v) = γ (v − μ)/σ + β,
  • where the scale γ and the bias vector β are parameters, and μ = (1/d) Σ_k v_k, σ² = (1/d) Σ_k (v_k − μ)².
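As a minimal sketch of the layer normalization step, assuming the vector v is a plain Python list: subtract the mean, divide by the standard deviation, then apply the scale γ and bias β. The function name layer_norm and the small eps stabilizer are my own illustrative choices, not something taken from the paper.

```python
import math

def layer_norm(v, gamma=None, beta=None, eps=1e-5):
    """LayerNorm(v) = gamma * (v - mu) / sigma + beta, element-wise."""
    d = len(v)
    # Default to the identity scale and zero bias (assumed initial values).
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mu = sum(v) / d                              # mean over the d elements
    var = sum((x - mu) ** 2 for x in v) / d      # (biased) variance
    sigma = math.sqrt(var + eps)                 # eps avoids division by zero
    return [g * (x - mu) / sigma + b for x, g, b in zip(v, gamma, beta)]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(x, 3) for x in normalized])
```

With the default γ and β, the output has (approximately) zero mean and unit variance regardless of the scale of the input.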

Different orders of the sub-layers, residual connection and layer normalization in a Transformer layer lead to variants of Transformer architectures.

1.2. The Learning Rate Warm-Up Stage

A learning rate warm-up stage is critical for the Post-LN Transformer.

  • The learning rate of the t-th iteration is denoted as lr(t) and the maximum learning rate during training is denoted as lrmax. Given a predefined time frame Twarmup, the learning rate scheduler for the first Twarmup iterations is lr(t) = (t / Twarmup) · lrmax, for t ≤ Twarmup.
  • After this warm-up stage, the learning rate is set by classical learning rate schedulers.
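The warm-up stage can be sketched in a few lines, assuming the linear ramp lr(t) = (t / Twarmup) · lrmax. The function name warmup_lr is hypothetical, and holding the rate at lrmax after the warm-up is only a placeholder for whatever classical scheduler takes over.

```python
def warmup_lr(t, lr_max, t_warmup):
    """Linear warm-up: lr(t) = (t / T_warmup) * lr_max for t <= T_warmup."""
    if t_warmup <= 0:
        return lr_max  # no warm-up stage at all
    # After warm-up we simply hold lr_max (assumption; a decay schedule
    # would normally be plugged in here).
    return lr_max * min(t, t_warmup) / t_warmup

# Values taken from the experiments in this review: lrmax = 5e-4, Twarmup = 500.
for step in (0, 125, 250, 500, 1000):
    print(step, warmup_lr(step, lr_max=5e-4, t_warmup=500))
```

The rate starts at zero, reaches half of lrmax midway through the warm-up, and stays at lrmax once t exceeds Twarmup.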

1.3. Experimental Study

  • Experiments are conducted to show that this learning rate warm-up stage is essential for training Post-LN Transformer models.
Performances of the models optimized by Adam and SGD on the IWSLT14 De-En task

First, we can see that for both optimizers, the learning rate warm-up stage is essential.

  • Without the warm-up stage, the model trained with the Adam optimizer achieves a BLEU score of only 8.45. In contrast, the model trained with the warm-up stage achieves a BLEU score of around 34.

Second, we can see that the optimization process is sensitive to the value of Twarmup.

  • For example, when setting Twarmup=500, the models trained with Adam achieve only 31.16 and 2.77 in terms of BLEU score for lrmax=5e−4 and 1e−3, respectively.

1.4. Disadvantages

  • Such a warm-up stage has several disadvantages.

First, its configuration significantly affects the final performance. Practitioners need careful hyper-parameter tuning, which is computationally expensive for large-scale NLP tasks.

Second, the warm-up stage could slow down the optimization.

2. Proposed Pre-LN Transformer

Proposed Pre-LN Transformer Layer
  • (Multiple lemmas are stated and discussed in the paper.)
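The difference between the two layouts comes down to where the normalization sits relative to the residual connection. Below is a minimal sketch, assuming a generic sublayer function standing in for either self-attention or the feed-forward network; toy_sublayer and the helper names are hypothetical and only serve the illustration.

```python
import math

def layer_norm(v, eps=1e-5):
    """Layer normalization without learned scale/bias, for brevity."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def post_ln_sublayer(x, sublayer):
    """Post-LN: normalize AFTER the residual addition, LN(x + F(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_sublayer(x, sublayer):
    """Pre-LN: normalize INSIDE the residual branch, x + F(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(v):
    """Hypothetical stand-in for attention or the FFN."""
    return [0.5 * x for x in v]

x = [1.0, -2.0, 3.0, 0.5]
post = post_ln_sublayer(x, toy_sublayer)
pre = pre_ln_sublayer(x, toy_sublayer)
print("post-LN:", [round(v, 3) for v in post])
print("pre-LN: ", [round(v, 3) for v in pre])
```

In the Pre-LN form the input x passes through the residual path untouched, so the identity branch is never rescaled. In the Post-LN form every sub-layer output is renormalized, including the residual path.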

2.1. Gradient of Weight Parameter

Intuitively, if the random variable Z is (ε, δ)-bounded, then with a high probability its realization will not get too far away from its expectation.

  • For the Post-LN Transformer with L layers, the norm of the gradient of the parameters of the last FFN layer is bounded by O(d√(ln d)), where d is the hidden dimension. This scale is independent of L.
  • For the Pre-LN Transformer with L layers, the corresponding bound is O(d√((ln d)/L)), which is smaller by a factor of √L.
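To get a feel for the two orders of magnitude, here is a small arithmetic sketch. It ignores the constants hidden in the big-O notation, so only the dependence on L is meaningful; the function names and the choice d = 512 are my own assumptions for illustration.

```python
import math

def post_ln_grad_scale(d):
    """Order of the last-layer gradient for Post-LN: d * sqrt(ln d)."""
    return d * math.sqrt(math.log(d))

def pre_ln_grad_scale(d, num_layers):
    """Order of the last-layer gradient for Pre-LN: d * sqrt(ln d / L)."""
    return d * math.sqrt(math.log(d) / num_layers)

d = 512  # illustrative hidden dimension
for num_layers in (6, 12, 24):
    ratio = post_ln_grad_scale(d) / pre_ln_grad_scale(d, num_layers)
    # The ratio of the two scales is sqrt(L), whatever d is.
    print(num_layers, round(ratio, 3), round(math.sqrt(num_layers), 3))
```

The deeper the model, the larger the gap between the two scales, which is consistent with Post-LN needing a cautious (warmed-up) learning rate.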

2.2. Scale of Hidden State

  • The scale of the hidden states in different layers is estimated. Expectations are taken over the input and the randomness of initialization.

If X ∈ R^d is a Gaussian vector, X ∼ N(0, σ²I_d), then E[‖ReLU(X)‖²] = ½σ²d.
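The Gaussian/ReLU identity used here, which as far as I can tell is the paper's lemma E[‖ReLU(X)‖²] = ½σ²d, is easy to check numerically. The following is a Monte Carlo sketch under that assumption; the function name, the sample sizes and the values d = 64, σ = 2 are arbitrary illustrative choices.

```python
import random

def mean_sq_relu_norm(d, sigma, trials, seed=0):
    """Monte Carlo estimate of E[||ReLU(X)||^2] for X ~ N(0, sigma^2 I_d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # ReLU zeroes the negative half of each coordinate.
        total += sum(max(0.0, rng.gauss(0.0, sigma)) ** 2 for _ in range(d))
    return total / trials

d, sigma = 64, 2.0
estimate = mean_sq_relu_norm(d, sigma, trials=2000)
expected = 0.5 * sigma ** 2 * d  # closed form: half of E[||X||^2]
print(f"estimate = {estimate:.2f}, closed form = {expected:.2f}")
```

The intuition is that each coordinate is positive half of the time, so ReLU keeps half of the expected squared norm σ²d.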

2.3. Advantage

Gradient Expectation (the norm of gradients)
  • As shown above, the scale of the expected gradients grows with the layer index for the Post-LN Transformer. In contrast, the scale stays almost the same across layers in the Pre-LN Transformer.

The main idea is that the layer normalization will normalize the gradients.

  • In the Post-LN Transformer, the scale of the inputs to the layer normalization is independent of L, and thus the gradients of parameters in the last layer are independent of L.

In the Pre-LN Transformer, by contrast, the scale of the input to the final layer normalization grows with L (its expected squared norm is linear in L), and thus the gradients of all parameters are normalized by √L.
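The growth of the hidden-state scale can be illustrated with a toy simulation. This is a heavily simplified sketch, not the paper's setting: each sub-layer is replaced by a fresh random linear map with N(0, 1/d) entries (an assumption that keeps the expected squared norm unchanged), and all function names are hypothetical.

```python
import math
import random

def layer_norm(v, eps=1e-5):
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def random_linear(v, rng):
    """Toy sub-layer: a fresh random d x d map with N(0, 1/d) entries."""
    d = len(v)
    std = 1.0 / math.sqrt(d)
    return [sum(rng.gauss(0.0, std) * x for x in v) for _ in range(d)]

def final_sq_norm(d, num_layers, pre_ln, rng):
    """Squared norm of the hidden state after num_layers toy sub-layers."""
    x = layer_norm([rng.gauss(0.0, 1.0) for _ in range(d)])
    for _ in range(num_layers):
        if pre_ln:   # x + F(LN(x)): the residual stream keeps accumulating
            x = [a + b for a, b in zip(x, random_linear(layer_norm(x), rng))]
        else:        # LN(x + F(x)): the stream is renormalized every time
            x = layer_norm([a + b for a, b in zip(x, random_linear(x, rng))])
    return sum(a * a for a in x)

def mean_sq_norm(d, num_layers, pre_ln, trials=8, seed=0):
    rng = random.Random(seed)
    return sum(final_sq_norm(d, num_layers, pre_ln, rng)
               for _ in range(trials)) / trials

d = 64
for L in (4, 8, 16):
    post = mean_sq_norm(d, L, pre_ln=False)
    pre = mean_sq_norm(d, L, pre_ln=True)
    print(f"L={L:2d}  post-LN ||x||^2/d = {post / d:.2f}  pre-LN ||x||^2/d = {pre / d:.2f}")
```

In this toy model the Post-LN hidden state stays at a squared norm of about d for any depth, while the Pre-LN hidden state grows roughly linearly with the number of sub-layers. This is why the final layer normalization in Pre-LN divides by a factor on the order of √L.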

3. Experimental Results

3.1. Machine Translation

Performances of the models on the IWSLT14 De-En task and WMT14 En-De task

First, the learning rate warm-up stage is no longer critical for training the Pre-LN Transformer.

Second, the Pre-LN Transformer converges faster than the Post-LN Transformer using the same lrmax.

  • On the IWSLT14 De-En task, the 9-th checkpoint of the Pre-LN Transformer achieves nearly the same performance (validation loss/BLEU score) as the 15-th checkpoint of the Post-LN Transformer.

Third, the change in the position of layer normalization “dominates” the change of optimizer from Adam to RAdam: where the normalization sits matters more than which of the two optimizers is used.

3.2. Unsupervised Pretraining Using BERT

Performances of the models on unsupervised pre-training (BERT) and downstream tasks
  • Similar to the machine translation tasks, the learning rate warm-up stage can be removed for the Pre-LN model.
  • (a): The Pre-LN model can be trained faster. The Pre-LN Transformer is easier to optimize using larger learning rates.
  • (b) & (c): The Pre-LN model also converges faster on the downstream tasks, MRPC and RTE.

Pre-LN Transformer does not rely on the learning rate warm-up stage and can be trained much faster than the Post-LN Transformer.

Reference

[2020 ICML] [Pre-LN Transformer]
On Layer Normalization in the Transformer Architecture

