Review — Pre-LN Transformer: On Layer Normalization in the Transformer Architecture

With the Pre-LN Transformer, the Warm-Up Stage Can Be Skipped

  • In the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large at initialization. Using a large learning rate on those gradients makes the training unstable, and the warm-up stage helps to avoid this.
  • On the other hand, if the Pre-LN Transformer is used, where the layer normalization is put inside the residual blocks, the gradients are well-behaved at initialization, and the warm-up stage can be removed.

Outline

  1. Original Post-LN Transformer
  2. Proposed Pre-LN Transformer
  3. Experimental Results

1. Original Post-LN Transformer

Post-LN Transformer Layer
  • (Please skip this part if you already know the Transformer well, or feel free to read the Transformer paper if interested.)
  • Self-attention sub-layer using Multi-Head Attention: the multi-head variant of the self-attention sub-layer is popularly used since it allows the model to jointly attend to information from different representation sub-spaces, and is defined as Multi-Head(Q, K, V) = Concat(head1, …, headH)W^O, where headh = Attention(QWh^Q, KWh^K, VWh^V) and Attention(Q, K, V) = softmax(QK^T/√dk)V.
  • Each Transformer layer also contains a position-wise fully connected feed-forward network, which is applied to each position separately and identically: FFN(x) = max(0, xW1 + b1)W2 + b2.
  • Besides the two sub-layers described above, a residual connection and layer normalization are applied around each of them: the Post-LN layer computes LayerNorm(x + SubLayer(x)), as sketched in the code after this list.
  • Here LayerNorm(x) = γ ⊙ (x − μ)/σ + β, where the scale vector γ and bias vector β are learnable parameters, and μ and σ are the mean and standard deviation of the entries of x.
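
To make the layer structure concrete, here is a minimal PyTorch sketch of a Post-LN Transformer layer. It is my own illustration rather than the authors' code; the class name, dimensions, and dropout rate are assumptions chosen for readability.

import torch
import torch.nn as nn

class PostLNTransformerLayer(nn.Module):
    """Post-LN order: x -> x + SubLayer(x) -> LayerNorm (normalization after the residual add)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)  # holds the scale γ and bias β
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer, residual connection, then LayerNorm.
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise FFN sub-layer, residual connection, then LayerNorm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 16, 512)          # (batch, sequence length, d_model)
y = PostLNTransformerLayer()(x)      # output has the same shape as x

The key point is the order of operations: the layer normalization sits outside the residual addition, i.e., between consecutive residual blocks.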

1.2. The Learning Rate Warm-Up Stage

  • The learning rate of the t-th iteration is denoted as lr(t), and the maximum learning rate during training is denoted as lrmax. Given a predefined time frame Twarmup, the learning rate scheduler for the first Twarmup iterations is lr(t) = (t / Twarmup) × lrmax, i.e., the learning rate increases linearly from 0 to lrmax.
  • After this warm-up stage, the learning rate is set by a classical learning rate scheduler (see the sketch below).
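
As a concrete illustration of the schedule, the sketch below implements the linear warm-up rule above, followed by an inverse-square-root decay; the decay choice and the default values of lr_max and t_warmup are illustrative assumptions, not prescriptions from the paper.

def warmup_lr(t: int, lr_max: float = 1e-3, t_warmup: int = 4000) -> float:
    """Learning rate at iteration t (t >= 1).

    Linear warm-up: lr(t) = (t / t_warmup) * lr_max for the first t_warmup iterations,
    after which a classical schedule takes over (inverse-square-root decay as an example).
    """
    if t <= t_warmup:
        return lr_max * t / t_warmup
    return lr_max * (t_warmup / t) ** 0.5

# The learning rate rises linearly to lr_max at t = t_warmup, then decays.
print(warmup_lr(1), warmup_lr(4000), warmup_lr(16000))   # 2.5e-07 0.001 0.0005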

1.3. Experimental Study

  • Experiments are conducted to show that this learning rate warm-up stage is essential for training Post-LN Transformer models.
Performances of the models optimized by Adam and SGD on the IWSLT14 De-En task
  • Without the warm-up stage, the model trained with the Adam optimizer achieves a BLEU score of only 8.45. In contrast, the model trained with the warm-up stage achieves a BLEU score of around 34.
  • The choice of Twarmup also matters: for example, when setting Twarmup=500, the models trained with Adam achieve only 31.16 and 2.77 BLEU for lrmax=5e−4 and 1e−3, respectively.

1.4. Disadvantages

  • Such a warm-up stage has several disadvantages: it slows down optimization in the early stage, and it introduces extra hyper-parameters (Twarmup and lrmax) whose values significantly affect the final performance and therefore require careful tuning.

2. Proposed Pre-LN Transformer

Proposed Pre-LN Transformer Layer
  • (Multiple lemmas are stated and discussed in the paper; a minimal code sketch of the layer is given below.)
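
For comparison with the Post-LN sketch in Section 1, here is a minimal PyTorch sketch of the Pre-LN layer under the same illustrative assumptions (not the authors' implementation): the layer normalization moves inside the residual branch, i.e., it is applied before each sub-layer, and an additional final LayerNorm is applied after the last layer.

import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Pre-LN order: x -> x + SubLayer(LayerNorm(x)) (normalization inside the residual branch)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm first, then self-attention, then the residual addition.
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)
        # LayerNorm first, then FFN, then the residual addition.
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x

# A final LayerNorm is applied after the last Pre-LN layer.
final_norm = nn.LayerNorm(512)
y = final_norm(PreLNTransformerLayer()(torch.randn(2, 16, 512)))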

2.1. Gradient of Weight Parameter

  • For the Post-LN Transformer with L layers, the gradient of the parameters of the last layer (the weights of the last FFN sub-layer) satisfies ‖∂Loss/∂W‖F ≤ O(d√(ln d)).
  • For the Pre-LN Transformer with L layers, the corresponding gradient satisfies ‖∂Loss/∂W‖F ≤ O(d√(ln d/L)).
  • We can see that for the Post-LN Transformer, the scale of the gradients to the last FFN layer is of order O(d√(ln d)), which is independent of L.
  • For the Pre-LN Transformer, the scale of the gradients is much smaller, of order O(d√(ln d/L)), which decays as the number of layers L grows (an empirical check is sketched below).
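
These bounds can be checked empirically at initialization. The following sketch is my own illustration, not code from the paper: it stacks PyTorch's built-in encoder layers with norm_first=False (Post-LN) and norm_first=True (Pre-LN), back-propagates a simple surrogate loss, and compares the gradient norm of the last layer's second FFN weight matrix; by the bounds above, the Post-LN value should be roughly √L times larger.

import torch
import torch.nn as nn

def last_layer_grad_norm(norm_first: bool, n_layers: int = 12, d_model: int = 256) -> float:
    """Gradient norm of the last layer's second FFN weight matrix at initialization."""
    torch.manual_seed(0)
    layers = nn.ModuleList(
        nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=1024,
                                   norm_first=norm_first, batch_first=True)
        for _ in range(n_layers)
    )
    x = torch.randn(8, 32, d_model)   # (batch, sequence length, d_model)
    for layer in layers:
        x = layer(x)
    x.pow(2).mean().backward()        # simple surrogate loss
    return layers[-1].linear2.weight.grad.norm().item()

# The theory predicts the Post-LN value is roughly sqrt(n_layers) times the Pre-LN value.
print("Post-LN:", last_layer_grad_norm(norm_first=False))
print("Pre-LN :", last_layer_grad_norm(norm_first=True))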

2.2. Scale of Hidden State

  • The scale of the hidden states in different layers is also estimated, where the expectations are taken over the input and the randomness of the initialization. For the Pre-LN Transformer, the expected squared norm of the hidden state grows with the layer index, since the residual stream is not re-normalized after each sub-layer; this is what allows the final LayerNorm to shrink the gradients by a factor depending on L (see the sketch below).
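
A minimal sketch of this estimate (my own illustration, with model sizes chosen arbitrarily): it prints the mean squared norm of the hidden state after each Pre-LN layer at initialization; because the residual stream is not re-normalized between layers, this quantity grows with the layer index.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers = 256, 12
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=1024,
                               norm_first=True, batch_first=True)   # Pre-LN variant
    for _ in range(n_layers)
)

x = torch.randn(8, 32, d_model)
with torch.no_grad():
    for l, layer in enumerate(layers, start=1):
        x = layer(x)
        # Mean squared norm of the hidden state after layer l; for Pre-LN it grows with l.
        print(l, round(x.pow(2).sum(dim=-1).mean().item(), 1))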

2.3. Advantage

Gradient Expectation (the norm of the gradients of the parameters at each layer)
  • As shown above, the scale of the expected gradients grows along with the layer index for the Post-LN Transformer. On the contrary, the scale stays almost the same across layers in the Pre-LN Transformer.
  • In the Post-LN Transformer, the scale of the inputs to the layer normalization is independent of L, and thus the gradients of parameters in the last layer are independent of L.

3. Experimental Results

3.1. Machine Translation

Performances of the models on the IWSLT14 De-En task and WMT14 En-De task
  • On the IWSLT14 De-En task, the 9-th checkpoint of the Pre-LN Transformer achieves nearly the same performance (validation loss / BLEU score) as the 15-th checkpoint of the Post-LN Transformer.

3.2. Unsupervised Pretraining Using BERT

Performances of the models on unsupervised pre-training (BERT) and downstream tasks
  • Similar to the machine translation tasks, the learning rate warm-up stage can be removed for the Pre-LN model.
  • (a): The Pre-LN model can be trained faster. The Pre-LN Transformer is easier to optimize using larger learning rates.
  • (b) & (c): The Pre-LN model also converges faster on the downstream tasks, MRPC and RTE.

Reference

[2020 ICML] [Pre-LN Transformer]
On Layer Normalization in the Transformer Architecture

Language/Sequence Model

2007 … 2019 [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2] [DistilBERT] [MT-DNN] [Sparse Transformer] [SuperGLUE] 2020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer]

Machine Translation

2014 … 2018 [Shaw NAACL’18] 2019 [AdaNorm] [GPT-2] [Pre-Norm Transformer] 2020 [Batch Augment, BA] [GPT-3] [T5] [Pre-LN Transformer] 2021 [ResMLP]

