Review — Pre-LN Transformer: On Layer Normalization in the Transformer Architecture
Pre-LN Transformer, Warm-Up Stage is Skipped
On Layer Normalization in the Transformer Architecture
Pre-LN Transformer, by Microsoft Research Asia, University of Chinese Academy of Sciences, Peking University, Microsoft Research, and Nankai University
2020 ICML, Over 100 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, Machine Translation, Transformer, Layer Normalization, BERT
- In the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Using a large learning rate on those gradients makes training unstable, and a warm-up stage helps to avoid this.
- In the Pre-LN Transformer, on the other hand, the layer normalization is put inside the residual blocks and the gradients are well-behaved at initialization, so the warm-up stage can be removed.
1. Original Post-LN Transformer
- (Please skip this part if you know the Transformer well, or feel free to read Transformer if interested.)
- Self-attention sub-layer using Multi-Head Attention. The multi-head variant of the self-attention sub-layer is widely used. It allows the model to jointly attend to information from different representation sub-spaces, and is defined as:
- Each Transformer layer also contains a fully connected feed-forward network (FFN), which is applied to each position separately and identically.
- Besides the two sub-layers described above, the residual connection and layer normalization are also used.
- where the scale γ and the bias vector β are learnable parameters, and:
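The placement of layer normalization relative to the residual connection can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: `toy_sublayer` is a hypothetical stand-in for the attention or FFN sub-layer, the small `eps` term is a common numerical-stability assumption, and the Pre-LN ordering from the overview is included only for contrast.

```python
import math


def layer_norm(v, gamma=None, beta=None, eps=1e-5):
    """LayerNorm(v) = gamma * (v - mean) / std + beta, over the features of one position."""
    d = len(v)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mean = sum(v) / d
    var = sum((x - mean) ** 2 for x in v) / d
    std = math.sqrt(var + eps)  # eps is an assumed stability constant
    return [g * (x - mean) / std + b for x, g, b in zip(v, gamma, beta)]


def post_ln_sublayer(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_sublayer(x, sublayer):
    """Pre-LN: normalize the sub-layer input; the residual path is left untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(v):
    """Hypothetical stand-in for attention or FFN."""
    return [2.0 * x for x in v]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_sublayer(x, toy_sublayer)
pre_out = pre_ln_sublayer(x, toy_sublayer)
```

With the default γ = 1 and β = 0, the Post-LN output always has zero mean and unit variance, whereas the Pre-LN output keeps the raw input `x` on the residual path. The paper's Pre-LN variant also applies one more layer normalization before the final prediction, which this sketch omits.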
1.2. The Learning Rate Warm-Up Stage
A learning rate warm-up stage is critical for the Post-LN Transformer.
- The learning rate of the t-th iteration is denoted as lr(t) and the maximum learning rate during training is denoted as lrmax. Given a predefined time frame Twarmup, the learning rate scheduler for the first Twarmup iterations is lr(t) = (t / Twarmup) · lrmax for t ≤ Twarmup, so the rate increases linearly up to lrmax.
- After this warm-up stage, the learning rate will be set by classical learning rate schedulers.
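A linear warm-up of this kind can be sketched as below. The function name `warmup_lr` is hypothetical, and holding the rate at lrmax after the warm-up is only a placeholder assumption: in practice a classical scheduler (for example a decay schedule) takes over at that point.

```python
def warmup_lr(t, lr_max, t_warmup):
    """Linear warm-up: lr(t) = (t / t_warmup) * lr_max for t < t_warmup.

    After the warm-up stage this sketch simply returns lr_max; a real
    training run would hand over to another scheduler here (assumption).
    """
    if t_warmup <= 0 or t >= t_warmup:
        return lr_max
    return lr_max * t / t_warmup


# Illustrative values only: lr_max = 1e-3 and t_warmup = 500.
schedule = [warmup_lr(t, lr_max=1e-3, t_warmup=500) for t in (1, 250, 500, 1000)]
```

Halfway through the warm-up (t = 250 of 500) the rate is half of lrmax, and from t = 500 onward it stays at lrmax in this sketch. Setting `t_warmup=0` disables the warm-up entirely.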
1.3. Experimental Study
- Experiments are conducted to show that this learning rate warm-up stage is essential for training Post-LN Transformer models.
First, we can see that for both optimizers, the learning rate warm-up stage is essential.
- Without the warm-up stage, the model trained with the Adam optimizer reaches a BLEU score of only 8.45. In contrast, the model trained with the warm-up stage achieves a BLEU score of around 34.
Second, we can see that the optimization process is sensitive to the value of Twarmup.
- For example, when setting Twarmup=500, the learned models with Adam achieve only 31.16 and 2.77 in terms of BLEU score for lrmax=5e−4 and 1e−3 respectively.
- Such a warm-up stage has several disadvantages.
First, its configuration significantly affects the final performance. Practitioners need careful hyper-parameter tuning, which is computationally expensive for large-scale NLP tasks.
Second, the warm-up stage could slow down the optimization.
2. Proposed Pre-LN Transformer
- (Multiple lemmas are stated and discussed in the paper.)
2.1. Gradient of Weight Parameter
Intuitively, if the random variable Z is (ε, δ)-bounded, then with a high probability its realization will not get too far away from its expectation.
- For the Post-LN Transformer with L layers, the gradient of the parameters of the last layer satisfies:
- For the Pre-LN Transformer with L layers, the gradient is:
- We can see that for the Post-LN Transformer, the scale of the gradients to the last FFN layer is of order O(d√(ln d)), which is independent of L.
- For the Pre-LN Transformer, the scale of the gradients is much smaller, of order O(d√(ln d/L)), so it shrinks as the number of layers L grows.
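To get a feel for the two orders, the snippet below evaluates d√(ln d) and d√(ln d/L) for a few depths. It is arithmetic on the leading terms only: big-O bounds hide constants, so the absolute values mean little and only the ratio is informative. The function names and the choice d = 512 are illustrative assumptions.

```python
import math


def post_ln_grad_order(d):
    """Leading term of the Post-LN bound: d * sqrt(ln d)."""
    return d * math.sqrt(math.log(d))


def pre_ln_grad_order(d, num_layers):
    """Leading term of the Pre-LN bound: d * sqrt(ln d / L)."""
    return d * math.sqrt(math.log(d) / num_layers)


d = 512  # illustrative hidden size
ratios = {L: post_ln_grad_order(d) / pre_ln_grad_order(d, L) for L in (6, 12, 24)}
```

The ratio of the two terms is exactly √L regardless of d, so under these bounds a 6-layer model differs by a factor of about 2.45 and a 24-layer model by about 4.9.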
2.2. Scale of Hidden State
- The scale of the hidden states in different layers is estimated. Expectations are taken over the input and the randomness of initialization.
If X ∈ R^d is a Gaussian vector, X∼N(0, σ²I_d), then:
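The formula for this statement did not survive here. Assuming it is the paper's lemma on ReLU activations, the claim is E(‖ReLU(X)‖²) = ½σ²d, which follows from the symmetry of the Gaussian: each coordinate is positive half of the time. A quick Monte Carlo check of that identity, with hypothetical names and illustrative values for d and σ:

```python
import random


def mean_sq_norm_relu(d, sigma, n_samples, seed=0):
    """Monte Carlo estimate of E[||ReLU(X)||^2] for X ~ N(0, sigma^2 I_d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += sum(max(0.0, rng.gauss(0.0, sigma)) ** 2 for _ in range(d))
    return total / n_samples


d, sigma = 200, 2.0  # illustrative values
estimate = mean_sq_norm_relu(d, sigma, n_samples=200)
expected = 0.5 * sigma ** 2 * d  # closed form: sigma^2 * d / 2
```

The estimate should land within a few percent of the closed-form value, and the gap narrows as `n_samples` increases.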
- At initialization, for the Post-LN Transformer:
- For the Pre-LN Transformer:
- As shown above, the scale of the expected gradients grows with the layer index for the Post-LN Transformer. In contrast, the scale stays almost the same across layers in the Pre-LN Transformer.
The main idea is that the layer normalization will normalize the gradients.
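- In the Post-LN Transformer, the scale of the inputs to the layer normalization is independent of L, and thus the gradients of parameters in the last layer are independent of L.
- In the Pre-LN Transformer, the scale of the input to the final layer normalization grows linearly with L, and thus the gradients of all parameters are normalized by √L.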
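The difference in the scale of the layer-normalization inputs can be reproduced with a toy model. The sketch below is an assumption-laden simplification, not the paper's setup: each layer's attention and FFN are replaced by a single random linear map with N(0, 1/d) entries, the input is a standard Gaussian vector, and all function names are hypothetical. It records ‖·‖²/d of the vector that the next layer normalization would see.

```python
import math
import random


def layer_norm(v, eps=1e-5):
    """Layer normalization with gamma = 1 and beta = 0."""
    d = len(v)
    mean = sum(v) / d
    var = sum((x - mean) ** 2 for x in v) / d
    std = math.sqrt(var + eps)
    return [(x - mean) / std for x in v]


def random_linear(d, rng):
    """d x d matrix with N(0, 1/d) entries (toy sub-layer, an assumption)."""
    s = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, s) for _ in range(d)] for _ in range(d)]


def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def sq_norm(v):
    return sum(x * x for x in v)


def ln_input_scales(d=256, num_layers=8, seed=0):
    """Return per-layer ||.||^2 / d of the vectors fed to layer normalization."""
    rng = random.Random(seed)
    x_post = x_pre = [rng.gauss(0.0, 1.0) for _ in range(d)]
    post_scales, pre_scales = [], []
    for _ in range(num_layers):
        w = random_linear(d, rng)
        # Post-LN: the residual sum is normalized at every layer.
        summed = [a + b for a, b in zip(x_post, matvec(w, x_post))]
        post_scales.append(sq_norm(summed) / d)
        x_post = layer_norm(summed)
        # Pre-LN: only the sub-layer input is normalized; the residual accumulates.
        x_pre = [a + b for a, b in zip(x_pre, matvec(w, layer_norm(x_pre)))]
        pre_scales.append(sq_norm(x_pre) / d)
    return post_scales, pre_scales


post_scales, pre_scales = ln_input_scales()
```

In this toy setting the Post-LN values hover around a constant at every depth, while the Pre-LN values grow roughly by one per layer, which matches the intuition that the final layer normalization in a Pre-LN model sees an input whose scale increases with L.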
3. Experimental Results
3.1. Machine Translation
First, the learning rate warm-up stage is no longer critical for training the Pre-LN Transformer.
Second, the Pre-LN Transformer converges faster than the Post-LN Transformer.
- On the IWSLT14 De-En task, the 9-th checkpoint of the Pre-LN Transformer achieves nearly the same performance (validation loss/BLEU score) as the 15-th checkpoint of the Post-LN Transformer.
Third, compared with RAdam, the change of the position of layer normalization “dominates” the change of the optimizer.
3.2. Unsupervised Pretraining Using BERT
- Similar to the machine translation tasks, the learning rate warm-up stage can be removed for the Pre-LN model.
- (a): The Pre-LN model can be trained faster. The Pre-LN Transformer is easier to optimize using larger learning rates.
- (b) & (c): The Pre-LN model also converges faster on the downstream tasks, MRPC and RTE.
[2020 ICML] [Pre-LN Transformer]
On Layer Normalization in the Transformer Architecture