Review — Learning Deep Transformer Models for Machine Translation

Pre-Norm Transformer, Layer Normalization First

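  • Two techniques are proposed:
  1. Pre-norm Transformer, in which layer normalization is applied to the input of each sub-layer rather than after the residual addition; and
  2. Dynamic Linear Combination of Layers (DLCL), a way of passing a learned combination of all previous layers to the next layer.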


  1. Pre-Norm Transformer
  2. Dynamic Linear Combination of Layers (DLCL)
  3. Experimental Results

1. Pre-Norm Transformer

(a) Original Post-Norm Transformer (b) Proposed Pre-Norm Transformer

1.1. Original Post-Norm Transformer

  • On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer.
  • The attention model used in Transformer is multi-head attention, and its output is fed into a fully connected feed-forward network.
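  • Because deep stacks of layers are hard to train, the Transformer adopts residual connections and layer normalization:
      $x_{l+1} = f(y_l), \quad y_l = x_l + F(x_l; \theta_l)$
  • where $x_l$ and $x_{l+1}$ are the input and output of the l-th sub-layer, $F(\cdot; \theta_l)$ is the sub-layer (self-attention or feed-forward) with parameters $\theta_l$, and $y_l$ is the intermediate output that is then passed to the post-processing function f().
  • Layer normalization (LN) is adopted to reduce the variance of the sub-layer output, and it is placed after the element-wise residual addition:
      $x_{l+1} = \mathrm{LN}(x_l + F(x_l; \theta_l))$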
  • It can be seen as a post-processing step of the output.

1.2. Proposed Pre-Norm Transformer

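  • In the pre-norm Transformer, layer normalization (LN) is instead applied to the input of each sub-layer:
      $x_{l+1} = x_l + F(\mathrm{LN}(x_l); \theta_l)$
  • This equation regards layer normalization as a part of the sub-layer and does no post-processing of the residual connection, i.e. f(x) = x.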
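The two placements of layer normalization described in Sections 1.1 and 1.2 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_sublayer` is a hypothetical stand-in for the real attention or feed-forward sub-layer, and `layer_norm` omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for F(x; theta), e.g. attention or feed-forward."""
    return [0.5 * v + 1.0 for v in x]

def post_norm_step(x, sublayer=toy_sublayer):
    # Post-norm: x_{l+1} = LN(x_l + F(x_l))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer=toy_sublayer):
    # Pre-norm: x_{l+1} = x_l + F(LN(x_l))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0]
post = post_norm_step(x)  # output is normalized, so the identity path is rescaled
pre = pre_norm_step(x)    # x is carried through unchanged; only F sees LN(x)
```

In the pre-norm step the input `x` reaches the output through a plain addition, which is the property the gradient analysis relies on.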

1.3. Gradients of Post-Norm

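  • A stack of L sub-layers is used as an example. Let E be the loss, which measures how many errors occur in the system prediction, and let $x_L$ be the output of the topmost sub-layer.
  • For the post-norm Transformer, given a sub-layer l, the derivative of E with respect to $x_l$ can be computed by the chain rule:
      $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \times \prod_{k=l}^{L-1} \frac{\partial \mathrm{LN}(y_k)}{\partial y_k} \times \prod_{k=l}^{L-1} \left(1 + \frac{\partial F(x_k; \theta_k)}{\partial x_k}\right)$
  • The error signal from the top is therefore scaled by a product of L - l layer-normalization terms, which can make the gradient vanish or explode as the stack gets deeper.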

1.4. Gradients of Pre-Norm

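  • Likewise, the gradient for pre-norm is:
      $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \times \left(1 + \sum_{k=l}^{L-1} \frac{\partial F(\mathrm{LN}(x_k); \theta_k)}{\partial x_l}\right)$
  • The term 1 means that the error gradient at the top, $\partial E / \partial x_L$, is passed directly to sub-layer l without being scaled, regardless of depth.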
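A toy scalar calculation can make the difference between the two gradient forms concrete. It is not a real backward pass: every per-layer Jacobian is replaced by a single made-up number (0.9 for the layer-normalization terms, 0.0 for the sub-layer terms), and the function names are hypothetical.

```python
def post_norm_scale(ln_factors, sublayer_factors):
    """Multiplicative form: prod(LN terms) * prod(1 + dF/dx terms)."""
    scale = 1.0
    for ln, sub in zip(ln_factors, sublayer_factors):
        scale *= ln * (1.0 + sub)
    return scale

def pre_norm_scale(sublayer_terms):
    """Additive form: 1 + sum of the sub-layer derivative terms."""
    return 1.0 + sum(sublayer_terms)

depth = 30
# Assumed values: each LN term shrinks the signal by 10%, sub-layers contribute 0.
post = post_norm_scale([0.9] * depth, [0.0] * depth)  # equals 0.9 ** 30
pre = pre_norm_scale([0.0] * depth)                   # equals 1.0
```

With these assumed factors the post-norm scale shrinks geometrically with depth, while the pre-norm scale keeps the constant 1 contributed by the identity path.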

2. Dynamic Linear Combination of Layers (DLCL)

Connection weights for 3-layer encoder: (a) residual connection (b) dense residual connection (c) multi-layer representation fusion (d) proposed DLCL
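  • Let {$y_0$, …, $y_l$} be the outputs of layers 0 to l. The input of layer l+1 is defined to be:
      $x_{l+1} = G(y_0, \ldots, y_l)$
  • where G() is a linear function that merges the previously generated values {$y_0$, …, $y_l$} into a new value.
  • For the pre-norm Transformer, G() is defined as:
      $G(y_0, \ldots, y_l) = \sum_{k=0}^{l} W_k^{(l+1)} \, \mathrm{LN}(y_k)$
  • where $W_k^{(l+1)}$ is a learnable scalar that weights each incoming layer.
  • For post-norm, G() is redefined as:
      $G(y_0, \ldots, y_l) = \mathrm{LN}\left(\sum_{k=0}^{l} W_k^{(l+1)} y_k\right)$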
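The merge function G() for both variants can be sketched as follows. This is a minimal illustration, assuming one scalar weight per previous layer output; `dlcl_combine` and the example vectors are hypothetical, the weights are fixed by hand rather than learned, and `layer_norm` omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dlcl_combine(layer_outputs, weights, pre_norm=True):
    """Merge y_0..y_l into the input of layer l+1 with one scalar weight per layer."""
    if len(layer_outputs) != len(weights):
        raise ValueError("need exactly one weight per previous layer output")
    # Pre-norm: normalize each y_k first. Post-norm: normalize the weighted sum.
    terms = [layer_norm(y) for y in layer_outputs] if pre_norm else layer_outputs
    dim = len(layer_outputs[0])
    merged = [sum(w * t[d] for w, t in zip(weights, terms)) for d in range(dim)]
    return merged if pre_norm else layer_norm(merged)

ys = [[1.0, 2.0, 3.0], [0.0, 4.0, 2.0], [5.0, 1.0, 0.0]]
only_last = dlcl_combine(ys, [0.0, 0.0, 1.0], pre_norm=True)
averaged = dlcl_combine(ys, [1 / 3, 1 / 3, 1 / 3], pre_norm=False)
```

Putting all the weight on the last output reduces the pre-norm case to `layer_norm(ys[-1])`, which is a quick sanity check on the implementation.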

3. Experimental Results

3.1. SOTA Comparison

BLEU scores [%] on English-German translation
  • When the encoder depth is increased, e.g. to L=20, the vanilla Transformer fails to train. In contrast, post-norm DLCL trains successfully and achieves its best result at L=25.
  • Pre-norm is easier to optimize than post-norm in deep networks. Beyond that, a 30-layer encoder is successfully trained, resulting in a further improvement of 0.4 BLEU points. This is 0.6 BLEU points higher than the pre-norm Transformer-Big.
Comparison with Bapna et al. (2018) on WMT’16 English-German translation under a 16-layer encoder
  • DLCL outperforms Transparent Attention (TA) by Bapna et al. (2018) in both the post-norm and pre-norm cases.

3.2. Zh-En-Small Task

BLEU scores [%] on NIST’12 Chinese-English translation
  • First, DLCL is superior to the baseline when the network is shallow. Interestingly, both Transformer and DLCL achieve their best results with a 25-layer encoder.

3.3. Zh-En-Large Task

BLEU scores [%] on WMT’18 Chinese-English translation
  • The 25-layer pre-norm DLCL slightly surpasses Transformer-Big, and the margin widens with a 30-layer encoder.

3.4. Effect of Encoder Depth

BLEU scores [%] against the encoder depth for pre-norm Transformer and pre-norm DLCL on English-German and Chinese-English tasks.
  • Remarkably, when the encoder depth reaches 20, both deep models achieve performance comparable to Transformer-Big.
GPU generation speed (target tokens/sec.) against the depth of encoder for pre-norm DLCL on English-German task (batch size = 32, beam size = 4).

3.5. Effect of Decoder Depth

Tokenized BLEU scores [%] and GPU generation speed (target tokens per second) in pre-norm Transformer (Base) on the test set of WMT English-German (batch size = 32, beam size = 4).

3.6. Effect of DLCL

Ablation results by tokenized BLEU [%] on the test set of WMT English-German translation
  • Replacing the learnable weights with constant weights, either All-One ($W_{ij} = 1$) or Average ($W_{ij} = 1/(i+1)$), consistently hurts performance.
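The two constant-weight baselines can be written out explicitly. The sketch below assumes that row i of the weight matrix holds the weights over the layer outputs 0 to i, so the matrix is lower-triangular; that indexing and the helper name `constant_weights` are assumptions made for illustration.

```python
def constant_weights(num_layers, scheme):
    """Row i holds the weights over layer outputs 0..i (lower-triangular)."""
    rows = []
    for i in range(num_layers + 1):
        if scheme == "all_one":
            value = 1.0            # W_ij = 1
        elif scheme == "average":
            value = 1.0 / (i + 1)  # W_ij = 1 / (i + 1)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        rows.append([value] * (i + 1))
    return rows

all_one = constant_weights(3, "all_one")
average = constant_weights(3, "average")
```

Under this indexing each Average row sums to 1, so it is a plain mean over the layers seen so far, whereas the All-One row sums grow with depth.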

3.7. Weight Visualization

A visualization example of learned weights in 30-layer pre-norm DLCL model
  • The connections in the early layers are dense, but become sparse as the depth increases.
  • Most of the large weight values are concentrated on the right of the matrix, which indicates that the impact of an incoming layer is usually related to its distance from the outgoing layer. However, the contribution of a given layer's output to successive layers changes dynamically (reading down one column).


[2019 ACL] [Pre-Norm Transformer]
Learning Deep Transformer Models for Machine Translation

Machine Translation

2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer] [MoE] [GMNMT] [CoVe] 2018 [Shaw NAACL’18] 2019 [AdaNorm] [GPT-2] [Pre-Norm Transformer] 2020 [Batch Augment, BA] [GPT-3] [T5] 2021 [ResMLP]
