# Review — Learning Deep Transformer Models for Machine Translation

## Pre-Norm Transformer, Layer Normalization First

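• The paper proposes two techniques for training deep Transformers:
1. Pre-norm Transformer: layer normalization is applied to the input of each sub-layer instead of after the residual addition.
2. Dynamic linear combination of layers (DLCL): a learned combination of all previous layers' outputs is passed to the next layer.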

# Outline

1. Pre-Norm Transformer
2. Dynamic Linear Combination of Layers (DLCL)
3. Experimental Results

# 1. Pre-Norm Transformer

## 1.1. Original Post-Norm Transformer

• On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer.
• The attention model used in Transformer is multi-head attention, and its output is fed into a fully connected feed-forward network.
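• Because deep stacks of layers are hard to train, the Transformer adopts residual connections and layer normalization:

$$x_{l+1} = f(y_l), \qquad y_l = x_l + F(x_l; \theta_l)$$

• where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th sub-layer, $F(\cdot; \theta_l)$ is the sub-layer function (self-attention or feed-forward) with parameters $\theta_l$, and $y_l$ is the intermediate output, which is then passed through the post-processing function $f(\cdot)$.
• Layer normalization (LN) is adopted to reduce the variance of the sub-layer output, and it is placed after the element-wise residual addition:

$$x_{l+1} = \mathrm{LN}(x_l + F(x_l; \theta_l))$$

• It can be seen as a post-processing step on the output, i.e. $f(x) = \mathrm{LN}(x)$.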

## 1.2. Proposed Pre-Norm Transformer

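• In the pre-norm Transformer, layer normalization (LN) is instead applied to the input of each sub-layer function $F(\cdot; \theta_l)$:

$$x_{l+1} = x_l + F(\mathrm{LN}(x_l); \theta_l)$$

• This equation regards layer normalization as part of the sub-layer and applies no post-processing to the residual connection, i.e. $f(x) = x$.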
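The difference between the two placements can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_sublayer` is a hypothetical stand-in for the real self-attention or feed-forward sub-layer, and `layer_norm` omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for F(x; theta): element-wise tanh scaled by 0.5."""
    return [0.5 * math.tanh(v) for v in x]

def post_norm_step(x, sublayer=toy_sublayer):
    # Post-norm: x_{l+1} = LN(x_l + F(x_l)); the residual sum itself is normalized.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer=toy_sublayer):
    # Pre-norm: x_{l+1} = x_l + F(LN(x_l)); the residual path is left untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 8.0]
post = post_norm_step(x)  # re-centred and re-scaled output
pre = pre_norm_step(x)    # x plus a bounded update
```

In the post-norm step the input `x` only reaches the output through `layer_norm`, while in the pre-norm step it is carried forward unchanged and the sub-layer only adds an update to it.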

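• A stack of $L$ sub-layers is used as an example. Let $E$ be the loss that measures the errors in the system's prediction, and $x_L$ be the output of the topmost sub-layer.
• For the post-norm Transformer, given a sub-layer $l$, the derivative of $E$ with respect to $x_l$ follows from the chain rule:

$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \times \prod_{k=l}^{L-1} \frac{\partial \mathrm{LN}(y_k)}{\partial y_k} \times \prod_{k=l}^{L-1} \left(1 + \frac{\partial F(x_k; \theta_k)}{\partial x_k}\right)$$

• Likewise, the gradient for pre-norm is:

$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \times \left(1 + \sum_{k=l}^{L-1} \frac{\partial F(\mathrm{LN}(x_k); \theta_k)}{\partial x_l}\right)$$

• The pre-norm gradient contains the term $\partial E / \partial x_L$ on its own, so the error signal from the top reaches sub-layer $l$ directly regardless of depth, whereas the post-norm gradient passes through a product of $L - l$ factors and can vanish or explode as the stack deepens.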
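As a sketch of where the two gradient forms come from, the two recursions can be unrolled from sub-layer $l$ to the top. The notation follows the paper: $F(\cdot; \theta_k)$ is the $k$-th sub-layer function, $\mathrm{LN}$ is layer normalization, and $y_k = x_k + F(x_k; \theta_k)$. The factors are written as if they commute, which is a simplification for vector-valued $x_k$.

```latex
% Pre-norm: each sub-layer adds to its input, so the recursion telescopes.
\begin{align*}
x_{l+1} &= x_l + F(\mathrm{LN}(x_l); \theta_l) \\
x_L &= x_l + \sum_{k=l}^{L-1} F(\mathrm{LN}(x_k); \theta_k) \\
\frac{\partial x_L}{\partial x_l} &= 1 + \sum_{k=l}^{L-1}
    \frac{\partial F(\mathrm{LN}(x_k); \theta_k)}{\partial x_l}
\end{align*}

% Post-norm: each sub-layer wraps its output in LN, giving one factor per sub-layer.
\begin{align*}
x_{k+1} &= \mathrm{LN}(y_k), \qquad y_k = x_k + F(x_k; \theta_k) \\
\frac{\partial x_{k+1}}{\partial x_k} &= \frac{\partial \mathrm{LN}(y_k)}{\partial y_k}
    \left(1 + \frac{\partial F(x_k; \theta_k)}{\partial x_k}\right) \\
\frac{\partial x_L}{\partial x_l} &= \prod_{k=l}^{L-1} \frac{\partial \mathrm{LN}(y_k)}{\partial y_k}
    \left(1 + \frac{\partial F(x_k; \theta_k)}{\partial x_k}\right)
\end{align*}
```

Multiplying either result by $\partial E / \partial x_L$ gives the gradient of the loss with respect to $x_l$. The pre-norm case is a sum with a constant 1, while the post-norm case is a product whose length grows with the number of sub-layers above $l$.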

# 2. Dynamic Linear Combination of Layers (DLCL)

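• Let $\{y_0, \dots, y_l\}$ be the outputs of layers $0$ to $l$. The input of layer $l+1$ is defined as:

$$x_{l+1} = G(y_0, \dots, y_l)$$

• where $G(\cdot)$ is a linear function that merges the previously generated values $\{y_0, \dots, y_l\}$ into a new value.
• For the pre-norm Transformer, $G(\cdot)$ is defined as:

$$G(y_0, \dots, y_l) = \sum_{k=0}^{l} W_k^{(l+1)} \, \mathrm{LN}(y_k)$$

• where $W_k^{(l+1)}$ is a learnable scalar that weights the incoming layer $k$.
• For post-norm, $G(\cdot)$ can be redefined as:

$$G(y_0, \dots, y_l) = \mathrm{LN}\left(\sum_{k=0}^{l} W_k^{(l+1)} \, y_k\right)$$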
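The two definitions of G() can be sketched as follows. This is an illustrative toy under stated assumptions: the weights are fixed numbers here rather than learned parameters, `dlcl_combine` is a hypothetical helper name, and `layer_norm` omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dlcl_combine(outputs, weights, pre_norm=True):
    """Merge previous layer outputs y_0..y_l into the next layer's input.

    pre_norm=True : sum_k w_k * LN(y_k)
    pre_norm=False: LN(sum_k w_k * y_k)
    """
    if len(outputs) != len(weights):
        raise ValueError("need exactly one scalar weight per previous output")
    terms = [layer_norm(y) for y in outputs] if pre_norm else outputs
    dim = len(outputs[0])
    mixed = [sum(w * t[i] for w, t in zip(weights, terms)) for i in range(dim)]
    return mixed if pre_norm else layer_norm(mixed)

# Toy outputs y_0, y_1, y_2 of three layers, each a 3-dimensional vector.
ys = [[1.0, 2.0, 3.0], [0.0, 4.0, 2.0], [5.0, 1.0, 1.0]]
avg = [1.0 / len(ys)] * len(ys)  # illustrative weights: a plain average

x_next_pre = dlcl_combine(ys, avg, pre_norm=True)
x_next_post = dlcl_combine(ys, avg, pre_norm=False)
last_only = dlcl_combine(ys, [0.0, 0.0, 1.0], pre_norm=True)
```

With weights `[0, 0, 1]` the combination uses only the most recent output, which is the ordinary layer-to-layer wiring. Non-zero weights on earlier outputs give the next layer direct access to shallower representations.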

# 3. Experimental Results

## 3.1. SOTA Comparison

• When the encoder depth is increased, e.g. L=20, the vanilla Transformer fails to train. In contrast, post-norm DLCL solves this issue and achieves the best result when L=25.
• Pre-norm is easier to optimize than post-norm in deep networks. Beyond that, a 30-layer encoder is trained successfully, giving a further improvement of 0.4 BLEU points. This is 0.6 BLEU points higher than the pre-norm Transformer-Big.
• DLCL outperforms Transparent Attention (TA) by Bapna et al. (2018) in both the post-norm and pre-norm cases.

• First, DLCL outperforms the baseline when the network is shallow. Interestingly, both Transformer and DLCL achieve their best results with a 25-layer encoder.

• The 25-layer pre-norm DLCL slightly surpasses Transformer-Big, and the margin grows with a 30-layer encoder.

## 3.4. Effect of Encoder Depth

• Remarkably, when the encoder depth reaches 20, both deep models achieve performance comparable to Transformer-Big.

## 3.6. Effect of DLCL

• Replacing the learnable weights with constant weights, either All-One ($W_{ij} = 1$) or Average ($W_{ij} = 1/(i+1)$), consistently hurts performance.
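To make the two constant schemes concrete, the sketch below builds the corresponding weight rows. It assumes that row i holds the weights over the incoming outputs j = 0..i, which is one reading of the Wij notation above; `constant_weights` is a hypothetical helper, not code from the paper.

```python
def constant_weights(num_rows, scheme):
    """Build fixed combination weights; row i has i + 1 entries (j = 0..i)."""
    if scheme not in ("all_one", "average"):
        raise ValueError("scheme must be 'all_one' or 'average'")
    rows = []
    for i in range(num_rows):
        # All-One: every incoming output gets weight 1.
        # Average: every incoming output gets weight 1 / (i + 1).
        value = 1.0 if scheme == "all_one" else 1.0 / (i + 1)
        rows.append([value] * (i + 1))
    return rows

all_one = constant_weights(4, "all_one")
average = constant_weights(4, "average")
```

Under this reading, each Average row sums to 1 (a plain mean of the previous outputs), while each All-One row sums to i + 1 (an unweighted sum). Neither can favor one incoming layer over another, which is what the learnable weights add.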

## 3.7. Weight Visualization

• The connections in the early layers are dense, but become sparse as the depth increases.
• Most of the large weight values concentrate on the right of the matrix, which indicates that the impact of an incoming layer usually depends on its distance from the outgoing layer. In addition, for a fixed layer output, its contribution to successive layers changes dynamically (read down one column).

## Reference

[2019 ACL] [Pre-Norm Transformer]
Learning Deep Transformer Models for Machine Translation
