# Review — AdaNorm: Adaptive Normalization

## Improve LayerNorm (**Layer Normalization)**

Understanding and Improving Layer NormalizationAdaNorm, by Peking University2019 NeurIPS, Over 50 Citations(Sik-Ho Tsang @ Medium)

Machine Translation, Language Model, Image Classification, Layer Normalization

# 1. LayerNorm

- Let
*x*=(*x*1,*x*2, …,*xH*) be the vector representation of an input of size*H*to normalization layers. LayerNorm re-centers and re-scales input*x*as:

- where
is the*h***output**of a LayerNorm layer. ⊙ is a dot production operation.and*μ*are the*σ***mean**and**standard deviation**of input.**Bias**and*b***gain**are parameters with the same dimension*g**H*. - LayerNorm is a default setting in Transformer and Transformer-XL.

# 2. LayerNorm-simple

- For
**machine translation**,**Transformer** - For
**language model**, 12-layer**Transformer-XL**is used. - For
**text classification**,**Transformer** - For
**image classification**,**3-layer CNN**is used. - For
**parsing**,**MLP**-based parser is used.

The bias and gain do NOT workon six out of eight datasets.

# 3. **DetachNorm**

**Detaching derivatives**means treating the**mean and variance as changeable constants**, rather than variables, which do not require gradients in backward propagation.

- The function
*θ*(.) can be seen as a special copy function, which copies the values of*μ*and*σ*into constants ^*μ*and ^*σ*.

In all,

DetachNormkeeps the same forward normalization fact asLayerNormdoes, butcuts offs the derivatives of the mean and variance.

- DetachNorm performs worse than “w/o Norm”, showing that forward normalization has little to do with the success of LayerNorm.

DetachNorm performs worse thanLayerNorm-simple on six datasets.The derivatives of the mean and variance bring higher improvements than forward normalization does.

# 4. AdaNorm

- In AdaNorm,
*Φ*(*y*)*x*, is used to**replace the bias and gain**with the following equation:

Unlike the bias and gain being fixed in LayerNorm,

Φ(y) can adaptively adjust scaling weights based on inputs.

**To keep the training stability, some constraints are made.****(1)**First,*Φ*(*y*) must be**differentiable**.**(2)**Second,**the average scaling weight is expected to be fixed**, namely the average of*Φ*(*y*) is a constant*C*where*C*> 0.**(3)**Third, it is expected that**the average of**, which can avoid the problem of exploding loss.*z*is bounded- By considering above constraints and based on Chebyshev’s Inequality,
**finally,***Φ*(*y*) is:

- (Please feel free to read the paper if interested for this proof.)
- Given an input vector
*x*, the complete calculation process of**AdaNorm**is:

- where
is a*C***hyper-parameter**,.*k*=1/10 - In implementation,
**the gradient of**and it is only treated as a*C*(1-*ky*) is detached**changeable constant**.

**AdaNorm outperforms****LayerNorm****on seven datasets**, with 0.2 BLEU on En-De, 0.1 BLEU on De-En, 0.2 BLEU on En-Vi, 0.29 ACC on RT, 1.31 ACC on SST, 0.22 ACC on MNIST, and 0.11 UAC on PTB.

Unlike LayerNorm-simple only performing well on bigger models,

AdaNorm achieves more balanced results.

## Reference

[2019 NeurIPS] [AdaNorm]

Understanding and Improving Layer Normalization

## Natural Language Processing (NLP)

**Language/Sequence Model: 2007 **[Bengio TNN’07] **2013 **[Word2Vec] [NCE] [Negative Sampling] **2014** [GloVe] [GRU] [Doc2Vec] **2015 **[Skip-Thought] **2016 **[GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] **2017 **[TagLM] [CoVe] [MoE] **2018 **[GLUE] [T-DMCA] [GPT] [ELMo] **2019** [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2]**Machine Translation: 2014** [Seq2Seq] [RNN Encoder-Decoder] **2015** [Attention Decoder/RNNSearch] **2016** [GNMT] [ByteNet] [Deep-ED & Deep-Att] **2017 **[ConvS2S] [Transformer] [MoE] [GMNMT] **2019 **[AdaNorm]