Review: Layer Normalization (LN)

Stabilizing Training, Reducing Training Time

Layer Normalization (Image from Group Normalization)
  • Batch Normalization (BN) is dependent on the mini-batch size.
  • Layer Normalization (LN) is proposed: the mean and variance used for normalization are computed from all of the summed inputs to the neurons in a layer on a single training case.
  • This is a tech report from Prof. Geoffrey E. Hinton's group.

Outline

  1. Batch Normalization (BN)
  2. Layer Normalization (LN)
  3. Experimental Results

1. Batch Normalization (BN)

1.1. Conventional Neural Network Without BN

  • A feed-forward neural network is a non-linear mapping from an input pattern x to an output vector y.
  • The summed inputs are computed through a linear projection with the weight matrix W^l and the bottom-up inputs h^l, as follows (reconstructed after this list):
  • where b_i^l is the bias term.
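
A reconstruction of the formula referenced above, following the LN paper's notation (w_i^l is the i-th row of W^l, and f is the element-wise non-linear activation function):

    a_i^l = {w_i^l}^\top h^l, \qquad h_i^{l+1} = f\left(a_i^l + b_i^l\right)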

1.2. Conventional Neural Network With BN

  • BN was proposed in BN-Inception / Inception-v2 to reduce undesirable “covariate shift”. The method normalizes the summed inputs to each hidden unit over the training cases.
  • Specifically, for the i-th summed input in the l-th layer, batch normalization rescales the summed inputs according to their variances under the distribution of the data (see the reconstruction after this list):
  • where ā_i^l is the normalized summed input to the i-th hidden unit in the l-th layer and g_i is a gain parameter that scales the normalized activation before the non-linear activation function.
  • Note the expectation is under the whole training data distribution.
  • μ and σ are estimated using the empirical samples from the current mini-batch.
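
A reconstruction of the BN rescaling referenced above. Here μ_i^l and σ_i^l are the mean and standard deviation of a_i^l under the data distribution P(x), estimated in practice from the current mini-batch as the bullets note:

    \bar{a}_i^l = \frac{g_i}{\sigma_i^l}\left(a_i^l - \mu_i^l\right), \qquad \mu_i^l = \mathbb{E}_{x \sim P(x)}\left[a_i^l\right], \qquad \sigma_i^l = \sqrt{\mathbb{E}_{x \sim P(x)}\left[\left(a_i^l - \mu_i^l\right)^2\right]}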

2. Layer Normalization (LN)

2.1. LN

  • In LN, the “covariate shift” problem can be reduced by fixing the mean and the variance of the summed inputs within each layer.
  • The LN statistics are computed over all the hidden units in the same layer, as follows (reconstructed after this list):
  • where H denotes the number of hidden units in a layer.
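
A reconstruction of the LN statistics referenced above. Unlike BN, all hidden units in a layer share the same μ^l and σ^l, but different training cases have different normalization statistics:

    \mu^l = \frac{1}{H} \sum_{i=1}^{H} a_i^l, \qquad \sigma^l = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i^l - \mu^l\right)^2}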

2.2. Layer Normalized RNN

  • In a standard RNN, the summed inputs to the recurrent layer are computed from the current input x_t and the previous hidden state vector h_{t-1}:
  • The layer normalized recurrent layer re-centers and re-scales its activations (see the sketch after this list):
  • where W_hh is the recurrent hidden-to-hidden weight matrix and W_xh is the bottom-up input-to-hidden weight matrix. ⊙ denotes element-wise multiplication between two vectors. b and g are the bias and gain parameters, of the same dimension as h_t.
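
A minimal NumPy sketch of one layer-normalized recurrent step, written from the description above. The names (ln_rnn_step, eps) are illustrative, not from the paper, and the small eps inside the square root is a common numerical-stability addition that the paper's equations omit:

```python
import numpy as np

def ln_rnn_step(x_t, h_prev, W_xh, W_hh, b, g, eps=1e-5):
    """One layer-normalized vanilla RNN step (illustrative sketch).

    x_t    : (input_dim,)              current input
    h_prev : (hidden_dim,)             previous hidden state h_{t-1}
    W_xh   : (hidden_dim, input_dim)   bottom-up input-to-hidden weights
    W_hh   : (hidden_dim, hidden_dim)  recurrent hidden-to-hidden weights
    b, g   : (hidden_dim,)             bias and gain parameters
    """
    # Summed inputs: a_t = W_hh h_{t-1} + W_xh x_t
    a_t = W_hh @ h_prev + W_xh @ x_t

    # LN statistics: mean and std over the H hidden units of this layer
    mu = a_t.mean()
    sigma = np.sqrt(((a_t - mu) ** 2).mean() + eps)  # eps is an assumption for stability

    # Re-center, re-scale, then apply gain, bias and the non-linearity
    h_t = np.tanh(g / sigma * (a_t - mu) + b)
    return h_t

# Tiny usage example with random parameters (hidden_dim=4, input_dim=3)
rng = np.random.default_rng(0)
h = ln_rnn_step(rng.normal(size=3), np.zeros(4),
                rng.normal(size=(4, 3)), rng.normal(size=(4, 4)),
                np.zeros(4), np.ones(4))
print(h.shape)  # (4,)
```

Because μ_t and σ_t are computed over the hidden units of a single time step for a single training case, the normalization does not depend on the mini-batch size or on the sequence length.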

3. Experimental Results

3.1. Skip-Thought Experiments

Performance of Skip-Thought with and without LN on downstream tasks as a function of training iterations
Skip-Thought Results

3.2. Permutation Invariant MNIST

Permutation invariant MNIST 784–1000–1000–10 model negative log likelihood and test error with layer normalization and batch normalization.
  • LN is applied only to the fully-connected hidden layers, excluding the final softmax output layer.
