Brief Review — On the Difficulty of Training Recurrent Neural Networks
SGD+CR, Solving the Exploding and Vanishing Gradient Problems in RNNs
On the Difficulty of Training Recurrent Neural Networks
SGD+CR, by Université de Montréal
2013 ICML, Over 5000 Citations (Sik-Ho Tsang @ Medium)
Recurrent Neural Network, Character Prediction, Language Model
- A gradient norm clipping strategy is proposed to deal with exploding gradients.
- A soft constraint is proposed for the vanishing gradients problem.
- This is a paper from Bengio’s research group.
Outline
- RNN Preliminaries
- Gradient Norm Clipping Strategy
- Soft Constraint for Solving Vanishing Gradients Problem
- Results
1. RNN Preliminaries
- A generic recurrent neural network, with input u_t and state x_t for time step t, is given by x_t = W_rec σ(x_{t-1}) + W_in u_t + b.
- The parameters of the model are the recurrent weight matrix W_rec, the biases b, and the input weight matrix W_in; σ is an element-wise nonlinearity such as tanh.
- A cost E measures the performance of the network on some given task, and it can be broken apart into individual costs E_t for each time step t.
- The chain rule is used for the gradient calculation. In brief, ∂E/∂θ = Σ_t ∂E_t/∂θ, where each ∂E_t/∂θ sums contributions (∂E_t/∂x_t)(∂x_t/∂x_k)(∂⁺x_k/∂θ) over k ≤ t. Here ∂x_t/∂x_k = Π_{t≥i>k} W_rec^T diag(σ'(x_{i-1})) is a product of Jacobian matrices, and ∂⁺x_k/∂θ is the immediate partial derivative of x_k with respect to θ.
- When the norm of this Jacobian product grows exponentially with t − k, the gradient explodes: a single update can move the parameters very far and the loss blows up.
- When the norm shrinks exponentially instead, the vanishing gradient problem occurs: the contribution of distant time steps becomes negligible, so the network cannot be updated efficiently to capture long-term dependencies. (A small numerical sketch is given below.)
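Below is a minimal NumPy sketch (not the paper's code; sizes and initialization are illustrative) of the parametrization above. It rolls a tanh RNN forward and accumulates the product of one-step Jacobians ∂x_T/∂x_0; a norm that grows or shrinks exponentially with the sequence length is exactly the exploding or vanishing gradient behaviour just described.

```python
import numpy as np

# Minimal sketch (not the paper's code): a tanh RNN with the parametrization
# x_t = W_rec * sigma(x_{t-1}) + W_in * u_t + b, rolled out for T steps.
rng = np.random.default_rng(0)
n_hidden, n_in, T = 50, 10, 100

W_rec = rng.normal(scale=1.0 / np.sqrt(n_hidden), size=(n_hidden, n_hidden))
W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
b = np.zeros(n_hidden)

x = np.zeros(n_hidden)                              # state x_0
states = [x]
for t in range(T):
    u_t = rng.normal(size=n_in)                     # input u_t
    x = W_rec @ np.tanh(x) + W_in @ u_t + b         # next state x_t
    states.append(x)

# Accumulate the Jacobian product dx_T/dx_0 = prod_i W_rec^T diag(sigma'(x_{i-1})).
# A norm growing (shrinking) exponentially with T corresponds to exploding
# (vanishing) gradients.
jac = np.eye(n_hidden)
for x_prev in states[:-1]:
    jac = (W_rec.T * (1.0 - np.tanh(x_prev) ** 2)) @ jac
print("spectral norm of dx_T/dx_0:", np.linalg.norm(jac, 2))
```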
2. Gradient Norm Clipping Strategy
- One simple mechanism to deal with a sudden increase in the norm of the gradients is to rescale them whenever they go over a threshold: with ĝ = ∂E/∂θ, if ‖ĝ‖ ≥ threshold, then ĝ ← (threshold / ‖ĝ‖) ĝ.
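A minimal sketch of this clipping rule, assuming all parameter gradients have been flattened into a single vector (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale the gradient vector whenever its norm exceeds a threshold.

    `grad` is assumed to be a flat NumPy array holding all parameter
    gradients; `threshold` is a hyperparameter (the paper suggests choosing
    it based on the average gradient norm seen over many updates).
    """
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# Sketch of a single SGD update with clipping:
# params -= learning_rate * clip_gradient_norm(grad, threshold=1.0)
```

In modern frameworks the same operation is built in, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch.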
3. Soft Constraint for Solving Vanishing Gradients Problem
- A regularizer is used to force the error signal not to vanish as it travels back in time: Ω = Σ_k Ω_k, with Ω_k = ( ‖(∂E/∂x_{k+1})(∂x_{k+1}/∂x_k)‖ / ‖∂E/∂x_{k+1}‖ − 1 )². Each Ω_k is zero exactly when the Jacobian ∂x_{k+1}/∂x_k preserves the norm of the backpropagated error at step k.
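A minimal sketch of a single term Ω_k, assuming the backpropagated error and the one-step Jacobian are available as NumPy arrays (function and argument names are illustrative, not from the paper's code). It only shows what the regularizer measures; the paper computes the gradient of Ω approximately for efficiency, so a full implementation needs extra care.

```python
import numpy as np

def omega_k(dE_dx_next, jacobian):
    """Sketch of one term Omega_k of the soft constraint (illustrative names).

    dE_dx_next : dE/dx_{k+1}, the error backpropagated to step k+1 (1-D array).
    jacobian   : dx_{k+1}/dx_k, e.g. W_rec^T diag(sigma'(x_k)) for a tanh RNN.
    Omega_k is zero exactly when propagating the error one step back in time
    preserves its norm, and grows as the error signal shrinks or blows up.
    """
    propagated = dE_dx_next @ jacobian          # dE/dx_{k+1} * dx_{k+1}/dx_k
    ratio = np.linalg.norm(propagated) / np.linalg.norm(dE_dx_next)
    return (ratio - 1.0) ** 2
```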
4. Results
- SGD: Standard SGD
- SGD-C: SGD enhanced with clipping strategy
- SGD-CR: SGD with the clipping strategy and the regularization term
- For sequences longer than 20, the vanishing gradients problem ensures that neither the SGD nor the SGD-C algorithm can solve the task.
- SGD-CR can deal with sequences of length from 50 up to 200.
- SGD-CR provides a statistically significant improvement over the state of the art for RNNs on all the polyphonic music prediction tasks.
- Though LSTM had already been proposed at that time, this paper addresses the exploding and vanishing gradient problems with training techniques alone, without any architecture changes.
- Gradient clipping is still used in many papers nowadays.
Reference
[2013 ICML] [SGD+CR]
On the Difficulty of Training Recurrent Neural Networks