Brief Review — On the Difficulty of Training Recurrent Neural Networks

SGD+CR, Solving the Gradient Explosion and Vanishing Problems in RNN

Sik-Ho Tsang
3 min read · Aug 12, 2022

On the Difficulty of Training Recurrent Neural Networks, by Université de Montréal
2013 ICML, Over 5000 Citations
(Sik-Ho Tsang @ Medium)
Recurrent Neural Network, Character Prediction, Language Model

  • A gradient norm clipping strategy is proposed to deal with exploding gradients.
  • A soft constraint is proposed for the vanishing gradients problem.
  • This is a paper from Bengio’s research group.


  1. RNN Preliminaries
  2. Gradient Norm Clipping Strategy
  3. Soft Constraint for Solving Vanishing Gradients Problem
  4. Results

1. RNN Preliminaries

Schematic of a recurrent neural network
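  • A generic recurrent neural network, with input u_t and state x_t for time step t, is given by:
      x_t = W_rec σ(x_{t−1}) + W_in u_t + b
    where σ is an element-wise nonlinearity (usually tanh or sigmoid).
  • The parameters of the model are given by the recurrent weight matrix W_rec, the biases b and the input weight matrix W_in.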
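To make the recurrence concrete, here is a minimal sketch in plain Python. It assumes the paper's formulation x_t = W_rec σ(x_{t−1}) + W_in u_t + b with σ = tanh; the function names `rnn_step` and `unroll` and the list-of-lists matrix layout are illustrative choices, not something taken from the paper.

```python
import math


def rnn_step(x_prev, u_t, W_rec, W_in, b):
    """One step of x_t = W_rec * tanh(x_{t-1}) + W_in * u_t + b.

    Vectors are lists of floats; matrices are lists of rows (assumed layout).
    """
    h = [math.tanh(v) for v in x_prev]  # element-wise nonlinearity on the previous state
    return [
        sum(w * hj for w, hj in zip(W_rec[i], h))      # recurrent contribution
        + sum(w * uj for w, uj in zip(W_in[i], u_t))   # input contribution
        + b[i]                                         # bias
        for i in range(len(b))
    ]


def unroll(x0, inputs, W_rec, W_in, b):
    """Unroll the network in time: apply the same parameters at every step."""
    states = []
    x = x0
    for u_t in inputs:
        x = rnn_step(x, u_t, W_rec, W_in, b)
        states.append(x)
    return states
```

The same W_rec, W_in and b are reused at every step, which is why the gradient with respect to them sums contributions from all time steps.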
Unrolling recurrent neural networks in time by creating a copy of the model for each time step
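  • A cost E measures the performance of the network on some given task, and it can be broken apart into individual costs for each step, E = Σ_t E_t.
  • The gradient is computed with the chain rule, i.e. backpropagation through time. In brief, the contribution of a state x_k to a later cost E_t passes through ∂x_t/∂x_k, a product of t − k Jacobian matrices, one per intermediate time step.
  • When the norm of this product grows exponentially with t − k, the gradient explodes. Updates become huge and the loss can blow up.
  • When it shrinks exponentially towards zero, the gradient vanishes. Long-term dependencies barely contribute to the update, so the network cannot learn them efficiently.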
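For reference, the chain-rule expansion can be written out as below, following the paper's notation as I recall it. Here ∂⁺x_k/∂θ denotes the "immediate" partial derivative of x_k with respect to θ (x_{k−1} is treated as a constant), and the last equality assumes the recurrence x_t = W_rec σ(x_{t−1}) + W_in u_t + b.

```latex
\begin{align}
\frac{\partial E}{\partial \theta}
  &= \sum_{1 \le t \le T} \frac{\partial E_t}{\partial \theta} \\
\frac{\partial E_t}{\partial \theta}
  &= \sum_{1 \le k \le t}
     \frac{\partial E_t}{\partial x_t}\,
     \frac{\partial x_t}{\partial x_k}\,
     \frac{\partial^{+} x_k}{\partial \theta} \\
\frac{\partial x_t}{\partial x_k}
  &= \prod_{t \ge i > k} \frac{\partial x_i}{\partial x_{i-1}}
   = \prod_{t \ge i > k} W_{rec}^{\top}\,
     \mathrm{diag}\!\left(\sigma'(x_{i-1})\right)
\end{align}
```

The third line is the product of Jacobians responsible for both problems: if each factor has norm above 1 the product can grow exponentially with t − k, and if each factor has norm below 1 it shrinks exponentially.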

2. Gradient Norm Clipping Strategy

  • One simple mechanism to deal with a sudden increase in the norm of the gradients is to rescale them whenever they go over a threshold.
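The rescaling rule can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the gradient is available as a flat list of floats and that `threshold` is a positive hyperparameter; the function name `clip_gradient_norm` is made up for this example.

```python
import math


def clip_gradient_norm(grads, threshold):
    """Rescale grads so that their L2 norm does not exceed threshold.

    grads: flat list of gradient values (assumed layout).
    threshold: positive clipping threshold (hyperparameter).
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm >= threshold:
        scale = threshold / norm
        # Same direction as the original gradient, norm reduced to threshold.
        return [g * scale for g in grads]
    # Below the threshold the gradient is left untouched.
    return list(grads)
```

Because the whole gradient vector is scaled by a single factor, its direction is preserved and only the step size changes. This differs from clipping each component independently, which would alter the direction.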

3. Soft Constraint for Solving Vanishing Gradients Problem

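  • A regularizer Ω is used to force the Jacobian ∂x_{k+1}/∂x_k to preserve the norm of the error signal ∂E/∂x_{k+1} as it travels back in time, so that it does not vanish:
      Ω = Σ_k Ω_k = Σ_k ( ‖(∂E/∂x_{k+1}) (∂x_{k+1}/∂x_k)‖ / ‖∂E/∂x_{k+1}‖ − 1 )²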
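As a rough illustration, the penalty Ω = Σ_k ( ‖(∂E/∂x_{k+1}) (∂x_{k+1}/∂x_k)‖ / ‖∂E/∂x_{k+1}‖ − 1 )² can be evaluated as follows. This sketch assumes the per-step error signals and Jacobians have already been computed and are passed in as plain lists; the function names and the data layout are hypothetical, and how the penalty is differentiated during training is not covered here.

```python
import math


def vec_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def vanishing_gradient_penalty(error_signals, jacobians):
    """Sum over steps k of (||e_k J_k|| / ||e_k|| - 1)^2.

    error_signals[k]: dE/dx_{k+1} as a list (assumed nonzero).
    jacobians[k][i][j]: d x_{k+1}[i] / d x_k[j] (assumed layout).
    """
    omega = 0.0
    for e, J in zip(error_signals, jacobians):
        n_cols = len(J[0])
        # Row-vector times matrix: the error signal pushed one step back in time.
        back = [sum(e[i] * J[i][j] for i in range(len(e))) for j in range(n_cols)]
        ratio = vec_norm(back) / vec_norm(e)
        omega += (ratio - 1.0) ** 2
    return omega
```

The penalty is zero when every step preserves the norm of the error signal, and it grows when the signal is shrunk or amplified along the direction that matters for the cost.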

4. Results

Rate of success for solving the temporal order problem versus log of sequence length
  • SGD: Standard SGD
  • SGD-C: SGD enhanced with clipping strategy
  • SGD-CR: SGD with the clipping strategy and the regularization term
  • For sequences longer than 20, the vanishing gradients problem ensures that neither SGD nor SGD-C algorithms can solve the task.

A single model trained with SGD-CR can deal with any sequence of length 50 up to 200.

Results on polyphonic music prediction in negative log likelihood per time step
Results on the next character prediction task in entropy (bits/character)

SGD-CR provides a statistically significant improvement on the state-of-the-art for RNNs on all the polyphonic music prediction tasks.

Although LSTM had already been proposed by then, this paper tackles the exploding and vanishing gradient problems without changing the architecture.

Gradient clipping is still used by many papers nowadays.


