# Brief Review — On the Difficulty of Training Recurrent Neural Networks

## SGD+CR, Solving the Gradient Explosion and Vanishing Problems in RNN


On the Difficulty of Training Recurrent Neural Networks, by Université de Montréal

SGD+CR (Sik-Ho Tsang @ Medium)

2013 ICML, Over 5000 Citations

Recurrent Neural Network, Character Prediction, Language Model

- **A gradient norm clipping strategy** is proposed to deal with **exploding gradients**.
- **A soft constraint** is proposed for the **vanishing gradients problem**.
- This is a paper from Bengio’s research group.

# Outline

1. **RNN Preliminaries**
2. **Gradient Norm Clipping Strategy**
3. **Soft Constraint for Solving Vanishing Gradients Problem**
4. **Results**

# 1. RNN Preliminaries

- A generic recurrent neural network, with **input** $u_t$ and **state** $x_t$ for **time step** $t$, is given by:

$$x_t = W_{rec}\,\sigma(x_{t-1}) + W_{in}\,u_t + b$$

- The parameters of the model are given by the **recurrent weight matrix** $W_{rec}$, the **biases** $b$ and the **input weight matrix** $W_{in}$.
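To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass. It assumes the parametrization `x_t = W_rec · σ(x_{t-1}) + W_in · u_t + b` with `σ = tanh`. The function names, sizes and random initialization are illustrative choices, not taken from the paper.

```python
import numpy as np

def rnn_step(x_prev, u_t, W_rec, W_in, b):
    """One recurrent update: x_t = W_rec @ tanh(x_prev) + W_in @ u_t + b."""
    return W_rec @ np.tanh(x_prev) + W_in @ u_t + b

def run_rnn(inputs, x0, W_rec, W_in, b):
    """Unroll the recurrence over a sequence and return the states x_1..x_T."""
    states = []
    x = x0
    for u_t in inputs:
        x = rnn_step(x, u_t, W_rec, W_in, b)
        states.append(x)
    return np.stack(states)

# Illustrative sizes and weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
n_state, n_in, T = 4, 3, 5
W_rec = rng.normal(scale=0.5, size=(n_state, n_state))
W_in = rng.normal(scale=0.5, size=(n_state, n_in))
b = np.zeros(n_state)
inputs = rng.normal(size=(T, n_in))

states = run_rnn(inputs, np.zeros(n_state), W_rec, W_in, b)
print(states.shape)  # (5, 4): one state vector per time step
```

The same three parameters are reused at every time step, which is why the gradient with respect to them accumulates contributions from all steps.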

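- A **cost** $E$ measures the performance of the network on some given task, and it can be **broken apart into individual costs** $E_t$ **for each step**, so that $E = \sum_{1 \le t \le T} E_t$.
- The **chain rule** is used for **gradient calculation** (backpropagation through time). In brief:

$$\frac{\partial E}{\partial \theta} = \sum_{1 \le t \le T} \frac{\partial E_t}{\partial \theta}, \qquad \frac{\partial E_t}{\partial \theta} = \sum_{1 \le k \le t} \frac{\partial E_t}{\partial x_t}\,\frac{\partial x_t}{\partial x_k}\,\frac{\partial^{+} x_k}{\partial \theta}$$

$$\frac{\partial x_t}{\partial x_k} = \prod_{t \ge i > k} \frac{\partial x_i}{\partial x_{i-1}} = \prod_{t \ge i > k} W_{rec}^{T}\,\mathrm{diag}\big(\sigma'(x_{i-1})\big)$$

- Here $\partial^{+} x_k / \partial \theta$ is the immediate partial derivative of the state $x_k$ with respect to $\theta$, with $x_{k-1}$ treated as a constant.
- There can be **gradient explosion** when the product of Jacobians, and hence **the gradient, is too large**. The parameter update then overshoots and the network obtains a **huge loss**.
- When the product shrinks toward zero, the **gradient is too small** and the **gradient vanishing** problem occurs. Distant time steps contribute almost nothing, so the **network cannot be updated efficiently**.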
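The effect of multiplying many Jacobians can be illustrated numerically. The sketch below assumes the simplified linear case (`σ' = 1`), where the product of Jacobians over `k` steps reduces to the `k`-th power of the recurrent matrix (a transpose does not change the norm). The helper names and the choice of spectral radii 0.5 and 1.5 are illustrative assumptions.

```python
import numpy as np

def jacobian_product_norms(W_rec, steps):
    """Spectral norms of W_rec^k for k = 1..steps (linear case, sigma' = 1)."""
    prod = np.eye(W_rec.shape[0])
    norms = []
    for _ in range(steps):
        prod = prod @ W_rec
        norms.append(np.linalg.norm(prod, 2))
    return np.array(norms)

def rescale_to_spectral_radius(W, rho):
    """Scale W so that its largest absolute eigenvalue equals rho."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho / radius)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

small = jacobian_product_norms(rescale_to_spectral_radius(W, 0.5), 50)
large = jacobian_product_norms(rescale_to_spectral_radius(W, 1.5), 50)

print(f"spectral radius 0.5: norm after 50 steps = {small[-1]:.3e}")  # vanishes
print(f"spectral radius 1.5: norm after 50 steps = {large[-1]:.3e}")  # explodes
```

With a spectral radius below 1 the norm decays geometrically, and above 1 it grows geometrically. With a saturating nonlinearity the `σ'` factors shrink the product further, so a spectral radius above 1 no longer guarantees explosion.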

# 2. Gradient Norm Clipping Strategy

- One simple mechanism to deal with a sudden increase in the norm of the gradients is to
**rescale them whenever they go over a threshold**.
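A minimal NumPy sketch of this rescaling rule follows. It treats the gradient as a list of arrays, one per parameter, and clips by the norm of all of them taken together. The function name `clip_by_global_norm` and the example values are illustrative assumptions, not code from the paper.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients so their joint L2 norm is at most `threshold`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm >= threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Two hypothetical parameter gradients with joint norm sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]

clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)  # 5.0 before, 1.0 after (up to rounding)
```

Clipping changes only the length of the update, not its direction, so a step taken near a steep "wall" of the error surface stays bounded.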

# 3. Soft Constraint for Solving Vanishing Gradients Problem

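- A **regularizer** is used to **force** the Jacobians $\partial x_{k+1} / \partial x_k$ to preserve the norm of the error signal, so that it does **not vanish** as it travels back in time:

$$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\lVert \frac{\partial E}{\partial x_{k+1}}\,\frac{\partial x_{k+1}}{\partial x_k} \right\rVert}{\left\lVert \frac{\partial E}{\partial x_{k+1}} \right\rVert} - 1 \right)^{2}$$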
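As a rough illustration, the sketch below evaluates a regularizer of the form `Ω = Σ_k (‖∂E/∂x_{k+1} · ∂x_{k+1}/∂x_k‖ / ‖∂E/∂x_{k+1}‖ − 1)²`, given the back-propagated error vectors and the per-step Jacobians as plain arrays. It computes only the value of the penalty, not its gradient. The function name and the toy inputs are assumptions made for the example.

```python
import numpy as np

def vanishing_gradient_regularizer(dE_dx_next, jacobians):
    """Sum over steps of (||g_{k+1} @ J_k|| / ||g_{k+1}|| - 1)^2.

    dE_dx_next[k] is the error gradient dE/dx_{k+1} (shape n),
    jacobians[k] is the Jacobian dx_{k+1}/dx_k (shape n x n).
    """
    omega = 0.0
    for g_next, J in zip(dE_dx_next, jacobians):
        ratio = np.linalg.norm(g_next @ J) / np.linalg.norm(g_next)
        omega += (ratio - 1.0) ** 2
    return omega

rng = np.random.default_rng(0)
n, steps = 4, 3
error_grads = [rng.normal(size=n) for _ in range(steps)]

# An orthogonal Jacobian preserves the norm of the error, so the penalty is ~0.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
omega_preserving = vanishing_gradient_regularizer(error_grads, [Q] * steps)

# A Jacobian that halves the error at every step is penalized.
omega_shrinking = vanishing_gradient_regularizer(error_grads, [0.5 * np.eye(n)] * steps)

print(omega_preserving)  # approximately 0
print(omega_shrinking)   # 0.75 = 3 * (0.5 - 1)^2
```

The penalty is zero exactly when each step neither shrinks nor amplifies the error signal, which is why it is described as a soft constraint rather than a hard one.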

# 4. Results

- **SGD**: Standard SGD.
- **SGD-C**: SGD enhanced **with the clipping strategy**.
- **SGD-CR**: SGD with **the clipping strategy** and **the regularization term**.
- On the temporal order problem, for sequences longer than 20, the vanishing gradients problem ensures that neither SGD nor SGD-C can solve the task.
- SGD-CR can deal with any sequence of length 50 up to 200.

SGD-CR provides a statistically significant improvement on the state-of-the-art for RNNs on all the polyphonic music prediction tasks.

Though LSTM had already been proposed at that time, this paper tries to solve the gradient explosion and vanishing problems without changing the architecture.

Gradient clipping is still used by many papers nowadays.

## Reference

[2013 ICML] [SGD+CR]

On the Difficulty of Training Recurrent Neural Networks

## Language/Sequence Model

**2007** … **2013** … [SGD+CR] … **2020** [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT]