# Brief Review — A Simple Way to Initialize Recurrent Networks of Rectified Linear Units

## IRNN, RNN Initialized Using Identity Matrix

---

A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, IRNN, by Google, 2015 arXiv v2, Over 700 Citations (Sik-Ho Tsang @ Medium)

Recurrent Neural Network, RNN

- To overcome vanishing and exploding gradients, **instead of using LSTM**, the **identity matrix** or **its scaled version** is proposed to **initialize the recurrent weight matrix**. This is a tech report from Prof. Hinton’s research group.

# Outline

1. **RNN Initialized Using Identity Matrix (IRNN)**
2. **Results**

# 1. **RNN Initialized Using Identity Matrix (IRNN)**

## 1.1. Identity Matrix

- **The standard RNN** is used with **ReLU** activations.
- **Gradient clipping** is applied during SGD training.
- The recurrent **weight matrix** is initialized to the **identity matrix** and the **biases** to **zero**, so that no extra error derivatives are added at the beginning of training.
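The initialization above can be written as a minimal NumPy sketch (the function names and shapes here are illustrative, not from the paper). With identity recurrent weights and zero biases, a nonnegative hidden state is copied forward unchanged under zero input, which is exactly why gradients neither vanish nor explode at initialization:

```python
import numpy as np

def irnn_init(hidden_size):
    # IRNN initialization: recurrent weights = identity, biases = zero.
    W_hh = np.eye(hidden_size)
    b_h = np.zeros(hidden_size)
    return W_hh, b_h

def irnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # One ReLU recurrence step: h_t = max(0, W_xh x_t + W_hh h_prev + b_h)
    return np.maximum(0.0, W_xh @ x_t + W_hh @ h_prev + b_h)

W_hh, b_h = irnn_init(4)
h = np.array([1.0, 2.0, 3.0, 4.0])
# With zero input contribution, the identity recurrence preserves h exactly.
h_next = irnn_step(np.zeros(4), h, np.zeros((4, 4)), W_hh, b_h)
```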

## 1.2. Scaled Version

- For tasks that exhibit fewer long-range dependencies, **scaling the identity matrix by a small scalar** is an effective mechanism to forget long-range effects.

# 2. Results

## 2.1. Adding Problem

- At every time step, the input consists of a random signal and a mask signal.
- This is a sequence regression problem where the target is **a sum of two numbers selected in a sequence of random signals.**
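The setup above can be sketched as a small data generator (a minimal NumPy sketch under assumed conventions; the function name and shapes are not from the paper): each time step carries a random signal and a mask bit, the mask marks exactly two positions, and the regression target is the sum of the two marked signals.

```python
import numpy as np

def adding_problem_batch(T, batch, rng):
    # Each time step's input is a pair: (random signal in [0, 1), mask bit).
    signals = rng.random((batch, T))
    masks = np.zeros((batch, T))
    for b in range(batch):
        # Mark exactly two distinct positions per sequence.
        i, j = rng.choice(T, size=2, replace=False)
        masks[b, [i, j]] = 1.0
    # Target: sum of the two signal values at the marked positions.
    targets = (signals * masks).sum(axis=1)
    inputs = np.stack([signals, masks], axis=-1)  # shape (batch, T, 2)
    return inputs, targets

inputs, targets = adding_problem_batch(T=10, batch=3, rng=np.random.default_rng(0))
```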

The convergence of IRNNs is as good as that of LSTMs. Yet, each LSTM step is more expensive than an IRNN step.

## 2.2. MNIST Classification from a Sequence of Pixels

- The networks **read one pixel at a time in scanline order** (i.e. starting at the top left corner of the image, and ending at the bottom right corner). This is therefore a **huge long-range dependency problem**, because each recurrent network unrolls over **784 time steps**.
- All networks have 100 recurrent hidden units.
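The pixel-by-pixel reading can be sketched as a simple reshape (an illustrative NumPy sketch; the stand-in image below is not real MNIST data): a 28×28 image becomes a 784-step sequence with one pixel per step, in scanline order.

```python
import numpy as np

# Stand-in for a 28x28 MNIST digit (real data would come from the dataset).
img = np.arange(28 * 28).reshape(28, 28)

# Scanline order: row by row, top-left pixel first, bottom-right pixel last.
seq = img.reshape(-1, 1)  # 784 time steps, 1 pixel per step
```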

Standard RNNs fail to work, even with ReLUs, whereas the IRNN achieves a 3% test error rate. LSTM did not work as well as IRNN.

## 2.3. Language Model on **1 Billion Word Benchmark**

The performance of IRNNs is closer to that of LSTMs on this large-scale task than to that of standard RNNs.

## 2.4. Frame Error Rate on TIMIT Phone Recognition Task

A **scaled value of 0.01** is used in the IRNN, i.e. the recurrent weight matrix is initialized to 0.01 *I*.

The IRNN easily outperforms the RNN that uses tanh units and is comparable to LSTM.

The identity matrix or its scaled version is used for weight initialization to stabilize training and achieve better performance.

## Reference

[2015 arXiv v2] [IRNN] A Simple Way to Initialize Recurrent Networks of Rectified Linear Units

## Language Model / Sequence Model

**2007** … **2015** … [IRNN] … **2016** … **2020** [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT]