# Review: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (GRU)

## Comparable Performance With LSTM, Faster Convergence than Vanilla RNN


In this story, **Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling**, (GRU), by Université de Montréal, is briefly reviewed. This is a paper from Prof. Bengio’s group. In this paper:

**GRU is introduced with performance comparable with LSTM.**

This is a paper in **2014 NeurIPS** with over **7800 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Vanilla Recurrent Neural Network (Vanilla RNN)**
2. **Long Short-Term Memory (LSTM)**
3. **Gated Recurrent Unit (GRU)**
4. **Experimental Results**

**1. Vanilla Recurrent Neural Network (Vanilla RNN)**

- Traditionally, in a **vanilla RNN**, the update of the recurrent hidden state *ht* is:

$$h_t = g(W x_t + U h_{t-1})$$

- where *W* and *U* are the input and recurrent weight matrices, and *g* is a smooth, bounded function such as a logistic sigmoid function or a **hyperbolic tangent function**.
- (In the above figure, *g* is the **hyperbolic tangent function** tanh.)
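The update can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming the common bias-free form h_t = g(W x_t + U h_{t-1}); the function name `rnn_step`, the matrix names `W` and `U`, and the toy sizes are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, g=np.tanh):
    """One vanilla RNN update: h_t = g(W x_t + U h_{t-1}) (no bias, by assumption)."""
    return g(W @ x_t + U @ h_prev)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                      # illustrative sizes
W = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
U = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))

h = np.zeros(hidden_dim)                          # start from a zero state
for x_t in rng.normal(size=(5, input_dim)):       # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, W, U)                    # the same weights are reused at every step
```

Because *g* is tanh here, every entry of the hidden state stays within [-1, 1].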

However, it is difficult to train RNNs to capture long-term dependencies because the gradients tend to either vanish (most of the time) or explode (rarely, but with severe effects).
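The vanishing case can be seen numerically. The sketch below assumes a small tanh RNN with small random recurrent weights (the sizes, scales, and names are all illustrative) and accumulates the Jacobian of the current state with respect to the initial state, one step at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                                  # illustrative hidden size
U = rng.normal(scale=0.3 / np.sqrt(n), size=(n, n))    # small recurrent weights (assumed)
W = rng.normal(scale=0.5, size=(n, n))                 # input weights (assumed)

h = np.zeros(n)
J = np.eye(n)            # accumulates d h_t / d h_0
norms = []
for t in range(50):
    x_t = rng.normal(size=n)
    h = np.tanh(W @ x_t + U @ h)
    # Chain rule through one step: d h_t / d h_{t-1} = diag(1 - h_t^2) U
    J = np.diag(1.0 - h**2) @ U @ J
    norms.append(np.linalg.norm(J))
```

With weights this small, `norms` shrinks roughly geometrically, so a signal from 50 steps back contributes almost nothing to the gradient. Larger recurrent weights can produce the opposite, exploding, behaviour.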

- There have been **two dominant approaches** by which many researchers have tried to reduce the negative impacts of this issue.

1. One is to devise a **better learning algorithm** than simple stochastic gradient descent, such as using **gradient clipping**.
2. Another is to design a **more sophisticated activation function**, such as **LSTM** and **GRU**.
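Gradient clipping is only mentioned by name here. One common variant, rescaling all gradients when their joint norm exceeds a threshold, might look like the sketch below. The helper name `clip_by_global_norm` and the threshold are hypothetical, and this is not necessarily the exact scheme used in the paper.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g**2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # leave small gradients untouched
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]        # toy gradients with joint norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g**2)) for g in clipped))
```

The direction of the update is preserved; only its length is capped, which limits the damage of an exploding gradient.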

**2. Long Short-Term Memory (LSTM)**

- (Please note, there are many LSTM variants, or different presentations for the equations.)
- Normally, each LSTM unit maintains a **memory** *ct* at **time** *t*.
- Each LSTM has **3 gates**: **forget gate**, **input gate**, and **output gate**.

Wherever a sigmoid function appears, it bounds the signal between 0 and 1, so it acts as a gate that controls the amount of information flow.
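A tiny example of this gating behaviour, with made-up numbers:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

signal = np.array([2.0, -1.0, 0.5])              # information that wants to flow
pre_activation = np.array([-10.0, 0.0, 10.0])    # illustrative gate inputs
gate = sigmoid(pre_activation)                   # close to 0, exactly 0.5, close to 1
gated = gate * signal                            # element-wise: blocked, halved, passed
```

A gate value near 0 blocks the corresponding entry, a value near 1 lets it through almost unchanged.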

## 2.1. Forget Gate

**The extent to which the existing memory is forgotten** is modulated by a **forget gate** *ft*:

$$f_t = \sigma(W_f x_t + U_f h_{t-1})$$

- where *σ* is a logistic **sigmoid** function. *Uf* and *Wf* are the weights to be learnt.
- As seen, the forget gate is controlled **based on the input** *xt* **and the previous hidden state** *ht*-1.

## 2.2. Input Gate

**The degree to which the new memory content is added** to the memory cell is modulated by an **input gate** *it*:

$$i_t = \sigma(W_i x_t + U_i h_{t-1})$$

- As seen, the input gate is also controlled **based on the input** *xt* **and the previous hidden state** *ht*-1.
- But the weights of the input gate are independent of those of the forget gate.

## 2.3. Cell State

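**The memory cell** *ct* is updated by partially forgetting the existing memory and adding **new memory content** *~ct*:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

- where the new memory content *~ct* is:

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})$$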

## 2.4. Output Gate

*ot* is an **output gate** that **modulates the amount of memory content exposure**:

$$o_t = \sigma(W_o x_t + U_o h_{t-1})$$

- Finally, the **output** *ht*, or the activation, of the LSTM unit is:

$$h_t = o_t \odot \tanh(c_t)$$
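Putting Sections 2.1 to 2.4 together, one LSTM step could be sketched as follows. This assumes one particular variant, with no biases and no peephole connections, where each gate sees only *xt* and *ht*-1. The function name `lstm_step`, the parameter dictionary `p`, and its key names are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update without biases or peephole connections (a simplification)."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev)        # forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev)        # input gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)    # new memory content
    c_t = f_t * c_prev + i_t * c_tilde                     # partially forget, then add
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev)        # output gate
    h_t = o_t * np.tanh(c_t)                               # exposed activation
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                               # illustrative sizes
p = {}
for name in "fico":                                        # forget, input, cell, output
    p["W" + name] = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
    p["U" + name] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):                # a toy sequence of 6 inputs
    h, c = lstm_step(x_t, h, c, p)
```

Note that the memory *ct* and the exposed activation *ht* are two separate quantities: the output gate decides how much of the memory is shown to the rest of the network.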

**3. Gated Recurrent Unit (GRU)**

- Similar to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but without separate memory cells.
- As seen, there are only 2 sigmoid functions, which means there are **only 2 gates**, called the **reset gate** and the **update gate**.

## 3.1. Reset Gate

**The candidate activation** ~*ht* is computed similarly to that of the traditional recurrent unit:

$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$$

- where *rt* is a set of reset gates and ⊙ is an element-wise multiplication.
- **The reset gate** is similar to the forget gate in LSTM:

$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$

**When off (*rt* close to 0)**, the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to **forget the previously computed state**.
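The effect of switching the reset gate off can be checked directly. The sketch assumes the candidate has the form tanh(W x_t + U (r_t ⊙ h_{t-1})); the function name `candidate`, the weights, and the sizes are illustrative.

```python
import numpy as np

def candidate(x_t, h_prev, r_t, W, U):
    """Candidate activation: tanh(W x_t + U (r_t * h_prev)), without bias (assumed form)."""
    return np.tanh(W @ x_t + U @ (r_t * h_prev))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))
U = rng.normal(scale=0.5, size=(4, 4))
x_t = rng.normal(size=3)
h_a, h_b = rng.normal(size=4), rng.normal(size=4)   # two different histories

r_off = np.zeros(4)    # reset gate fully off
r_on = np.ones(4)      # reset gate fully on

# With the gate off, the history is ignored and both candidates coincide.
same_when_off = np.allclose(candidate(x_t, h_a, r_off, W, U),
                            candidate(x_t, h_b, r_off, W, U))
# With the gate on, the candidate depends on the previous state again.
same_when_on = np.allclose(candidate(x_t, h_a, r_on, W, U),
                           candidate(x_t, h_b, r_on, W, U))
```

With `r_off`, the candidate reduces to tanh(W x_t), which is what the unit would compute on the first symbol of a sequence starting from a zero state.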

## 3.2. Update Gate

**The activation** *ht* of the GRU at time *t* is a **linear interpolation between the previous activation** *ht*-1 **and the candidate activation** ~*ht*:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

- where an **update gate** *zt* decides how much the unit updates its activation, or content. The update gate is computed by:

$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$

This procedure of taking **a linear sum between the existing state and the newly computed state** is similar to the LSTM unit.

The GRU, however, does not have any mechanism to control the degree to which its state is exposed, but exposes the whole state each time.
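A full GRU step, combining the reset gate, the candidate activation and the update gate, might be sketched like this. It assumes the bias-free form with h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ ~h_t; the function name `gru_step` and the parameter names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update without biases (a simplification)."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))   # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # linear interpolation

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                                    # illustrative sizes
p = {}
for name in ("z", "r", ""):                                     # update, reset, candidate
    p["W" + name] = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
    p["U" + name] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):                     # a toy sequence of 6 inputs
    h = gru_step(x_t, h, p)
```

Compared with an LSTM step, there is no separate memory cell and no output gate: the returned state is both what is remembered and what is exposed.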

- Both the LSTM unit and the GRU keep the existing content and add the new content on top of it. This provides **two advantages**:

1. **Any important feature**, decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but **maintained** as it is.
2. It effectively creates **shortcut paths** that **bypass multiple temporal steps**. These shortcuts allow the error to be **back-propagated easily** without vanishing too quickly.
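Both advantages can be illustrated with a toy run of the GRU-style interpolation. This is a deliberately simplified sketch: the update gate is held at a small constant value and the candidates are random stand-ins, whereas in a real GRU both depend on the input and the previous state.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
z = 0.001                              # update gate held almost closed (illustrative constant)
h0 = np.array([1.0, -1.0])             # the "important feature" to be kept
h = h0.copy()
for _ in range(T):
    h_tilde = np.tanh(rng.normal(size=2))    # stand-in candidate activations
    h = (1.0 - z) * h + z * h_tilde          # additive, interpolating update

# Factor multiplying h0 after T steps. Under the same simplification it is also
# the factor applied to the error flowing back along this direct path.
carry = (1.0 - z) ** T
```

After 100 steps, `h` is still close to `h0` and `carry` is still above 0.9. A plain recurrence that shrinks the signal by a constant factor at every step would leave far less after the same number of steps.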

**4. Experimental Results**

## 4.1. Datasets and Models

- For **polyphonic music modeling**, **4 polyphonic music datasets** are used: Nottingham, JSB Chorales, MuseData and Piano-midi.
- These datasets contain sequences in which each symbol is respectively a 93-, 96-, 105-, and 108-dimensional binary vector.
- A logistic sigmoid function is used for the output units.
- For **speech signal modeling**, **2 internal datasets** provided by Ubisoft are used. Each sequence is a one-dimensional raw audio signal, and at each time step the recurrent neural network looks at 20 consecutive samples to predict the following 10 consecutive samples.
- One has sequences of length 500 (Ubisoft A) and the other has sequences of length 8,000 (Ubisoft B). Ubisoft A and Ubisoft B have 7,230 and 800 sequences, respectively. A mixture of Gaussians with 20 components is used as the output layer.
- **The size of each model is chosen so that each model has approximately the same number of parameters.** The models are also kept small enough to avoid overfitting.
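To get a feeling for what "approximately the same number of parameters" implies, the sketch below counts only the input and recurrent weight matrices of one recurrent layer (no biases, no output layer, no peephole weights) and searches for the largest hidden size that fits a budget. The helper names and the budget of 20,000 are hypothetical; only the input size of 93 comes from the dataset description.

```python
def recurrent_params(unit, input_dim, hidden_dim):
    """Weights in one recurrent layer, counting input and recurrent matrices only."""
    per_block = hidden_dim * (input_dim + hidden_dim)
    blocks = {"tanh": 1, "gru": 3, "lstm": 4}[unit]    # number of weight-matrix pairs
    return blocks * per_block

def hidden_size_for_budget(unit, input_dim, budget):
    """Largest hidden size whose weight count stays within the budget."""
    n = 1
    while recurrent_params(unit, input_dim, n + 1) <= budget:
        n += 1
    return n

input_dim, budget = 93, 20_000     # budget is an illustrative value, not the paper's setup
sizes = {u: hidden_size_for_budget(u, input_dim, budget) for u in ("tanh", "gru", "lstm")}
```

Under these assumptions `sizes` comes out as 102 units for tanh, 47 for GRU and 38 for LSTM, so a gated model gets far fewer hidden units than a tanh-RNN under the same budget.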

## 4.2. Results

In the case of the polyphonic music datasets, **the GRU-RNN outperformed all the others (LSTM-RNN and tanh-RNN) on all the datasets except for Nottingham**.

- However, on these music datasets, all three models performed closely to each other.

On the other hand, **the RNNs with the gating units (GRU-RNN and LSTM-RNN) clearly outperformed the more traditional tanh-RNN on both of the Ubisoft datasets**.

- The LSTM-RNN was best on Ubisoft A, and the GRU-RNN was best on Ubisoft B.

- In the case of the music datasets, the GRU-RNN makes faster progress.
- On the Ubisoft datasets, although the computational requirement for each update in the **tanh-RNN** is much smaller than in the other models, it **did not make much progress with each update**.
- These results clearly indicate **the advantages of the gating units** over the more traditional recurrent units. **Convergence is often faster**, and the final solutions tend to be better.
- **However, the results are not conclusive in comparing the LSTM and the GRU**, which suggests that the choice of the type of gated recurrent unit may depend heavily on the dataset and corresponding task.

GRU is one of the basic units for NLP applications.

## Reference

[2014 NeurIPS] [GRU]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

## Natural Language Processing

**Sequence Modeling: 2014** [GRU]

**Language Model: 2007** [Bengio TNN’07] **2013** [Word2Vec] [NCE] [Negative Sampling]

**Machine Translation: 2014** [Seq2Seq] [RNN Encoder-Decoder] **2015** [Attention Decoder/RNNSearch]

**Image Captioning: 2015** [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]