Review: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (GRU)

Comparable Performance With LSTM, Faster Convergence than Vanilla RNN

Sik-Ho Tsang
7 min read · Nov 7, 2021
RNN, LSTM, & GRU

In this story, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, (GRU), by Université de Montréal, is briefly reviewed. This is a paper from Prof. Bengio’s group. In this paper:

  • GRU is introduced with performance comparable with LSTM.

This is a paper in 2014 NeurIPS with over 7800 citations. (Sik-Ho Tsang @ Medium)


  1. Vanilla Recurrent Neural Network (Vanilla RNN)
  2. Long Short-Term Memory (LSTM)
  3. Gated Recurrent Unit (GRU)
  4. Experimental Results

1. Vanilla Recurrent Neural Network (Vanilla RNN)

Vanilla RNN
  • Traditionally, in a vanilla RNN, the update of the recurrent hidden state ht is: ht = g(W xt + U ht−1),
  • where g is a smooth, bounded function such as a logistic sigmoid function or a hyperbolic tangent function.
  • (In the above figure, g is the hyperbolic tangent function tanh.)
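The vanilla update can be sketched in a few lines of NumPy (a minimal illustration, not the paper's code; the toy sizes and random weights are assumptions for the demo):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One vanilla-RNN update: h_t = g(W x_t + U h_{t-1} + b), with g = tanh."""
    return np.tanh(W @ x_t + U @ h_prev + b)

# Toy dimensions: input size 3, hidden size 4.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))
U = rng.normal(scale=0.1, size=(4, 4))
b = np.zeros(4)

h = np.zeros(4)                      # initial hidden state
for x in rng.normal(size=(5, 3)):    # a sequence of 5 input vectors
    h = rnn_step(x, h, W, U, b)

print(h.shape)  # (4,)
```

Because g is bounded, every entry of h stays in (−1, 1) no matter how long the sequence is.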

However, it is difficult to train RNNs to capture long-term dependencies because the gradients tend to either vanish (most of the time) or explode (rarely, but with severe effects).

  • Many researchers have tried two dominant approaches to reduce the negative impact of this issue.

One is to devise a better learning algorithm than a simple stochastic gradient descent, such as using gradient clipping.

Another one is to design a more sophisticated activation function, such as LSTM and GRU.
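The first remedy, gradient clipping, simply rescales the gradient whenever its norm exceeds a threshold; a minimal NumPy sketch (the threshold value here is an arbitrary choice for the demo):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # 5.0
```

The gradient's direction is preserved; only its magnitude is capped, which tames exploding gradients without changing the descent direction.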

2. Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM)
  • (Please note, there are many LSTM variants, or different presentations for the equations.)
  • Normally, each LSTM unit maintains a memory ct at time t.
  • Each LSTM has 3 gates: Forget gate, input gate, and output gate.

Whenever there is a sigmoid function, it bounds the signal between 0 and 1, which acts as a gate to control the amount of information flow.

2.1. Forget Gate

  • The extent to which the existing memory is forgotten is modulated by a forget gate ft: ft = σ(Wf xt + Uf ht−1),
  • where σ is a logistic sigmoid function, and Wf and Uf are the weights to be learnt.
  • As seen, the forget gate is controlled based on the input xt and the previous hidden state ht−1.

2.2. Input Gate

  • The degree to which the new memory content is added to the memory cell is modulated by an input gate it: it = σ(Wi xt + Ui ht−1).
  • As seen, the input gate is also controlled based on the input xt and the previous hidden state ht−1.
  • But the weights Wi and Ui of the input gate are independent of those in the forget gate.

2.3. Cell State

  • The memory cell ct is updated by partially forgetting the existing memory and adding a new memory content ~ct: ct = ft ⊙ ct−1 + it ⊙ ~ct,
  • where the new memory content ~ct is: ~ct = tanh(Wc xt + Uc ht−1).

2.4. Output Gate

  • ot is an output gate that modulates the amount of memory content exposure: ot = σ(Wo xt + Uo ht−1).
  • Finally, the output ht, or the activation, of the LSTM unit is: ht = ot ⊙ tanh(ct).
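The gates above can be combined into one step function. A minimal NumPy sketch of this simplified LSTM variant (no peephole or bias terms, matching the simplified presentation here; the toy sizes and random weights are assumptions for the demo):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the simplified LSTM variant described in the text."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev)        # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev)        # input gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)  # new memory content
    c = f * c_prev + i * c_tilde                         # cell state update
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev)        # output gate
    h = o * np.tanh(c)                                   # exposed activation
    return h, c

n, m = 4, 3  # hidden size, input size (toy values)
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(n, m if k.startswith("W") else n))
     for k in ["Wf", "Uf", "Wi", "Ui", "Wc", "Uc", "Wo", "Uo"]}

h = c = np.zeros(n)
for x in rng.normal(size=(5, m)):
    h, c = lstm_step(x, h, c, p)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the memory c is carried alongside the activation h; only the output gate decides how much of c is exposed through h.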

3. Gated Recurrent Unit (GRU)

Gated Recurrent Unit (GRU)
  • Similar to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, however, without having a separate memory cell.
  • As seen, there are only 2 sigmoid functions, which means there are only 2 gates, called the reset gate and the update gate.

3.1. Reset Gate

  • The candidate activation ~ht is computed similarly to that of the traditional recurrent unit: ~ht = tanh(W xt + U(rt ⊙ ht−1)),
  • where rt is a set of reset gates and ⊙ is an element-wise multiplication.
  • The reset gate is similar to the forget gate in LSTM: rt = σ(Wr xt + Ur ht−1).
  • When off (rt close to 0), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state.

3.2. Update Gate

  • The activation ht of the GRU at time t is a linear interpolation between the previous activation ht−1 and the candidate activation ~ht: ht = (1 − zt) ⊙ ht−1 + zt ⊙ ~ht,
  • where an update gate zt decides how much the unit updates its activation, or content. The update gate is computed by: zt = σ(Wz xt + Uz ht−1).
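The reset- and update-gate machinery above fits into one short step function. A minimal NumPy sketch (bias terms omitted, matching the simplified presentation here; the toy sizes and random weights are assumptions for the demo):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: reset gate, candidate activation, update-gate interpolation."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)            # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)            # update gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                  # linear interpolation

n, m = 4, 3  # hidden size, input size (toy values)
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(n, m if k.startswith("W") else n))
     for k in ["Wr", "Ur", "Wz", "Uz", "W", "U"]}

h = np.zeros(n)
for x in rng.normal(size=(5, m)):
    h = gru_step(x, h, p)
print(h.shape)  # (4,)
```

Compared with the LSTM sketch, there is no separate cell state: the single vector h is both the memory and the output, so the whole state is exposed at every step.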

This procedure of taking a linear sum between the existing state and the newly computed state is similar to the LSTM unit.

The GRU, however, does not have any mechanism to control the degree to which its state is exposed, but exposes the whole state each time.

  • Both LSTM unit and GRU keep the existing content and add the new content on top of it. This provides two advantages:
  1. Any important feature, decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but be maintained as it is.
  2. It effectively creates shortcut paths that bypass multiple temporal steps. These shortcuts allow the error to be back-propagated easily without too quickly vanishing.

4. Experimental Results

4.1. Datasets and Models

The sizes of the models for different experiments
  • For polyphonic music modeling, 4 polyphonic music datasets are used: Nottingham, JSB Chorales, MuseData, and Piano-midi.
  • These datasets contain sequences in which each symbol is respectively a 93-, 96-, 105-, and 108-dimensional binary vector.
  • A logistic sigmoid function is used for the output units.
  • For speech signal modeling, 2 internal datasets provided by Ubisoft are used. Each sequence is a one-dimensional raw audio signal, and at each time step, a recurrent neural network is designed to look at 20 consecutive samples to predict the following 10 consecutive samples.
  • One has sequences of length 500 (Ubisoft A) and the other sequences of length 8,000 (Ubisoft B). Ubisoft A and Ubisoft B have 7,230 and 800 sequences respectively. A mixture of Gaussians with 20 components is used as the output layer.
  • The size of each model is chosen so that each model has approximately the same number of parameters, and the models are kept small enough to avoid overfitting.
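Matching parameter budgets across unit types follows from a simple count: a tanh unit has one input-to-hidden and one hidden-to-hidden weight block, while the LSTM and GRU multiply that by 4 and 3 respectively. A rough sketch (biases ignored, and the hidden sizes below are illustrative choices, not the paper's actual model sizes):

```python
def recurrent_params(unit, n, m):
    """Rough recurrent-layer parameter count (biases ignored) for hidden size n, input size m."""
    blocks = {"tanh": 1, "gru": 3, "lstm": 4}[unit]  # weight blocks per unit type
    return blocks * (n * n + n * m)

m = 93  # e.g. the Nottingham input dimensionality
# A larger hidden size for the tanh-RNN roughly matches the gated models' budgets:
print(recurrent_params("lstm", 250, m))  # 343000
print(recurrent_params("gru", 290, m))   # 333210
print(recurrent_params("tanh", 500, m))  # 296500
```

Because an LSTM spends 4x (and a GRU 3x) the parameters per hidden unit, the tanh-RNN gets a larger hidden layer at the same budget.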

4.2. Results

The average negative log-probabilities of the training and test sets.

In the case of the polyphonic music datasets, the GRU-RNN outperformed all the others (LSTM-RNN and tanh-RNN) on all the datasets except for the Nottingham.

  • However, we can see that on these music datasets, all three models performed similarly to each other.

On the other hand, the RNNs with the gating units (GRU-RNN and LSTM-RNN) clearly outperformed the more traditional tanh-RNN on both of the Ubisoft datasets.

  • The LSTM-RNN was best with the Ubisoft A, and with the Ubisoft B, the GRU-RNN performed best.
Learning curves for training and validation sets on music datasets
  • In the case of the music datasets, we see that the GRU-RNN makes faster progress.
Learning curves for training and validation sets on Ubisoft datasets
  • If we consider the Ubisoft datasets, it is clear that although the computational requirement for each update in the tanh-RNN is much smaller than the other models, it did not make much progress each update.
  • These results clearly indicate the advantages of the gating units over the more traditional recurrent units. Convergence is often faster, and the final solutions tend to be better.
  • However, the results are not conclusive in comparing the LSTM and the GRU, which suggests that the choice of the type of gated recurrent unit may depend heavily on the dataset and corresponding task.

GRU is one of the basic units for NLP applications.


[2014 NeurIPS] [GRU]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Natural Language Processing

Sequence Modeling: 2014 [GRU]
Language Model: 2007
[Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling]
Machine Translation: 2014
[Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]

My Other Previous Paper Readings


