Review — Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Designed a New Hidden Unit for Statistical Machine Translation (SMT) Using RNN Encoder-Decoder

9 min readOct 9, 2021

In this paper, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, (RNN Encoder-Decoder), by Universit´e de Montr´eal, Jacobs University, and Universit´e du Maine, is reviewed. This is also a paper by Prof. Bengio. In this paper:

RNN Encoder-Decoder is proposed that consists of two recurrent neural networks (RNN), which are jointly trained.
One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.
A new hidden unit that adaptively remembers and forgets are also proposed instead of using LSTM unit.

This is a paper in 2014 EMNLP with over 15000 citations. (Sik-Ho Tsang @ Medium) (I write this story to reinforce my NLP knowledge, if you are familiar with it, please skip to save time, lol.)

Outline

Recurrent Neural Networks (RNN)
RNN Encoder-Decoder
Hidden Unit that Adaptively Remembers and Forgets
Statistical Machine Translation (SMT)
English/French Translation Task Results

1. Recurrent Neural Networks (RNN)

A recurrent neural network (RNN) is a neural network that consists of a hidden state h and an optional output y which operates on a variable-length sequence x = (x1, …, xT). At each time step t, the hidden state h<t> of the RNN is updated by Eq. (1):

where f is a non-linear activation function. f may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit.
An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each timestep t is the conditional distribution p(xt | xt-1, …, x1).
For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function Eq. (2):

where wj are the rows of a weight matrix W.
By combining these probabilities, we can compute the probability of the sequence x using Eq. (3):

2. RNN Encoder-Decoder

2.1. Probabilistic View

As mentioned, a novel neural network architecture is proposed that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence.
From a probabilistic perspective, it learns the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, p(y1, …, yT’ | x1, …, xT).
The input and output sequence lengths T and T’ may differ.

**An illustration of the proposed RNN Encoder–Decoder**

2.2. Encoder

The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1).
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.

2.3. Decoder

The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol yt given the hidden state h<t>.
Both yt and h<t> are also conditioned on yt-1 and on the summary c of the input sequence. Hence, the hidden state of the decoder at time t is computed by Eq. (4):

and similarly, the conditional distribution of the next symbol is Eq. (5):

where f and g are the activation functions.

2.4. Training

The two components of the proposed RNN Encoder-Decoder are jointly trained to maximize the conditional log-likelihood Eq. (6):

where θ is the set of the model parameters and each (xn, yn) is an (input sequence, output sequence) pair from the training set.

2.5. Testing

Once trained, the model can be used in two ways.

One way is to use the model to generate a target sequence given an input sequence.
On the other hand, the model can be used to score a given pair of input and output sequences, where the score is simply a probability pθ(y|x).

3. Hidden Unit that Adaptively Remembers and Forgets

3.1. New Hidden Unit

**An illustration of the proposed hidden activation function**

A new hidden unit, which is much simpler than LSTM, is proposed.
First, the reset gate rj is computed by Eq. (7):

where σ is the logistic sigmoid function, and [.]j denotes the j-th element of a vector. x and ht-1 are the input and the previous hidden state, respectively. Wr and Ur are weight matrices.
Similarly, the update gate zj is computed by Eq. (8):

The actual activation of the proposed unit hj is then computed by Eq. (9):

where (Eq. (10)):

When the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only.
This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.

As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales.

4. Statistical Machine Translation (SMT)

The goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes (Eq. (11)):

where the first term at the right hand side is called translation model and the latter language model.
In practice, however, most SMT systems model log p(f|e) as a loglinear model with additional features and corresponding weights (Eq. (12)):

where fn and wn are the n-th feature and weight, respectively.
Z(e) is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.

5. English/French Translation Task Results

5.1. Data & Datasets

WMT’14 English/French Translation is used for evaluation.
Large amounts of resources are available to build an English/French SMT system in the framework of the WMT’14 translation task.
The bilingual corpora include Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 780M words respectively. The last two corpora are quite noisy.
To train the French language model, about 712M words of crawled newspaper material is available in addition to the target side of the bitexts.
A subset of 418M words is selected out of more than 2G words for language modeling and a subset of 348M out of 850M words is selected for training the RNN Encoder–Decoder.
The test set newstest2012 and 2013 is used for data selection and weight tuning with MERT, and newstest2014 as our test set.
Each set has more than 70 thousand words and a single reference translation.
The source and target vocabulary are limited to the most frequent 15,000 words for both English and French. This covers approximately 93% of the dataset. All the out-of-vocabulary words were mapped to a special token ([UNK]).

5.2. BLEU Scores

**BLEU scores computed on the development and test sets**

(For more details about BLEU: Please feel free to watch: C5W3L06 Bleu Score (Optional))

5.2.1. Baseline

The baseline phrase-based SMT system was built using Moses with default settings. This system achieves a BLEU score of 30.64 and 33.3 on the development and test sets, respectively.

5.2.2. RNN

The proposed RNN Encoder-Decoder uses 1000 hidden units with the proposed gates at the encoder and at the decoder.
The activation function for ~h in Eq. (10) is a hyperbolic tangent.
The computation from the hidden state in the decoder to the output is with a single intermediate layer having 500 Maxout units each pooling 2 inputs.
At each update, 64 randomly selected phrase pairs are used. The model was trained for approximately three days.

Finally, it outperforms baseline.

5.2.3. CSLM

The CSLM model is trained on 7-grams from the target corpus. Each input word was projected into the embedding space R512, and they were concatenated to form a 3072-dimensional vector.
The concatenated vector was fed through two rectified layers (of size 1536 and 1024) (Glorot et al., 2011). The output layer was a simple softmax layer.

5.2.4. CSLM+RNN

As expected, adding features computed by neural networks consistently improves the performance over the baseline performance.

The best performance was achieved when using both CSLM and the phrase scores from the RNN Encoder-Decoder. This suggests that the contributions of the CSLM and the RNN Encoder-Decoder are not too correlated.

5.2.5. CSLM+RNN+WP

Furthermore, Word Penalization (WP) is used which is to penalize the number of words that are unknown to the neural networks (i.e. words which are not in the shortlist), by simply adding the number of unknown words as an additional feature the loglinear model in (Eq. (12)).
But WP cannot achieve better performance on the test set, but only on the development set.

5.3. Qualitative Results

The source phrases that are long (more than 3 words per source phrase) and frequent are selected.

The choices of the target phrases by the RNN Encoder-Decoder are closer to actual or literal translations.
And the RNN Encoder-Decoder prefers shorter phrases in general.

**Samples generated from the RNN Encoder-Decoder for each source phrase**

50 samples are generated and the top-five phrases are shown accordingly to their scores.

RNN Encoder–Decoder is able to propose well-formed target phrases without looking at the actual phrase table.
Importantly, the generated phrases do not overlap completely with the target phrases from the phrase table.

5.4. Word and Phrase Representations

**2D Embedding of the Learned Word Representation**

Since the proposed RNN Encoder-Decoder also projects to and maps back from a sequence of words into a continuous space vector, semantically meaningful embeddings is observed with the proposed model as well.
The representation c in this case is a 1000-dimensional vector. The projection was done by the recently proposed Barnes-Hut-SNE and is shown above.
It is clearly seen that semantically similar words are clustered with each other.

**2D Embedding of the Learned Phrase Representation**

RNN Encoder-Decoder captures both semantic and syntactic structures of the phrases.
For instance, in the bottom-left plot, most of the phrases are about the duration of time, while those phrases that are syntactically similar are clustered together.
The bottom-right plot shows the cluster of phrases that are semantically similar (countries or regions).
On the other hand, the top-right plot shows the phrases that are syntactically similar.

I believe this is one of the beginner papers for NLP.

Reference

[2014 EMNLP] [RNN Encoder-Decoder]
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Natural Language Processing (NLP)

2014 [Seq2Seq] [RNN Encoder-Decoder]