Review — Neural Machine Translation by Jointly Learning to Align and Translate

Using an Attention Decoder to Automatically Search for the Relevant Parts of the Source Sentence in Machine Translation

Sik-Ho Tsang
Oct 16, 2021
Attention Decoder/RNNSearch (Figure from

In this story, Neural Machine Translation by Jointly Learning to Align and Translate (Attention Decoder/RNNSearch), by Jacobs University Bremen and Université de Montréal, is reviewed. This is a paper from the group of Prof. Bengio. In the previous RNN Encoder-Decoder and Seq2Seq models, a single fixed-length vector sits between the encoder and the decoder, so the whole source sentence has to be compressed into it. In this paper:

  • Attention Decoder is designed to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word.

This is a paper in 2015 ICLR with over 20000 citations. (Sik-Ho Tsang @ Medium)


  1. Proposed Architecture Using Attention Decoder
  2. Encoder: Bidirectional RNN (BiRNN)
  3. Decoder: Attention Decoder
  4. Experimental Results

1. Proposed Architecture Using Attention Decoder

Proposed Architecture Using Attention Decoder (Top: Decoder, Bottom: Encoder)
  • The bottom part is the encoder to receive the source sentence as input.
  • The top part is the decoder to output the translated sentence.

2. Encoder: Bidirectional RNN (BiRNN)

Encoder: Bidirectional RNN (BiRNN)
  • Bidirectional RNN (BiRNN) is used as the encoder.
  • For the activation function f of an RNN, the gated hidden unit proposed by RNN Encoder-Decoder is used.
  • This gated unit is similar to LSTM.
  • The model takes a source sentence of 1-of-K coded word vectors x = (x1, ..., xTx) as input, where each xi is a vector of size Kx,
  • And outputs a translated sentence of 1-of-K coded word vectors y = (y1, ..., yTy), where each yi is a vector of size Ky.
  • Here, Kx and Ky are the vocabulary sizes of the source and target languages, respectively. Tx and Ty respectively denote the lengths of the source and target sentences.
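  • First, the forward states →hi of the bidirectional recurrent neural network (BiRNN) are computed with the gated unit, from the embedded word E xi and the previous state →hi-1, starting from →h0 = 0. The backward states ←hi are computed in the same way, reading the sentence in reverse.
  • Here, E is the word embedding matrix, shared by both directions. W and U (with their gate counterparts) are the weight matrices of the gated unit. m and n are the word embedding dimensionality and the number of hidden units, respectively. σ(.) is, as usual, the logistic sigmoid function used by the gates.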
  • The forward and backward states are concatenated to obtain the annotations hi = [→hi; ←hi], where i ranges from 1 to Tx.
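
Written out, the encoder steps listed above take the following form. This is my transcription of the equations in the paper's appendix, not anything new: ∘ denotes element-wise multiplication, z and r are the update and reset gates of the gated unit, and the backward states use the same equations with their own weight matrices, reading the sentence in reverse.

```latex
\begin{aligned}
\overrightarrow{h}_i &=
  \begin{cases}
    (1 - \overrightarrow{z}_i) \circ \overrightarrow{h}_{i-1}
      + \overrightarrow{z}_i \circ \overrightarrow{\tilde{h}}_i, & i > 0 \\
    0, & i = 0
  \end{cases} \\
\overrightarrow{\tilde{h}}_i &= \tanh\!\left(\overrightarrow{W} E x_i
      + \overrightarrow{U}\left[\overrightarrow{r}_i \circ \overrightarrow{h}_{i-1}\right]\right) \\
\overrightarrow{z}_i &= \sigma\!\left(\overrightarrow{W}_z E x_i
      + \overrightarrow{U}_z \overrightarrow{h}_{i-1}\right) \\
\overrightarrow{r}_i &= \sigma\!\left(\overrightarrow{W}_r E x_i
      + \overrightarrow{U}_r \overrightarrow{h}_{i-1}\right) \\
h_i &= \begin{bmatrix} \overrightarrow{h}_i \\ \overleftarrow{h}_i \end{bmatrix},
  \qquad i = 1, \dots, T_x
\end{aligned}
```

With E of size m × Kx, the W matrices of size n × m and the U matrices of size n × n, each annotation hi has 2n elements.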

3. Decoder: Attention Decoder

Decoder: Attention Decoder

3.1. Hidden State si

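  • The hidden state si of the decoder is computed from the previous state si-1, the last generated word yi-1 and the context vector ci built from the encoder annotations, using the same kind of gated unit as in the encoder.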
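
For reference, the decoder state update can be written out as below. Again this is my transcription of the paper's appendix: E here is the target-side word embedding matrix, ∘ is element-wise multiplication, and the C matrices are the extra weights that feed the context vector ci into the gated unit.

```latex
\begin{aligned}
s_i &= (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i \\
\tilde{s}_i &= \tanh\!\left(W E y_{i-1} + U\left[r_i \circ s_{i-1}\right] + C c_i\right) \\
z_i &= \sigma\!\left(W_z E y_{i-1} + U_z s_{i-1} + C_z c_i\right) \\
r_i &= \sigma\!\left(W_r E y_{i-1} + U_r s_{i-1} + C_r c_i\right)
\end{aligned}
```

Compared with the encoder unit, the only structural change is the additional ci term in each of the three lines.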

3.2. Context Vector ci

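  • The context vector ci is recomputed at each decoding step by the alignment model, as a weighted sum of the annotations: ci = Σj αij hj.
  • eij is the alignment score obtained through a small feedforward neural network with si-1 and hj as input: eij = va^T tanh(Wa si-1 + Ua hj), where va is a weight vector and Wa and Ua are weight matrices.
  • αij is the softmax output based on eij and all eik (k from 1 to Tx).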
  • (Hence, to me, the figure in the paper is not clear enough and is somewhat misleading: αij also depends on si-1, not only on hj, so there should be an arrow pointing from si-1 to αij as well.)
  • If ci is fixed to →hTx (the last forward state), the model reduces to RNN Encoder-Decoder (RNNencdec).
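
To make the alignment model concrete, here is a minimal NumPy sketch of one decoding step: scores eij, softmax weights αij, then the context vector ci. The function name `attention_context`, the toy sizes and the random weights are my own assumptions for illustration; this is not the authors' code.

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Return (alpha, c) for one decoding step.

    s_prev: previous decoder state s_{i-1}, shape (n,)
    H:      encoder annotations h_1..h_Tx, shape (Tx, 2n)
    W_a:    shape (n_a, n); U_a: shape (n_a, 2n); v_a: shape (n_a,)
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a      # shape (Tx,)
    # softmax over source positions (shifted by the max for numerical stability)
    w = np.exp(e - e.max())
    alpha = w / w.sum()                              # shape (Tx,)
    # c_i = sum_j alpha_ij h_j
    c = alpha @ H                                    # shape (2n,)
    return alpha, c

# Toy example with arbitrary sizes and random weights (assumptions).
rng = np.random.default_rng(0)
n, n_a, Tx = 4, 5, 6
H = rng.normal(size=(Tx, 2 * n))
s_prev = rng.normal(size=n)
W_a = rng.normal(size=(n_a, n))
U_a = rng.normal(size=(n_a, 2 * n))
v_a = rng.normal(size=n_a)
alpha, c = attention_context(s_prev, H, W_a, U_a, v_a)
```

Note that `s_prev` enters every score, which is exactly why αij depends on si-1. The term Ua hj does not depend on i, so in practice it can be computed once per source sentence and reused at every decoding step.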

3.3. Target Word yi

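  • With the decoder state si-1, the context ci and the last generated word yi-1, the probability of a target word yi is proportional to exp(yi^T Wo ti), i.e. a softmax over the target vocabulary.
  • Here, ti is obtained by taking the maximum over each pair of adjacent elements of a vector ~ti.
  • And ~ti is a linear function of the decoder state si-1, the context ci and the embedding of the last generated word yi-1.
  • The max operation above is the single Maxout hidden layer.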
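
The output layer can be sketched in a few lines of NumPy. The weight names Uo, Vo, Co and Wo follow the paper's appendix as I read it; the function name `target_word_probs`, the toy sizes and the random values are assumptions made only for this illustration.

```python
import numpy as np

def target_word_probs(s_prev, y_prev_emb, c, U_o, V_o, C_o, W_o):
    """Probability of each target word given s_{i-1}, E y_{i-1} and c_i."""
    # t~_i = U_o s_{i-1} + V_o E y_{i-1} + C_o c_i, shape (2l,)
    t_tilde = U_o @ s_prev + V_o @ y_prev_emb + C_o @ c
    # Maxout: maximum over each adjacent pair (t~_{2j-1}, t~_{2j}), shape (l,)
    t = t_tilde.reshape(-1, 2).max(axis=1)
    # p(y_i = k | ...) is proportional to exp(W_o[k] . t): softmax over the vocabulary
    logits = W_o @ t
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Toy example with arbitrary sizes and random weights (assumptions).
rng = np.random.default_rng(0)
n, m, l, K_y = 4, 3, 5, 7
p = target_word_probs(
    s_prev=rng.normal(size=n),
    y_prev_emb=rng.normal(size=m),      # stands in for E y_{i-1}
    c=rng.normal(size=2 * n),
    U_o=rng.normal(size=(2 * l, n)),
    V_o=rng.normal(size=(2 * l, m)),
    C_o=rng.normal(size=(2 * l, 2 * n)),
    W_o=rng.normal(size=(K_y, l)),
)
```

The Maxout layer halves the dimensionality (2l down to l) and has no weights of its own beyond those that produce ~ti.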

4. Experimental Results

4.1. Dataset

  • WMT ’14 contains the following English-French parallel corpora: Europarl (61M words), news commentary (5.5M), UN (421M) and two crawled corpora of 90M and 272.5M words respectively, totaling 850M words.
  • The size of the combined corpus is reduced to 348M words.
  • Monolingual data is not used; only the parallel corpora are used.
  • news-test-2012 and news-test-2013 are concatenated as a development (validation) set.
  • The models are evaluated on the test set (news-test-2014) from WMT ’14, which consists of 3003 sentences not present in the training data.
  • After a usual tokenization, a shortlist of 30,000 most frequent words is used in each language to train the models.
  • Any word not included in the shortlist is mapped to a special token ([UNK]).
  • No other special preprocessing, such as lowercasing or stemming, is applied to the data.
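
The shortlist and [UNK] mapping described above can be sketched as follows. The helper names `build_shortlist` and `map_unknowns` and the three-sentence toy corpus are my own; the paper uses a shortlist of 30,000 words per language, while the toy example uses 2 so the effect is visible.

```python
from collections import Counter

UNK = "[UNK]"

def build_shortlist(tokenized_sentences, size):
    """Return the set of the `size` most frequent tokens."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}

def map_unknowns(sentence, shortlist):
    """Replace every token outside the shortlist with the [UNK] token."""
    return [tok if tok in shortlist else UNK for tok in sentence]

# Toy corpus: no lowercasing is applied, so "Cat" and "cat" are different tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "Cat", "ran"],
]
shortlist = build_shortlist(corpus, size=2)          # {"the", "sat"}
mapped = map_unknowns(["the", "cat", "sat", "down"], shortlist)
print(mapped)  # ['the', '[UNK]', 'sat', '[UNK]']
```

This is also why the BLEU table reports separate numbers for sentences without unknown words: every out-of-shortlist word collapses into the same token.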

4.2. Models

  • Two models: The first one is an RNN Encoder–Decoder (RNNencdec), the second one is the proposed attention decoder, i.e. RNNSearch.
  • Each model is trained twice: first with sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).
  • The encoder and decoder of the RNNencdec have 1000 hidden units each.
  • A multilayer network with a single Maxout hidden layer is used to compute the conditional probability of each target word.
  • A minibatch of 80 sentences is used, and each model is trained for approximately 5 days.
  • Once a model is trained, beam search is used to find a translation that approximately maximizes the conditional probability.
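
As a rough illustration of the decoding step, below is a generic beam search in plain Python. It is a simplified sketch, not the authors' implementation: the `step_fn` interface, the toy probability table and the beam size are all assumptions, and there is no length normalization.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=10):
    """Approximately maximize the sequence probability.

    step_fn(prefix) -> {token: probability} for the next token
    (a hypothetical interface standing in for the trained decoder).
    """
    beams = [([bos], 0.0)]            # (token sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [tok], score + math.log(prob)))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # hypotheses ending in eos are set aside, the rest keep growing
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)            # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda cand: cand[1])

# Toy next-token model that only looks at the last token (assumption).
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"c": 0.5, "d": 0.5},
    "b": {"c": 0.9, "d": 0.1},
    "c": {"</s>": 1.0},
    "d": {"</s>": 1.0},
}

def toy_step(prefix):
    return TABLE[prefix[-1]]

best_seq, best_score = beam_search(toy_step, "<s>", "</s>", beam_size=2)
greedy_seq, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_size=1)
```

In the toy table, greedy decoding (beam size 1) commits to "a" first and ends with probability 0.3, while a beam of 2 keeps "b" alive and finds the sequence with probability 0.36. That gap is the reason to search over several hypotheses instead of picking the best word at each step.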

4.3. BLEU Results

BLEU scores of the trained models computed on the test set (RNNsearch-50* was trained much longer)
  • It is clear from the table that in all the cases, the proposed RNNsearch outperforms the conventional RNNencdec.

More importantly, the performance of the RNNsearch is as high as that of the conventional phrase-based translation system (Moses) when only the sentences consisting of known words are considered. This is a significant achievement, considering that Moses uses a separate monolingual corpus (418M words) in addition to the parallel corpora, whereas RNNsearch and RNNencdec are trained on the parallel corpora only.

The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences
  • In the above figure, it can be seen that the performance of RNNencdec dramatically drops as the length of the sentences increases.
  • On the other hand, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences.

RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more.

  • And the RNNsearch-30 even outperforms RNNencdec-50.

4.4. Qualitative Results

Four sample alignments found by RNNsearch-50 (a) an arbitrary sentence. (b–d) three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set.
  • The alignment of words between English and French is largely monotonic. We see strong weights along the diagonal of each matrix.
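  • (d): Any hard alignment will map [the] to [l’] and [man] to [homme]. This is not helpful for translation, because the word following [the] must be considered to decide whether it should become [le], [la], [les] or [l’]. The proposed soft-alignment solves this issue naturally by letting the model look at both [the] and [man], and in this example, we see that the model was able to correctly translate [the] into [l’].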

4.5. Long Sentence

  • The proposed model (RNNsearch) is much better than the conventional model (RNNencdec) at translating long sentences. Consider this source sentence from the test set:
Input Long Sentences
  • The RNNencdec-50 translated this sentence into:
Output by RNNencdec-50
  • The RNNencdec-50 correctly translated the source sentence until [a medical center]. However, from there on (underlined), it deviated from the original meaning of the source sentence.
  • On the other hand, the RNNsearch-50 generated the following correct translation, preserving the whole meaning of the input sentence without omitting any details:
Output by RNNsearch-50
  • (The above are the authors' comments. I do not understand French...)


[2015 ICLR] [Attention Decoder/RNNSearch]
Neural Machine Translation by Jointly Learning to Align and Translate

Natural Language Processing (NLP)

Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]
