Review — Neural Machine Translation by Jointly Learning to Align and Translate

Using an Attention Decoder to Automatically Search for the Relevant Parts of the Source Sentence for Machine Translation

Attention Decoder/RNNSearch


  1. Proposed Architecture Using Attention Decoder
  2. Encoder: Bidirectional RNN (BiRNN)
  3. Decoder: Attention Decoder
  4. Experimental Results

1. Proposed Architecture Using Attention Decoder

Proposed Architecture Using Attention Decoder (Top: Decoder, Bottom: Encoder)
  • The top part is the decoder, which outputs the translated sentence.
  • The bottom part is the bidirectional RNN encoder, which reads the source sentence.

2. Encoder: Bidirectional RNN (BiRNN)

Encoder: Bidirectional RNN (BiRNN)
  • For the activation function f of the RNN, the gated hidden unit proposed for the RNN Encoder-Decoder is used.
  • This gated unit is similar to an LSTM unit.
  • The model takes a source sentence of 1-of-K coded word vectors as input.
  • First, the forward states of the bidirectional recurrent neural network (BiRNN) are computed; the backward states are computed analogously by reading the sentence in reverse:
  • The forward and backward states are concatenated to obtain the annotations hj:
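The encoder steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses a plain tanh RNN cell in place of the gated hidden unit, random weights, and tiny dimensions; all names (`rnn_states`, `fwd`, `bwd`, `H`) are hypothetical.

```python
import numpy as np

def rnn_states(X, W, U, b):
    """Run a plain tanh RNN over the rows of X and return all hidden states.
    A simplified stand-in for the gated hidden unit used in the paper."""
    h = np.zeros(U.shape[0])
    states = []
    for x in X:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
Tx, d_in, d_h = 5, 4, 3              # source length, embedding size, hidden size
X = rng.normal(size=(Tx, d_in))      # embedded source words x_1 .. x_Tx
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

fwd = rnn_states(X, W, U, b)             # forward states, read left to right
bwd = rnn_states(X[::-1], W, U, b)[::-1] # backward states, read right to left
H = np.concatenate([fwd, bwd], axis=1)   # annotations h_j = [fwd_j ; bwd_j]
print(H.shape)  # (5, 6): one annotation per source word
```

Each annotation hj thus summarizes both the words preceding and the words following position j, which is what the decoder later attends over.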

3. Decoder: Attention Decoder

Decoder: Attention Decoder

3.1. Hidden State si

  • The hidden state si of the decoder given the annotations from the encoder is computed by:
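From the paper's definitions, the state update referenced here can be written as:

```latex
s_i = f\left(s_{i-1},\, y_{i-1},\, c_i\right)
```

where f is the gated hidden unit, yi−1 is the previously generated word, and ci is the context vector computed from the annotations (Section 3.2).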

3.2. Context Vector ci

  • The context vector ci is recomputed at each step by the alignment model:
  • αij is the softmax output based on eij and all eik.
  • (To me, the figure in the paper is not clear enough here and is somewhat misleading, because αij is also calculated based on si−1 besides hj. There should be an arrow pointing from si−1 to αij as well.)
  • If ci is fixed to the last forward hidden state →hTx, the model reduces to the RNN Encoder-Decoder (RNNencdec).

3.3. Target Word yi

  • With the decoder state si-1, the context ci and the last generated word yi-1, the probability of a target word yi is defined as:
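Following the paper's appendix, with a single maxout hidden layer this probability is:

```latex
p(y_i \mid s_i, y_{i-1}, c_i) \propto \exp\left(y_i^{\top} W_o\, t_i\right),
\qquad
\tilde{t}_i = U_o\, s_{i-1} + V_o\, E\, y_{i-1} + C_o\, c_i
```

where ti is obtained from t̃i by taking pairwise maxima max{t̃i,2j−1, t̃i,2j} (maxout), E is the target word embedding matrix, and Wo, Uo, Vo, Co are weight matrices.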

4. Experimental Results

4.1. Dataset

  • WMT ’14 contains the following English-French parallel corpora: Europarl (61M words), news commentary (5.5M), UN (421M) and two crawled corpora of 90M and 272.5M words respectively, totaling 850M words.
  • The size of the combined corpus is reduced to 348M words.
  • Monolingual data is not used; only parallel corpora are used.
  • news-test-2012 and news-test-2013 are concatenated as a development (validation) set.
  • The models are evaluated on the test set (news-test-2014) from WMT ’14, which consists of 3003 sentences not present in the training data.
  • After the usual tokenization, a shortlist of the 30,000 most frequent words in each language is used to train the models.
  • Any word not included in the shortlist is mapped to a special token ([UNK]).
  • No other special preprocessing, such as lowercasing or stemming, is applied to the data.

4.2. Models

  • Two models: The first one is an RNN Encoder–Decoder (RNNencdec), the second one is the proposed attention decoder, i.e. RNNSearch.
  • Each model is trained twice: first with the sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with the sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).
  • The encoder and decoder of the RNNencdec have 1000 hidden units each.
  • A multilayer network with a single Maxout hidden layer is used to compute the conditional probability of each target word.
  • A minibatch of 80 sentences is used, and each model is trained for approximately 5 days.
  • A beam search is used to find a translation that approximately maximizes the conditional probability.
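Beam search keeps only the top-scoring partial translations at each step instead of expanding all of them. The sketch below is a generic, simplified version over a toy next-token distribution; the names (`beam_search`, `toy_model`) and the toy probabilities are hypothetical, and the real decoder would condition on si and ci rather than on a lookup table.

```python
import math

def beam_search(step_probs, beam_width=2, max_len=20):
    """Generic beam search: step_probs(prefix) -> {token: prob} for the next token.
    Keeps the beam_width highest log-probability prefixes; '</s>' ends a hypothesis."""
    beams = [((), 0.0)]                      # (token tuple, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for tok, p in step_probs(prefix).items():
                cand = (prefix + (tok,), lp + math.log(p))
                (finished if tok == "</s>" else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    return max(finished + list(beams), key=lambda c: c[1])

# Toy next-token distribution (hypothetical): prefers the sequence "a b </s>".
def toy_model(prefix):
    table = {(): {"a": 0.6, "b": 0.4},
             ("a",): {"b": 0.7, "</s>": 0.3},
             ("a", "b"): {"</s>": 1.0},
             ("b",): {"</s>": 1.0}}
    return table.get(prefix, {"</s>": 1.0})

best, logp = beam_search(toy_model)
print(best)  # ('a', 'b', '</s>')
```

With beam_width=1 this degenerates to greedy decoding; a wider beam trades compute for a better approximation of the maximum-probability translation.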

4.3. BLEU Results

BLEU scores of the trained models computed on the test set (RNNsearch-50* was trained much longer)
The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences
  • While the BLEU scores of the RNNencdec models drop sharply as sentence length increases, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences.

4.4. Qualitative Results

Four sample alignments found by RNNsearch-50. (a) An arbitrary sentence. (b–d) Three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set.
  • (d): Any hard alignment will map [the] to [l’] and [man] to [homme]. The proposed soft-alignment solves this issue naturally by letting the model look at both [the] and [man], and in this example, we see that the model was able to correctly translate [the] into [l’].

4.5. Long Sentence

  • The proposed model (RNNsearch) is much better than the conventional model (RNNencdec) at translating long sentences. Consider this source sentence from the test set:
Input Long Sentences
Output by RNNencdec-50
  • On the other hand, the RNNsearch-50 generated the following correct translation, preserving the whole meaning of the input sentence without omitting any details:
Output by RNNsearch-50


[2015 ICLR] [Attention Decoder/RNNSearch]
Neural Machine Translation by Jointly Learning to Align and Translate

Natural Language Processing (NLP)

Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]

My Other Previous Paper Readings


