Review — Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (GNMT)

GNMT, Wordpiece Model, Using Deep LSTM With Residual Connections

Sik-Ho Tsang
Nov 20, 2021

In this story, Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (GNMT), by Google, is reviewed. In this paper:

  • A deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network is used.
  • Parallelism is used to reduce the training time.
  • Low-precision arithmetic is used for faster inference.
  • Wordpiece-based translation is used instead of word or character-based.
  • The beam search technique employs a length-normalization procedure and a coverage penalty.
  • Reinforcement learning is used to refine the model to directly optimize the BLEU score.

This is a 2016 arXiv paper with over 4800 citations. (Sik-Ho Tsang @ Medium) The paper contains many improvements, which is also reflected in its long author list of 31 authors. I try to keep this article short.

In Sep 2016, Google started to use Neural Machine Translation (NMT) in place of Statistical Machine Translation (SMT), which had been used since Oct 2007.


  1. GNMT Network Architecture
  2. Model Parallelism
  3. Wordpiece Model or Mixed Word/Character Model
  4. Model Training
  5. Quantizable Model and Quantized Inference
  6. Experimental Results

1. GNMT Network Architecture

GNMT: Network Architecture
  • On the left is the encoder network, on the right is the decoder network, in the middle is the attention module.
  • Let (X, Y) be a source and target sentence pair. Let X=x1, x2, x3, …, xM be the sequence of M symbols in the source sentence and let Y=y1, y2, y3, …, yN be the sequence of N symbols in the target sentence.
  • The encoder transforms the source symbols x1, x2, x3, …, xM into a list of fixed-size vectors, one vector per input symbol.

1.1. Encoder

The structure of bi-directional connections in the first layer of the encoder
  • The bottom encoder layer is bi-directional: the pink nodes gather information from left to right while the green nodes gather information from right to left.
  • The other layers of the encoder are uni-directional.

1.2. Attention

The attention module is similar to the one in Attention Decoder.

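  • More specifically, let yi-1 be the decoder-RNN output from the past decoding time step. The attention context ai for the current time step is computed by scoring every encoder output xt against yi-1 with AttentionFunction, normalizing the scores with a softmax, and taking the resulting weighted sum of the encoder outputs.
  • AttentionFunction in the paper's implementation is a feed-forward network with one hidden layer.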
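To make the attention step concrete, here is a minimal pure-Python sketch of the score, softmax, and weighted-sum pattern described above. The tanh activation, the weight names (W_y, W_x, v), and all numeric values are my own assumptions for illustration; the paper only states that AttentionFunction is a feed-forward network with one hidden layer.

```python
import math

def attention_function(y_prev, x_t, W_y, W_x, v):
    """Hypothetical one-hidden-layer scorer: v . tanh(W_y y_prev + W_x x_t)."""
    hidden = [
        math.tanh(sum(wy * y for wy, y in zip(W_y[h], y_prev))
                  + sum(wx * x for wx, x in zip(W_x[h], x_t)))
        for h in range(len(v))
    ]
    return sum(vh * hh for vh, hh in zip(v, hidden))

def attention_context(y_prev, encoder_outputs, W_y, W_x, v):
    """Return (attention probabilities, context vector) for one decoder step."""
    scores = [attention_function(y_prev, x_t, W_y, W_x, v)
              for x_t in encoder_outputs]
    # Softmax over source positions (max subtracted for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Context = probability-weighted sum of the encoder outputs.
    dim = len(encoder_outputs[0])
    context = [sum(p * x_t[d] for p, x_t in zip(probs, encoder_outputs))
               for d in range(dim)]
    return probs, context

# Toy inputs (made up): 3 source positions, 2-dimensional vectors.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_prev = [0.5, -0.5]
W_y = [[0.1, 0.2], [0.3, -0.1]]
W_x = [[0.5, -0.5], [0.2, 0.4]]
v = [1.0, -1.0]

probs, context = attention_context(y_prev, encoder_outputs, W_y, W_x, v)
```

The probabilities sum to 1, so the context vector is always a convex combination of the encoder outputs.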

1.3. Decoder

  • 8 LSTM layers are used for the decoder.
  • Beam search is used during decoding to find the sequence Y that maximizes a score function s(Y, X) given a trained model.
  • During beam search, 8–12 hypotheses are typically kept but it is found that using fewer (4 or 2) has only slight negative effects on BLEU scores.
  • Two important refinements are made to the pure max-probability-based beam search algorithm: a coverage penalty [42] and length normalization.
  • Pruning hypotheses during beam search speeds up the search by 30%-40% when run on CPUs.
  • (Please feel free to read the paper for more details.)
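The two refinements can be sketched as a small scoring function. This follows the form of the score as I recall it from the paper, s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y), so please treat it as a sketch and check the paper for the exact definition. The function names and the default values of alpha and beta are illustrative assumptions.

```python
import math

def length_penalty(length, alpha):
    """lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha; alpha = 0 disables it."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def coverage_penalty(attention, beta):
    """attention[j][i] is the attention probability of target step j on
    source position i. Source positions whose total attention is below 1
    are penalized; fully covered positions contribute log(1) = 0.
    Assumes every source position receives some attention (log(0) is undefined)."""
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(step[i] for step in attention)
        total += math.log(min(covered, 1.0))
    return beta * total

def beam_score(log_prob, attention, alpha=0.2, beta=0.2):
    """Length-normalized log-probability plus coverage penalty."""
    target_length = len(attention)
    return (log_prob / length_penalty(target_length, alpha)
            + coverage_penalty(attention, beta))

# Toy attention matrices (made up): 2 target steps over 2 source positions.
full_coverage = [[1.0, 0.0], [0.0, 1.0]]
poor_coverage = [[0.9, 0.1], [0.9, 0.1]]

score_full = beam_score(-2.0, full_coverage)
score_poor = beam_score(-2.0, poor_coverage)
```

With the same log-probability, the hypothesis that attends to the whole source sentence gets the higher score, and a larger alpha favors longer hypotheses.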

1.4. Residual Connections

Left: Normal Stacked LSTM, Right: Stacked LSTM with Residual Connections
  • Left: Deep normal stacked LSTMs are difficult to train due to exploding and vanishing gradient problems:
  • It is found that simple stacked LSTM layers work well up to 4 layers, barely with 6 layers, and very poorly beyond 8 layers.

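Right: Residual connections greatly improve the gradient flow in the backward pass, so 8 LSTM layers can be used for both the encoder and the decoder:

  • The output m of each LSTM layer is added element-wise to that layer's input x, and the sum becomes the input to the next LSTM layer.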
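The wiring difference between the two stacks can be shown with a toy example. Note that toy_layer below is only a stand-in for an LSTM layer (an element-wise tanh with a made-up weight), so this illustrates the residual wiring and how the identity path preserves the signal, not the actual GNMT layers.

```python
import math

def toy_layer(x, weight):
    """Stand-in for one LSTM layer (an assumption): element-wise tanh(weight * x)."""
    return [math.tanh(weight * value) for value in x]

def plain_stack(x, weights):
    """Normal stacking: each layer's output is the next layer's input."""
    for weight in weights:
        x = toy_layer(x, weight)
    return x

def residual_stack(x, weights):
    """Residual stacking: the layer output m is added to the layer input x
    before being passed to the next layer."""
    for weight in weights:
        m = toy_layer(x, weight)
        x = [m_i + x_i for m_i, x_i in zip(m, x)]
    return x

x0 = [0.5, -1.0]
weights = [0.1] * 8  # 8 layers with a deliberately small weight

plain_out = plain_stack(x0, weights)
residual_out = residual_stack(x0, weights)
```

With these small weights the plain stack shrinks the signal towards zero after 8 layers, while the residual stack keeps it at least as large as the input.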

2. Model Parallelism

  • For data parallelism, n model replicas are trained concurrently, and all of them share one copy of the model parameters. n is often around 10. Each replica works on a mini-batch of m sentence pairs at a time, where m is often 128 in the experiments.

The encoder and decoder networks are partitioned along the depth dimension and are placed on multiple GPUs, effectively running each layer on a different GPU, as shown in the figure of GNMT network architecture.

  • Since all but the first encoder layer are uni-directional, layer i+1 can start its computation before layer i is fully finished.

The softmax layer is also partitioned, with each partition responsible for a subset of symbols.
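To see why placing each layer on its own GPU helps, here is a rough back-of-the-envelope sketch. It assumes one device per layer, unit cost for every (layer, time step) cell, no communication overhead, and uni-directional layers only, so the bi-directional bottom encoder layer is not modelled. These are my simplifications, not measurements from the paper.

```python
def sequential_steps(num_layers, num_timesteps):
    """One device: every (layer, time step) cell runs one after another."""
    return num_layers * num_timesteps

def pipelined_schedule(num_layers, num_timesteps):
    """start[layer][t] = earliest step at which `layer` can process time step t.
    A cell needs the layer below at the same time step and the same layer
    at the previous time step to be finished."""
    start = [[0] * num_timesteps for _ in range(num_layers)]
    for layer in range(num_layers):
        for t in range(num_timesteps):
            ready_below = start[layer - 1][t] + 1 if layer > 0 else 0
            ready_prev = start[layer][t - 1] + 1 if t > 0 else 0
            start[layer][t] = max(ready_below, ready_prev)
    return start

def pipelined_steps(num_layers, num_timesteps):
    """Total steps when every layer runs on its own device."""
    schedule = pipelined_schedule(num_layers, num_timesteps)
    return schedule[-1][-1] + 1

# 8 layers over a 20-symbol sentence.
seq = sequential_steps(8, 20)
pipe = pipelined_steps(8, 20)
```

Under these assumptions the pipelined version needs num_layers + num_timesteps - 1 steps instead of num_layers * num_timesteps.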

3. Wordpiece Model or Mixed Word/Character Model

  • There are two common categories: word-based and character-based models. In this work, GNMT uses a wordpiece model.
  • To be brief, a word-based model predicts word by word, and rare words are difficult to handle.
  • A character-based model predicts character by character, so the meaning of words is somewhat lost.
  • An example of turning words into wordpieces:
  • Word: Jet makers feud over seat width with big orders at stake
  • Wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

Wordpieces achieve a balance between the flexibility of characters and efficiency of words.

  • (Please feel free to read the paper for more details.)
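The example above can be reproduced with a small segmentation sketch. Please note two assumptions: the vocabulary below is a hypothetical one that I made up to match the example (the real wordpiece vocabulary is learned from data), and the greedy longest-match-first rule is a simple stand-in that I chose for illustration, not necessarily the segmentation procedure used in GNMT.

```python
def to_wordpieces(sentence, vocab):
    """Greedy longest-match-first segmentation (an illustrative assumption).
    Each word is prefixed with "_" to mark the word boundary."""
    pieces = []
    for word in sentence.split():
        token = "_" + word
        start = 0
        while start < len(token):
            end = len(token)
            # Shrink the candidate until it is found in the vocabulary.
            while end > start and token[start:end] not in vocab:
                end -= 1
            if end == start:
                # Nothing matched: fall back to a single character.
                end = start + 1
            pieces.append(token[start:end])
            start = end
    return pieces

def from_wordpieces(pieces):
    """Recover the original sentence from the wordpiece sequence."""
    return "".join(pieces).replace("_", " ").strip()

# Hypothetical vocabulary, chosen only to reproduce the example.
vocab = {"_J", "et", "_makers", "_fe", "ud", "_over", "_seat", "_width",
         "_with", "_big", "_orders", "_at", "_stake"}

sentence = "Jet makers feud over seat width with big orders at stake"
pieces = to_wordpieces(sentence, vocab)
restored = from_wordpieces(pieces)
```

Because the word-boundary marker is kept in the pieces, the original sentence can be recovered from the wordpiece sequence without ambiguity.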

4. Model Training

  • The standard maximum-likelihood (ML) training objective is used:
  • But this does not directly improve the BLEU scores.
  • In brief, model refinement is done using the expected reward objective by reinforcement learning (RL):
  • where r(Y, Y *(i)) denotes the per-sentence score, and we are computing an expectation over all of the output sentences Y, up to a certain length.
  • To further stabilize training, a linear combination of ML and RL objectives is optimized as:
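  • Because BLEU was designed as a corpus-level measure and has undesirable properties when used on single sentences, the GLEU score is proposed as the per-sentence reward.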
  • (Please feel free to read the paper for more details.)
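As I understand it from the paper, GLEU counts all n-grams of 1 to 4 tokens in the output and in the target, computes a precision and a recall over the matching n-grams, and takes the minimum of the two. A minimal sketch is below; counting matches with clipped counts (a multiset intersection) is my assumption about what "matching n-grams" means.

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def gleu(output_tokens, target_tokens, max_n=4):
    """min(precision, recall) over n-grams of length 1..max_n."""
    out = ngram_counts(output_tokens, max_n)
    tgt = ngram_counts(target_tokens, max_n)
    total_out = sum(out.values())
    total_tgt = sum(tgt.values())
    if total_out == 0 or total_tgt == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((out & tgt).values())
    precision = matches / total_out
    recall = matches / total_tgt
    return min(precision, recall)

target = "the cat sat on the mat".split()
output = "the cat sat".split()
score = gleu(output, target)
```

In this toy case every n-gram of the output appears in the target (precision 1), but the output only covers 6 of the 18 target n-grams, so the score is 1/3. The score is symmetric in its two arguments and ranges from 0 (no match) to 1 (identical).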

5. Quantizable Model and Quantized Inference

  • Neural machine translation is computationally intensive at inference, making low latency translation difficult, and high volume deployment computationally expensive.
  • For quantized inference, GNMT explicitly constrains the values of the LSTM accumulators to be within [-δ, δ], which guarantees a fixed range that can be used for quantization later. The forward computation of the LSTM stack with residual connections is modified accordingly, clipping these values at every step.
  • All floating-point operations are then replaced with fixed-point integer operations at either 8-bit or 16-bit resolution.
  • The softmax inputs (logits) are also clipped, to the range [-γ, γ].
  • During training, full precision is used. The only constraints added to the model during training are the clipping operations.
Log perplexity vs. steps
  • The log perplexity curves show that these clipping constraints do not hurt training.
Model inference on CPU, GPU and TPU
  • Inference on CPU is faster than on GPU, owing to the data-transfer overhead between the host and the GPU.
  • Inference on TPU, which is optimized for this kind of quantized workload, is much faster.
  • (Please feel free to read the paper for more details.)
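The two ingredients of this section, clipping to a fixed range and storing values at 8-bit resolution, can be sketched as follows. The symmetric max-abs scaling to the range [-127, 127] and the numeric values of delta and gamma are illustrative assumptions on my side; please see the paper for the exact scheme.

```python
def clip(values, bound):
    """Constrain every value to the range [-bound, bound]."""
    return [max(-bound, min(bound, value)) for value in values]

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale
    (symmetric max-abs scaling, an illustrative assumption)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [int(round(w / scale)) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate reconstruction of the original floats."""
    return [q * scale for q in quantized]

# Illustrative values only.
delta, gamma = 1.0, 25.0
accumulators = clip([-3.0, 0.5, 9.0], delta)  # e.g. LSTM accumulator values
logits = clip([-40.0, 3.0, 31.5], gamma)      # softmax inputs

weights = [0.5, -1.0, 0.25]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Since the clipped values are guaranteed to stay within a known range, the rounding error of each reconstructed weight is at most half a quantization step.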

6. Experimental Results

6.1. ML Training Models

Single model results on WMT En > Fr (newstest2014)
  • The best vocabulary size for the mixed word-character model is 32K.
  • The best model, WPM-32K, achieves a BLEU score of 38.95. Note that this BLEU score represents the averaged score of 8 models. The maximum BLEU score of the 8 models is higher at 39.37.
Single model results on WMT En > De (newstest2014)
  • WMT En > De is considered a more difficult task than WMT En > Fr as it has much less training data.
  • It is more advantageous to use wordpiece or mixed word/character models, which provide a gain of more than 2 BLEU points on top of the word model and about 4 BLEU points on top of previously reported results in [6] and Deep-Att [45].

6.2. RL Training Models

Single model test BLEU scores, averaged over 8 runs
  • On WMT En > Fr, model refinement improves BLEU score by close to 1 point.
  • On WMT En > De, RL-refinement slightly hurts the test performance.

6.3. Model Ensemble and Human Evaluation

Model ensemble results on WMT En > Fr (newstest2014)
Model ensemble results on WMT En > De (newstest2014)
  • GNMT ensembles 8 RL-refined models to obtain a state-of-the-art result of 41.16 BLEU points on the WMT En > Fr dataset, outperforming Deep-Att.
  • GNMT ensembles 8 RL-refined models to obtain a state-of-the-art result of 26.30 BLEU points on the WMT En > De dataset.
Human side-by-side evaluation scores of WMT En > Fr models
  • During the side-by-side comparison, humans are asked to rate four translations given a source sentence.
  • Side-by-side scores range from 0 to 6, with a score of 0 meaning “completely nonsense translation”, and a score of 6 meaning “perfect translation”.
  • The results show that even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality.

6.4. Results on Production Data

Histogram of side-by-side scores on 500 sampled sentences from Wikipedia and news websites for a typical language pair, here English > Spanish (PBMT blue, GNMT red, Human orange)
Mean of side-by-side scores on production data
  • Google’s translation production corpora are two to three decimal orders of magnitude bigger than the WMT corpora.
  • GNMT reduces translation errors by more than 60% compared to the PBMT model on these major pairs of languages.


[2016 arXiv] [GNMT]
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Natural Language Processing

Sequence Model: 2014 [GRU] [Doc2Vec]
Language Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling]
Sentence Embedding: 2015 [Skip-Thought]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]
