Brief Review — Understanding Back-Translation at Scale

Back Translation+Sampling, Effective in All but Resource-Poor Settings

Sik-Ho Tsang
3 min read · Sep 24, 2022
Back Translation

Understanding Back-Translation at Scale
Back-Translation+Sampling, by Facebook AI Research and Google Brain
2018 EMNLP, Over 700 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT

  • Back Translation of target-language sentences is an effective method to generate synthetic source sentences.
  • It is found that in all but resource-poor settings, back-translations obtained via sampling or noised beam outputs are most effective.


  1. Back Translation+Sampling
  2. Results

1. Back Translation+Sampling

  • Back Translation typically uses beam search or just greedy search. Both focus on the most likely outputs, so the synthetic source sentences do not properly cover the true data distribution.

As alternatives, sampling from the model distribution and adding noise to beam search outputs are considered.
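To make the setting concrete, here is a minimal sketch of the back-translation data flow: a target-to-source model turns monolingual target sentences into synthetic source sentences, and the resulting pairs are added to the real bitext. The names `back_translate`, `build_training_set` and `toy_reverse_model` are hypothetical and only for illustration; the toy dictionary lookup stands in for a trained NMT model, and its decoding strategy (beam, sampling, beam+noise) is what the paper varies.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each monolingual target sentence with a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in target_sentences]


def build_training_set(bitext: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Combine real parallel data with the synthetic pairs."""
    return list(bitext) + list(synthetic)


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical stand-in for a trained target-to-source (German-to-English) model."""
    lexicon = {"hallo": "hello", "welt": "world"}
    return " ".join(lexicon.get(w, w) for w in sentence.lower().split())


bitext = [("good morning", "guten morgen")]
monolingual_target = ["Hallo Welt"]

synthetic = back_translate(monolingual_target, toy_reverse_model)
training_set = build_training_set(bitext, synthetic)
```

Note that the target side of every synthetic pair is genuine text; only the source side is machine-generated.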

1.1. Sampling

  • Unrestricted sampling is explored, which generates outputs that are very diverse but sometimes highly unlikely.

Sampling restricted to the most likely words is also investigated. At each time step, the k most likely tokens of the output distribution are selected, their probabilities are renormalized, and the next token is sampled from this restricted set.
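The two sampling variants can be sketched for a single decoding step as follows. This is an illustrative sketch using only the standard library, not the authors' implementation: `sample_next_token` is a hypothetical helper, and `logits` is assumed to hold one unnormalized score per vocabulary entry. Passing `k=None` gives unrestricted sampling, while an integer `k` restricts sampling to the k most likely tokens.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Convert scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(
    logits: List[float],
    k: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token id; restrict to the k most likely tokens if k is given."""
    rng = rng or random.Random()
    probs = softmax(logits)
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    candidates = ranked if k is None else ranked[:k]
    # Renormalize the probabilities over the restricted set.
    mass = sum(probs[i] for i in candidates)
    weights = [probs[i] / mass for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
unrestricted = sample_next_token(logits, rng=rng)   # any of the 4 tokens
restricted = sample_next_token(logits, k=2, rng=rng)  # only token 0 or 1
```

With `k=1` the procedure reduces to greedy search, which is one way to see top-k sampling as a middle ground between greedy decoding and unrestricted sampling.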

1.2. Noise

Three types of noise are applied to the beam search outputs: deleting words with probability 0.1, replacing words by a filler token with probability 0.1, and swapping words, implemented as a random permutation over the tokens that moves no word more than three positions.

  • In the later experiments, however, sampling is used instead of adding noise.
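The three noise operations described in this section can be sketched as below. Several details are assumptions made for illustration: the function name `add_noise`, the order in which the operations are applied (delete, then replace, then swap), and the swap mechanism, which adds a uniform offset in [0, max_shift + 1) to each position and sorts by the result, a common way to obtain a permutation in which no token moves more than `max_shift` positions. The three-position limit follows the paper's description of the swap noise.

```python
import random
from typing import List, Optional

FILLER = "BLANK"  # display name for the filler token; the real symbol may differ


def add_noise(
    tokens: List[str],
    p_delete: float = 0.1,
    p_blank: float = 0.1,
    max_shift: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Apply word deletion, filler replacement and a local shuffle to a sentence."""
    rng = rng or random.Random()
    # 1) Delete each word with probability p_delete.
    kept = [t for t in tokens if rng.random() >= p_delete]
    # 2) Replace each remaining word by the filler token with probability p_blank.
    blanked = [FILLER if rng.random() < p_blank else t for t in kept]
    # 3) Local shuffle: sort positions by (index + uniform offset).
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=keys.__getitem__)
    return [blanked[i] for i in order]


beam_output = "the quick brown fox jumps over the lazy dog".split()
noised = add_noise(beam_output, rng=random.Random(0))
```

In a pipeline, `add_noise` would be applied to each beam search output before pairing it with its original target sentence.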

1.3. Model

2. Results

2.1. Synthetic Data Generation Methods

Accuracy of models trained on different amounts of back-translated data on newstest2012 of WMT English-German translation

Sampling and beam+noise improve over bitext-only (5M) by 1.7 to 2 BLEU in the largest data setting.

Tokenized BLEU on various test sets of WMT English-German when adding 24M synthetic sentence pairs

Sampling and beam+noise perform roughly equal. Sampling is used in the remaining experiments.

Example where sampling produces inadequate outputs (“Mr President,” is not in the source. BLANK means that a word has been replaced by a filler token.)

2.2. Low Resource vs High Resource

BLEU when adding synthetic data from beam and sampling to bitext systems with 80K, 640K and 5M sentence pairs

Sampling is more effective than beam for larger setups (640K and 5.2M bitexts), while the opposite is true for the resource-poor setting (80K bitext), where beam is more effective.

2.3. SOTA Comparison

BLEU on newstest2014 for WMT English-German (En–De) and English-French (En–Fr)
  • DeepL, a commercial translation engine relying on high quality bilingual training data, achieves 33.3 tokenized BLEU.

Back Translation with sampling can result in high-quality translation models based on benchmark data only.


[2018 EMNLP] [Back Translation+Sampling]
Understanding Back-Translation at Scale

