Brief Review — Understanding Back-Translation at Scale

Back-Translation+Sampling, Effective in All but Resource-Poor Settings

Sep 24, 2022

Understanding Back-Translation at Scale
Back-Translation+Sampling, by Facebook AI Research and Google Brain
2018 EMNLP, Over 700 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT

Back-translation of target-language sentences is an effective way to generate synthetic source sentences.
It is found that in all but resource-poor settings, back-translations obtained via sampling or noised beam outputs are most effective.
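To make the data flow concrete, here is a minimal sketch of how back-translated pairs can be combined with real bitext. The names `build_training_data` and `toy_backward_translate` are hypothetical, and the toy lexicon stands in for a trained target-to-source model; this is not the paper's code.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_training_data(
    bitext: List[Pair],
    target_monolingual: List[str],
    backward_translate: Callable[[str], str],
) -> List[Pair]:
    """Append synthetic (back-translated source, real target) pairs to the bitext."""
    synthetic = [(backward_translate(t), t) for t in target_monolingual]
    return bitext + synthetic


def toy_backward_translate(target_sentence: str) -> str:
    """Hypothetical stand-in for a trained German-to-English model."""
    lexicon = {"hallo": "hello", "welt": "world"}
    return " ".join(lexicon.get(w, w) for w in target_sentence.lower().split())


bitext = [("good morning", "guten morgen")]
monolingual_german = ["Hallo Welt"]
data = build_training_data(bitext, monolingual_german, toy_backward_translate)
```

The target side of every synthetic pair is a real sentence; only the source side is machine-generated, which is why the decoding method used by the backward model matters.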

Outline

1. Back Translation+Sampling
2. Results

1. Back Translation+Sampling

Back-translation typically uses beam search or just greedy search. Both return only the most likely outputs, so they cannot model the true data distribution.

As alternatives, sampling from the model distribution and adding noise to beam search outputs are considered.

1.1. Sampling

Unrestricted sampling is explored, which generates outputs that are very diverse but sometimes highly unlikely.

Sampling restricted to the most likely words is also investigated. At each time step, the k most likely tokens from the output distribution are selected, their probabilities are renormalized, and the next token is sampled from this restricted set.
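A single decoding step of both variants can be sketched as follows. This is an illustrative toy, assuming the model's output distribution is given as a plain dictionary; `sample_next_token` and the example vocabulary are hypothetical, and the values of `k` are arbitrary.

```python
import random
from typing import Dict, Optional


def sample_next_token(
    probs: Dict[str, float],
    k: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token; k=None is unrestricted sampling, otherwise top-k sampling."""
    rng = rng or random.Random()
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if k is not None:
        ranked = ranked[:k]  # keep only the k most likely tokens
    tokens = [tok for tok, _ in ranked]
    total = sum(p for _, p in ranked)
    weights = [p / total for _, p in ranked]  # renormalize over the kept set
    return rng.choices(tokens, weights=weights, k=1)[0]


step_probs = {"the": 0.5, "a": 0.3, "zebra": 0.15, "xylophone": 0.05}
rng = random.Random(0)
top2_draws = {sample_next_token(step_probs, k=2, rng=rng) for _ in range(1000)}
free_draws = {sample_next_token(step_probs, rng=rng) for _ in range(1000)}
```

With `k=1` this reduces to greedy search, and with `k=None` unlikely tokens such as "xylophone" can still be drawn, which is the source of both the diversity and the occasional implausible output.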

1.2. Noise

Adding noise to the input sentences, an idea inspired by Denoising Autoencoders, has been found to be very beneficial.

Three types of noise are used: deleting words with probability 0.1, replacing words with a filler token with probability 0.1, and swapping words, implemented as a random permutation over the tokens that moves no word further than three positions.
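The three noise types can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: `add_noise`, the `<BLANK>` filler string, and the sort-based local shuffle are assumptions, and the limit of three positions follows my reading of the paper's description.

```python
import random
from typing import List, Optional


def add_noise(
    tokens: List[str],
    p_drop: float = 0.1,
    p_blank: float = 0.1,
    max_shift: int = 3,
    filler: str = "<BLANK>",
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Apply word deletion, filler replacement, and a local shuffle to a sentence."""
    rng = rng or random.Random()
    # 1) Delete each word with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # 2) Replace each remaining word by the filler token with probability p_blank.
    blanked = [filler if rng.random() < p_blank else t for t in kept]
    # 3) Local shuffle: sort by index plus a random offset in [0, max_shift + 1),
    #    which keeps every token within max_shift positions of where it started.
    keys = [i + rng.random() * (max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=lambda i: keys[i])
    return [blanked[i] for i in order]


sentence = "the cat sat on the mat because it was warm".split()
noisy = add_noise(sentence, rng=random.Random(0))
```

The sort-based shuffle is one common way to obtain a bounded permutation; it does not sample uniformly over all permutations that satisfy the distance limit.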

In the later experiments, however, sampling is used rather than beam+noise.

1.3. Model

A Big Transformer, implemented with the FAIRSEQ toolkit, is used. It has 6 Transformer blocks in both the encoder and the decoder.

2. Results

2.1. Synthetic Data Generation Methods

**Accuracy of models trained on different amounts of back-translated data on newstest2012 of WMT English-German translation**

Sampling and beam+noise improve over the bitext-only baseline (5M sentence pairs) by 1.7 to 2 BLEU in the largest data setting.

**Tokenized BLEU on various test sets of WMT English-German when adding 24M synthetic sentence pairs**

Sampling and beam+noise perform roughly equally, so sampling is used in the remaining experiments.

**Example where sampling produces inadequate outputs** (“Mr President,” is not in the source. BLANK means that a word has been replaced by a filler token.)

2.2. Low Resource vs High Resource

**BLEU when adding synthetic data from beam and sampling to bitext systems with 80K, 640K and 5M sentence pairs**

Sampling is more effective than beam for the larger setups (640K and 5.2M bitext), while the opposite holds in the resource-poor setting (80K bitext), where beam is more effective.

2.3. SOTA Comparison

**BLEU on newstest2014 for WMT English-German (En–De) and English-French (En–Fr)**

DeepL, a commercial translation engine relying on high-quality bilingual training data, achieves 33.3 tokenized BLEU on En–De.

Back-translation with sampling yields high-quality translation models from benchmark data only. The paper reports 35 BLEU on WMT'14 En–De.

Reference

[2018 EMNLP] [Back Translation+Sampling]
Understanding Back-Translation at Scale
