Brief Review — Understanding Back-Translation at Scale
Understanding Back-Translation at Scale
Back-Translation+Sampling, by Facebook AI Research and Google Brain
2018 EMNLP, Over 700 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT
1. Back Translation+Sampling
- Back-translation typically generates synthetic source sentences with beam search or greedy search. Both focus on the most likely outputs, so the synthetic sentences do not cover the true data distribution well.
As alternatives, sampling from the model distribution and adding noise to beam search outputs are considered.
- Unrestricted sampling is explored, which generates outputs that are very diverse but sometimes highly unlikely.
Sampling restricted to the most likely words is also investigated. At each time step, the k most likely tokens of the output distribution are selected, their probabilities are renormalized, and the next token is sampled from this restricted set.
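The restricted sampling step can be sketched in a few lines of Python. This is a minimal illustration for a single time step, not the authors' implementation: the function name `sample_top_k`, the `logits` list, and the `rng` argument are hypothetical names chosen for the sketch.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample one token id from the k highest-scoring entries of `logits`.

    `logits` is a list of unnormalized scores, one per vocabulary entry
    (a hypothetical stand-in for the decoder output at one time step).
    """
    # Select the indices of the k most likely tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # A softmax over only these k scores equals renormalizing the full
    # softmax probabilities over the restricted set.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one token id in proportion to the renormalized probabilities.
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy search, and with `k=len(logits)` it becomes unrestricted sampling, so the two extremes discussed above are special cases of the same routine.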
- Adding noise to beam search outputs is also considered. The idea is inspired by denoising autoencoders, where adding noise to input sentences has been very beneficial.
Three types of noise are applied: deleting words with probability 0.1, replacing words with a filler token with probability 0.1, and swapping words, implemented as a random permutation over the tokens that moves no word more than three positions.
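A rough Python sketch of the three noise operations follows. It is an illustration under stated assumptions rather than the paper's code: the function name `add_noise`, the filler string `"<BLANK>"`, and the parameter names are hypothetical, and the local shuffle uses a common trick (sort tokens by index plus a small random offset) that keeps each word within `max_shift` positions of where it started.

```python
import random


def add_noise(tokens, p_drop=0.1, p_blank=0.1, max_shift=3,
              filler="<BLANK>", rng=random):
    """Return a noised copy of `tokens` (a list of word strings)."""
    # 1) Delete each word independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # 2) Replace each remaining word by the filler token with probability p_blank.
    blanked = [filler if rng.random() < p_blank else t for t in kept]
    # 3) Local shuffle: perturb each index by a uniform offset in
    #    [0, max_shift + 1) and sort by the perturbed key. A word can then
    #    end up at most max_shift positions away from its original slot.
    keys = [i + rng.random() * (max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=lambda i: keys[i])
    return [blanked[i] for i in order]
```

Calling `add_noise("the cat sat on the mat".split())` returns a list of roughly the same length in which a few words may be missing, blanked out, or slightly out of order.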
- In the later experiments, however, sampling is used rather than beam+noise.
2. Results
2.1. Synthetic Data Generation Methods
Sampling and beam+noise improve over the bitext-only (5M) baseline by 1.7 to 2 BLEU in the largest data setting.
Sampling and beam+noise perform roughly equally, so sampling is used in the remaining experiments.
2.2. Low Resource vs High Resource
Sampling is more effective than beam search in the larger setups (640K and 5.2M bitext), while the opposite holds in the resource-poor setting (80K bitext), where beam search works better.
2.3. SOTA Comparison
- DeepL, a commercial translation engine relying on high quality bilingual training data, achieves 33.3 tokenized BLEU.
Back-translation with sampling can yield high-quality translation models trained on benchmark data only.