Brief Review — Unsupervised Machine Translation Using Monolingual Corpora Only

UNMT, NMT Model Trained Without Parallel Data Using GAN

Sik-Ho Tsang
Oct 15, 2022


Unsupervised Machine Translation Using Monolingual Corpora Only
UNMT, by Facebook AI Research, and Sorbonne Universités
2018 ICLR, Over 900 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT, GAN

  • A model is proposed that takes sentences from monolingual corpora in two different languages and maps them into the same latent space.
  • By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data.


  1. Preliminaries
  2. Unsupervised Neural Machine Translation (UNMT)
  3. Experimental Results

1. Preliminaries

1.1. Model

  • The NMT model is based on Attention Decoder/RNNSearch.
  • The model is composed of an encoder and a decoder: the encoder maps source and target sentences to a latent space, and the decoder maps from that latent space back to the source or the target domain.
  • W_S denotes the set of words in the source domain, associated with the (learned) word embeddings Z_S.
  • Similarly, W_T is the set of words in the target domain, associated with the embeddings Z_T.
  • Given an input sentence of m words x=(x_1, x_2, …, x_m) in a particular language l ∈ {src, tgt}, an encoder e_{θenc,Z}(x, l) computes a sequence of m hidden states z=(z_1, z_2, …, z_m).
  • A decoder d_{θdec,Z}(z, l) takes as input z and a language l, and generates an output sequence y=(y_1, y_2, …, y_k).

The encoder and decoder are denoted as e(x, l) and d(z, l) for simplicity.

1.2. Datasets

  • The dataset of sentences in the source domain is denoted by D_src, and the dataset of sentences in the target domain by D_tgt.

2. Unsupervised Neural Machine Translation (UNMT)

2.1. Overview

Toy illustration of the principles guiding the design of the objective function (left: autoencoding, right: translation)
  • Left (autoencoding): The model is trained to reconstruct a sentence from a noisy version of it. x is the target, C(x) is the noisy input, ^x is the reconstruction.
  • Right (translation): The model is trained to translate a sentence in the other domain. The input is a noisy translation (in this case, from source-to-target) produced by the model itself, M, at the previous iteration (t), y=M(t)(x). The model is symmetric, and the same process is repeated in the other language.
Illustration of the proposed architecture and training objectives

2.2. Denoising Auto-Encoding

  • Considering a domain l=src or l=tgt, and a stochastic noise model C which operates on sentences, the denoising auto-encoding objective L_auto is the expected discrepancy Δ(^x, x) between a sentence x and its reconstruction ^x, where:

^x is the reconstruction of x decoded from its corrupted version C(x), i.e. ^x ~ d(e(C(x), l), l), with x sampled from the monolingual dataset D_l. Δ is a measure of discrepancy between the two sequences, here the sum of token-level cross-entropy losses.

  • C(x) is a randomly sampled noisy version of sentence x.
  • Two different types of noise are added:
  1. Every word in the input sentence is dropped with a probability p_wd.
  2. The input sentence is slightly shuffled by a permutation σ.
  • (Please feel free to read the paper for more details about the noise model.)
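
As a rough illustration of the noise model C(x), here is a minimal Python sketch. It assumes, following the paper's description as I understand it, that the permutation σ is obtained by sorting the keys i + U(0, k+1), which keeps every word within k positions of where it started. The function names, the defaults p_wd=0.1 and k=3, and the fallback that keeps one word when everything would be dropped are assumptions of this sketch, not the authors' code.

```python
import random


def word_dropout(words, p_wd, rng):
    """Drop each word independently with probability p_wd."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p_wd]
    # Assumption of this sketch: never return an empty sentence.
    return kept if kept else [rng.choice(words)]


def local_shuffle(words, k, rng):
    """Slightly shuffle a sentence: no word moves more than k positions."""
    # Key for position i is i + u with u drawn from [0, k + 1).
    keys = [i + rng.random() * (k + 1) for i in range(len(words))]
    order = sorted(range(len(words)), key=keys.__getitem__)
    return [words[i] for i in order]


def corrupt(words, p_wd=0.1, k=3, rng=None):
    """C(x): word dropout followed by a slight local shuffle."""
    rng = rng or random.Random()
    return local_shuffle(word_dropout(words, p_wd, rng), k, rng)


if __name__ == "__main__":
    sentence = "a man is riding a horse on the beach".split()
    print(corrupt(sentence, rng=random.Random(0)))
```

Passing the random generator explicitly keeps the corruption reproducible, which is convenient when comparing runs.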

2.3. Cross Domain Training

  • The second objective constrains the model to map an input sentence from the source/target domain l1 to the target/source domain l2.
  • A sentence x ∈ D_l1 is sampled, and a corrupted translation of this sentence in l2, C(y) with y=M(x), is generated.
  • The objective is thus to learn the encoder and the decoder such that they can reconstruct x from C(y).
  • The cross-domain loss L_cd is the expected discrepancy Δ(^x, x), with ^x decoded into l1 from the encoding of C(y) in l2, i.e. ^x ~ d(e(C(M(x)), l2), l1), where Δ is again the sum of token-level cross-entropy losses.

2.4. Adversarial Training

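  • Similar to GAN, a discriminator is trained to classify between the encodings of source sentences and the encodings of target sentences, by minimizing the cross-entropy loss L_D of predicting the language of each encoded sentence.
  • On the other hand, the encoder is trained to fool the discriminator, i.e. to make it predict the wrong language for each encoded sentence, through the adversarial loss L_adv.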
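
To make the two losses concrete, here is a small numerical sketch in plain Python. It assumes a binary discriminator that outputs p_src, the probability that a latent state comes from a source sentence. The function names and the clamping constant eps are illustrative assumptions, and no actual encoder or discriminator network is involved.

```python
import math


def discriminator_loss(p_src, langs, eps=1e-12):
    """L_D: mean negative log-probability assigned to the TRUE language."""
    total = 0.0
    for p, lang in zip(p_src, langs):
        p_true = p if lang == "src" else 1.0 - p
        total -= math.log(max(p_true, eps))  # clamp to avoid log(0)
    return total / len(langs)


def adversarial_loss(p_src, langs, eps=1e-12):
    """L_adv: the same cross-entropy, computed against SWAPPED labels."""
    swapped = ["tgt" if lang == "src" else "src" for lang in langs]
    return discriminator_loss(p_src, swapped, eps)


if __name__ == "__main__":
    langs = ["src", "tgt"]
    confident = [0.9, 0.1]  # discriminator separates the two languages well
    print(discriminator_loss(confident, langs))  # about 0.105
    print(adversarial_loss(confident, langs))    # about 2.303
```

When the discriminator tells the languages apart, L_D is small and L_adv is large. The encoder lowers L_adv by making the two latent distributions harder to distinguish, and at p_src=0.5 for every input both losses equal log 2.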

2.5. Final Objective

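  • The final objective function at one iteration of the proposed learning algorithm is thus a weighted sum of the auto-encoding losses (for both languages), the cross-domain losses (for both directions), and the adversarial loss.
  • In parallel, the discriminator loss L_D is minimized to update the discriminator.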
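
For reference, the objectives of Sections 2.2 to 2.5 can be written out in LaTeX as below. This is a reconstruction from the definitions above and from the paper as I recall it, so it is best checked against the original. The weights λ_auto, λ_cd and λ_adv are hyper-parameters balancing the three terms, and the numbering is chosen so that the cross-domain loss is Equation 2 and the final objective is Equation 4.

```latex
\begin{align}
\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell)
  &= \mathbb{E}_{x \sim \mathcal{D}_{\ell},\; \hat{x} \sim d(e(C(x), \ell), \ell)}
     \big[ \Delta(\hat{x}, x) \big] \\
\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell_1, \ell_2)
  &= \mathbb{E}_{x \sim \mathcal{D}_{\ell_1},\; \hat{x} \sim d(e(C(M(x)), \ell_2), \ell_1)}
     \big[ \Delta(\hat{x}, x) \big] \\
\mathcal{L}_{D}(\theta_{D} \mid \theta, \mathcal{Z})
  &= -\mathbb{E}_{(x_i, \ell_i)}
     \big[ \log p_D(\ell_i \mid e(x_i, \ell_i)) \big] \nonumber \\
\mathcal{L}_{adv}(\theta_{enc}, \mathcal{Z} \mid \theta_{D})
  &= -\mathbb{E}_{(x_i, \ell_i)}
     \big[ \log p_D(\ell_j \mid e(x_i, \ell_i)) \big],
     \quad \ell_j \neq \ell_i \\
\mathcal{L}(\theta_{enc}, \theta_{dec}, \mathcal{Z})
  &= \lambda_{auto} \big[ \mathcal{L}_{auto}(\cdot, src) + \mathcal{L}_{auto}(\cdot, tgt) \big]
   + \lambda_{cd} \big[ \mathcal{L}_{cd}(\cdot, src, tgt) + \mathcal{L}_{cd}(\cdot, tgt, src) \big]
   + \lambda_{adv} \, \mathcal{L}_{adv}
\end{align}
```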

2.6. Iterative Training

Unsupervised Training for Machine Translation
  • The model relies on an iterative algorithm which starts from an initial translation model M(1) (line 3). This is used to translate the available monolingual data, as needed by the cross-domain loss function of Equation 2.
  • At each iteration, a new encoder and decoder are trained by minimizing the loss of Equation 4 — line 7 of the algorithm. Then, a new translation model M(t+1) is created by composing the resulting encoder and decoder, and the process repeats.
  • To jump-start the process, M(1) simply makes a word-by-word translation of each sentence using a parallel dictionary learned by the unsupervised method proposed in CSLS.
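
The loop described above can be sketched in a few lines of Python. This is only a skeleton under stated assumptions: train_fn is a hypothetical callable standing in for "minimize the loss of Equation 4 and return the new model", the dictionaries are toy examples, and copying unknown words unchanged is a choice made for this sketch.

```python
from typing import Callable, Dict, List, Tuple

Sentence = List[str]
# A translator maps (sentence, target language) to a translated sentence.
Translator = Callable[[Sentence, str], Sentence]
Pairs = List[Tuple[Sentence, Sentence]]


def make_word_by_word(dictionaries: Dict[str, Dict[str, str]]) -> Translator:
    """Build M(1): translate each word independently with a dictionary.

    dictionaries[lang] maps words of the other language into `lang`.
    Unknown words are copied unchanged (an assumption of this sketch).
    """
    def translate(sentence: Sentence, tgt_lang: str) -> Sentence:
        table = dictionaries[tgt_lang]
        return [table.get(word, word) for word in sentence]
    return translate


def iterative_training(
    d_src: List[Sentence],
    d_tgt: List[Sentence],
    m1: Translator,
    train_fn: Callable[[Pairs, Pairs], Translator],
    n_iters: int,
) -> Translator:
    """Alternate between translating monolingual data and retraining."""
    model = m1
    for _ in range(n_iters):
        # Translate the monolingual data with the current model M(t).
        # Each pair is (model translation, original sentence).
        src_pairs = [(model(x, "tgt"), x) for x in d_src]
        tgt_pairs = [(model(x, "src"), x) for x in d_tgt]
        # Hypothetical training step that returns M(t+1).
        model = train_fn(src_pairs, tgt_pairs)
    return model


if __name__ == "__main__":
    toy = {"tgt": {"the": "le", "cat": "chat"},
           "src": {"le": "the", "chat": "cat"}}
    m1 = make_word_by_word(toy)
    print(m1(["the", "cat", "sleeps"], "tgt"))
```

In a real system, train_fn would also apply the noise model C to the translations and update the discriminator. Both are left out here to keep the control flow visible.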

2.7. Model Selection Criteria

  • There are no parallel sentences to judge how well the model translates, not even at validation time, so a surrogate criterion is used.
  • Each sentence is translated into the other language and then translated back into its original language. The quality of the model is then evaluated by computing the BLEU score between the original inputs and their reconstructions via this two-step translation process. The performance is averaged over the two directions, and the selected model is the one with the highest average score.
Unsupervised model selection
  • Then, the model is selected based on the average BLEU score as above.
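
The round-trip criterion can be sketched as follows. To keep the example self-contained, BLEU is replaced by a simple position-wise token accuracy. That stand-in is an assumption of this sketch and is not the metric used in the paper, and all function names are illustrative. A real BLEU implementation can be passed in through score_fn.

```python
from typing import Callable, List

Sentence = List[str]
Translator = Callable[[Sentence, str], Sentence]
ScoreFn = Callable[[List[Sentence], List[Sentence]], float]


def token_accuracy(refs: List[Sentence], hyps: List[Sentence]) -> float:
    """Stand-in for BLEU: share of reference tokens matched in place."""
    matched = sum(r == h for ref, hyp in zip(refs, hyps)
                  for r, h in zip(ref, hyp))
    total = sum(len(ref) for ref in refs)
    return matched / total if total else 0.0


def round_trip_score(sentences: List[Sentence], translate: Translator,
                     lang: str, other: str, score_fn: ScoreFn) -> float:
    """Translate lang -> other -> lang and compare with the originals."""
    reconstructions = [translate(translate(x, other), lang) for x in sentences]
    return score_fn(sentences, reconstructions)


def selection_score(d_src: List[Sentence], d_tgt: List[Sentence],
                    translate: Translator,
                    score_fn: ScoreFn = token_accuracy) -> float:
    """Average the round-trip score over the two directions."""
    return 0.5 * (round_trip_score(d_src, translate, "src", "tgt", score_fn)
                  + round_trip_score(d_tgt, translate, "tgt", "src", score_fn))
```

Among several checkpoints, the one with the highest selection_score would be kept, and no parallel validation data is needed at any point.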

3. Experimental Results

3.1. Datasets

Multi30k-Task1 and WMT datasets statistics
  • To limit the vocabulary size in the WMT en-fr and WMT de-en datasets, only words with more than 100 and 25 occurrences, respectively, are considered.

3.2. Results

BLEU score on the Multi30k-Task1 and WMT datasets using greedy decoding
  • After just one iteration, BLEU scores of 27.48 and 12.10 are obtained for the en-fr task on Multi30k-Task1 and WMT respectively.
  • After a few iterations, the model obtains BLEU of 32.76 and 15.05 on Multi30k-Task1 and WMT datasets for the en-fr task, which is rather remarkable.

Supervised NMT obtains the highest BLEU since it is trained on parallel data, but UNMT already obtains an impressive result without any.

Left: BLEU as a function of the number of iterations of our algorithm on the Multi30k-Task1 datasets. Right: The curves show BLEU as a function of the amount of parallel data on WMT datasets

Left: Subsequent iterations yield significant gains although with diminishing returns. At iteration 3, the performance gains are marginal, showing that our approach quickly converges.

Right: The unsupervised method, which leverages about 15 million monolingual sentences in each language, obtains the same performance as a supervised NMT model trained on about 100,000 parallel sentences, which is impressive.

Unsupervised translations

The quality of the translations increases at every iteration.

By borrowing the GAN idea, an NMT model can be trained without parallel data, which I think is similar to CycleGAN in the image domain.


[2018 ICLR] [UNMT]
Unsupervised Machine Translation Using Monolingual Corpora Only
