# Brief Review — Unsupervised Machine Translation Using Monolingual Corpora Only

## UNMT, NMT Model Trained Without Parallel Data Using GAN

Unsupervised Machine Translation Using Monolingual Corpora Only (UNMT), by Facebook AI Research and Sorbonne Universités, 2018 ICLR, over 900 citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Neural Machine Translation, NMT, GAN

- A model is proposed that **takes sentences from monolingual corpora in two different languages** and **maps them into the same latent space**.
- By learning to **reconstruct in both languages from this shared feature space**, the model effectively **learns to translate without using any labeled data.**

# Outline

1. **Preliminaries**
2. **Unsupervised Neural Machine Translation (UNMT)**
3. **Experimental Results**

# 1. **Preliminaries**

## 1.1. Model

- The NMT model is based on the **Attention Decoder/RNNSearch** architecture.
- The model is composed of an **encoder** and a **decoder**, respectively responsible for **encoding source and target sentences to a latent space**, and for **decoding from that latent space to the source or the target domain**.
- *WS* denotes **the set of words in the source domain**, associated with the (learned) **word embeddings** *ZS*.
- Similarly, *WT* is **the set of words in the target domain**, associated with the **embeddings** *ZT*.
- Given **an input sentence of** *m* **words** *x* = (*x*1, *x*2, …, *xm*) **in a particular language** *l*, *l* ∈ {*src*, *tgt*}, **an encoder** *eθenc*,*Z*(*x*, *l*) computes a sequence of *m* hidden states *z* = (*z*1, *z*2, …, *zm*).
- **A decoder** *dθdec*,*Z*(*z*, *l*) takes as **input** *z* and **a language** *l*, and **generates an output sequence** *y* = (*y*1, *y*2, …, *yk*).

- The encoder and decoder are denoted as *e*(*x*, *l*) and *d*(*z*, *l*) for simplicity.

## 1.2. Datasets

- A dataset of sentences in the **source domain** is denoted by *Dsrc*, and another dataset in the **target domain** is denoted by *Dtgt*.

# 2. Unsupervised Neural Machine Translation (UNMT)

## 2.1. Overview

- **Left (autoencoding)**: The model is trained to **reconstruct a sentence from a noisy version of it**. *x* is the **target**, *C*(*x*) is the **noisy input**, *x*^ is the **reconstruction**.
- **Right (translation)**: The model is trained to translate a sentence from the other domain. The input is a noisy translation (in this case, from source to target) produced by the model itself, *M*, at the previous iteration (*t*): *y* = *M*(*t*)(*x*). The model is symmetric, and the same process is repeated in the other language.

## 2.2. Denoising Auto-Encoding

- Considering a **domain** *l* = *src* or *l* = *tgt*, and **a stochastic noise model denoted by** *C* which operates on sentences, the following **objective function** is defined:

- where *x*^ is a reconstruction of the corrupted version *C*(*x*) of *x*, with *x* sampled from the monolingual dataset *Dl*, and Δ is a measure of discrepancy between the two sequences: the sum of token-level cross-entropy losses.

- *C*(*x*) is a randomly sampled noisy version of sentence *x*. **Two different types of noise** are added:
- Every **word** in the input sentence is **dropped** with a **probability** *pwd*.
- The input sentence is slightly **shuffled by a random permutation** *σ*.

- (Please feel free to read the paper for more details about the noise model.)
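The two noise types above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation; the hyperparameter values and the `corrupt` name are placeholders. The limited shuffle follows the common trick of sorting tokens on their index plus a random offset in [0, *k*+1), which guarantees no token moves more than *k* positions.

```python
import random

def corrupt(sentence, p_wd=0.1, k=3, rng=None):
    """Toy stochastic noise model C(x): word dropout + slight shuffle.

    p_wd: probability of dropping each word independently.
    k: maximum displacement allowed by the random permutation sigma.
    """
    rng = rng or random.Random()
    # 1) Word dropout: each token is dropped with probability p_wd.
    kept = [w for w in sentence if rng.random() >= p_wd]
    if not kept:                      # keep at least one token
        kept = [sentence[0]]
    # 2) Slight shuffle: sort tokens by (index + random offset in [0, k+1)),
    #    which bounds every token's displacement by k positions.
    keys = [i + rng.uniform(0, k + 1) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept), key=lambda t: t[0])]
```

With `p_wd = 0` the output is a mild reordering of the input; with `k = 0` only dropout is applied.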

## 2.3. Cross Domain Training

- The **second objective** is to constrain the model to be able to **map an input sentence from the source/target domain** *l*1 **to the target/source domain** *l*2.
- **A sentence** *x* ∈ *Dl*1 is sampled, and **a corrupted translation of this sentence in** *l*2, *C*(*M*(*x*)), is generated.
- The objective is thus to **learn the encoder and the decoder** such that they can **reconstruct** *x* from *C*(*y*), where *y* = *M*(*x*).
- The **cross-domain loss** can be written as:

- where **Δ** is again **the sum of token-level cross-entropy losses**.

## 2.4. Adversarial Training

- Similar to a GAN, the **discriminator** is trained to **classify between the encoding of source sentences and the encoding of target sentences**:

- On the other hand, the **encoder** is trained instead to **fool the discriminator**:

## 2.5. Final Objective

- The **final objective function at one iteration** of the proposed learning algorithm is thus:

- In parallel, the discriminator loss *LD* is minimized to update the discriminator.

## 2.6. Iterative Training

- The model relies on an iterative algorithm which starts from **an initial translation model** *M*(1) (line 3 of the algorithm). This is used to **translate the available monolingual data**, as needed by the cross-domain loss function of Equation 2.
- **At each iteration**, **a new encoder and decoder are trained** by minimizing the loss of Equation 4 (line 7 of the algorithm). Then, **a new translation model** *M*(*t*+1) is created by composing the resulting encoder and decoder, and **the process repeats**.
- To jump-start the process, *M*(1) simply makes a **word-by-word translation** of each sentence **using a parallel dictionary learned by the unsupervised method proposed in the CSLS paper** (Word Translation Without Parallel Data).
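The iterative loop can be sketched at a high level. All function names here are illustrative stand-ins: `word_by_word` plays the role of *M*(1), and `train_step` abstracts away the actual minimization of Equation 4.

```python
def iterative_training(d_src, d_tgt, word_by_word, train_step, n_iters=3):
    """High-level sketch of the iterative algorithm.

    word_by_word(x, l1, l2): M^(1), a dictionary-lookup translation.
    train_step(pairs): trains a fresh encoder/decoder on back-translated
    (noisy input, reference) pairs and returns the composed model M^(t+1).
    """
    model = word_by_word                              # M^(1)
    for _ in range(n_iters):
        # Translate the monolingual data with the current model to build
        # pseudo-parallel pairs for the cross-domain loss.
        pairs = ([(model(x, "src", "tgt"), x) for x in d_src]
                 + [(model(y, "tgt", "src"), y) for y in d_tgt])
        model = train_step(pairs)                     # becomes M^(t+1)
    return model
```

Each round the back-translations improve, so the pseudo-parallel pairs become cleaner and the next model gets better training data.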

## 2.7. Model Selection Criteria

- However, there are no parallel sentences to judge how well the model translates, not even at validation time.
- Instead, a sentence is translated into the other language and then translated back into the original language. The **quality of the model** is then **evaluated by computing the BLEU score over the original inputs and their reconstructions** via this two-step translation process. The performance is **averaged over the two directions**, and the selected model is the one with the highest average score.
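The round-trip selection criterion can be sketched as follows. This is an illustrative version: `sent_score` stands in for sentence-level BLEU, and the function names are assumptions, not the authors' code.

```python
def selection_score(src_sents, tgt_sents, s2t, t2s, sent_score):
    """Unsupervised model-selection criterion (sketch).

    s2t / t2s: the model's two translation directions.
    sent_score(ref, hyp): any sentence-level similarity (BLEU in the paper).
    Each sentence is translated to the other language and back, the
    reconstruction is scored against the original, and the result is
    averaged over both directions.
    """
    def one_direction(sents, fwd, bwd):
        return sum(sent_score(x, bwd(fwd(x))) for x in sents) / len(sents)
    src_side = one_direction(src_sents, s2t, t2s)   # src -> tgt -> src
    tgt_side = one_direction(tgt_sents, t2s, s2t)   # tgt -> src -> tgt
    return (src_side + tgt_side) / 2.0
```

The model checkpoint with the highest average score is kept, with no parallel data needed at any point.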

# 3. Experimental Results

## 3.1. Datasets

## 3.2. Results

- **After just one iteration**, **BLEU scores of 27.48 and 12.10** are obtained for the **en-fr** task on **Multi30k-Task1** and **WMT** respectively.
- **After a few iterations**, the model obtains **BLEU of 32.76 and 15.05** on the **Multi30k-Task1** and **WMT** datasets for the **en-fr** task, which is rather remarkable.

Supervised NMT obtains the highest BLEU since it is trained on parallel data, but UNMT already achieves an impressive result.

Left: Subsequent iterations yield significant gains, although with diminishing returns. At iteration 3, the performance gains are marginal, showing that the approach quickly converges.

Right: The unsupervised method, which leverages about 15 million monolingual sentences in each language, obtains the same performance as a supervised NMT model trained on about 100,000 parallel sentences, which is impressive.

The **quality** of the translations **increases at every iteration**.

With the use of the GAN idea, an NMT model can be trained without parallel data, which I think is similar to CycleGAN in the image domain.

## Reference

[2018 ICLR] [UNMT] Unsupervised Machine Translation Using Monolingual Corpora Only
