# Brief Review — Unsupervised Machine Translation Using Monolingual Corpora Only

## UNMT, NMT Model Trained Without Parallel Data Using GAN

Unsupervised Machine Translation Using Monolingual Corpora Only (UNMT), by Facebook AI Research and Sorbonne Universités, 2018 ICLR, Over 900 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Neural Machine Translation, NMT, GAN

- A model is proposed that **takes sentences from monolingual corpora in two different languages** and **maps them into the same latent space**.
- By learning to **reconstruct in both languages from this shared feature space**, the model effectively **learns to translate without using any labeled data.**

# Outline

1. **Preliminaries**
2. **Unsupervised Neural Machine Translation (UNMT)**
3. **Experimental Results**

# 1. **Preliminaries**

## 1.1. Model

- The NMT model is based on the **attention decoder/RNNSearch** architecture.
- The model is composed of an **encoder** and a **decoder**, respectively responsible for **encoding source and target sentences to a latent space**, and for **decoding from that latent space to the source or the target domain**.
- *WS* denotes **the set of words in the source domain**, associated with the (learned) **word embeddings** *ZS*.
- Similarly, *WT* is **the set of words in the target domain**, associated with the **embeddings** *ZT*.
- Given **an input sentence of** *m* **words** *x* = (*x1*, *x2*, …, *xm*) **in a particular language** *l*, *l* ∈ {*src*, *tgt*}, **an encoder** *e*(*x*, *l*) (with parameters *θenc* and embeddings *Z*) computes a sequence of *m* hidden states *z* = (*z1*, *z2*, …, *zm*).
- **A decoder** *d*(*z*, *l*) (with parameters *θdec* and embeddings *Z*) takes as **input** *z* and **a language** *l*, and **generates an output sequence** *y* = (*y1*, *y2*, …, *yk*).

The encoder and decoder are denoted as *e*(*x*, *l*) and *d*(*z*, *l*) for simplicity.
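The encoder/decoder interface above can be sketched as a toy pair of Python functions. This is purely illustrative: the real model is an attention-based RNN, and the hashing-based "embeddings", the dimensionality `DIM`, and the placeholder tokens below are my own assumptions, not from the paper.

```python
import random

DIM = 4  # toy latent dimensionality (an assumption, not from the paper)

def e(x, l):
    """Toy encoder e(x, l): maps each of the m words of sentence x
    (a list of strings) in language l to a deterministic pseudo-random
    "hidden state" z_i of size DIM."""
    return [[random.Random(hash((w, l, k))).uniform(-1, 1) for k in range(DIM)]
            for w in x]

def d(z, l):
    """Toy decoder d(z, l): stands in for generation; here it just emits
    one placeholder token per hidden state, tagged with the language."""
    return ["<%s_tok_%d>" % (l, i) for i in range(len(z))]

z = e(["hello", "world"], "src")   # m = 2 hidden states
y = d(z, "tgt")                    # decode the latent into the target language
```

Note that the same `e` and `d` serve both languages, distinguished only by the language argument *l*; that weight sharing is what makes the shared latent space possible.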

## 1.2. Datasets

- A dataset of sentences in the **source domain** is denoted by *Dsrc*, and another dataset in the **target domain** is denoted by *Dtgt*.

# 2. Unsupervised Neural Machine Translation (UNMT)

## 2.1. Overview

**Left (auto-encoding)**: The model is trained to **reconstruct a sentence from a noisy version of it**. *x* is the **target**, *C*(*x*) is the **noisy input**, *x̂* is the **reconstruction**.
**Right (translation)**: The model is trained to translate a sentence from the other domain. The input is a noisy translation (in this case, from source to target) produced by the model itself, *M*, at the previous iteration (*t*): *y* = *M*(*t*)(*x*). The model is symmetric, and the same process is repeated in the other language.

## 2.2. Denoising Auto-Encoding

- Considering a **domain** *l* = *src* or *l* = *tgt*, and **a stochastic noise model denoted by** *C* which operates on sentences, the following **objective function** is defined:

Lauto(θenc, θdec, Z, l) = E_{x~Dl, x̂~d(e(C(x), l), l)} [Δ(x̂, x)]

- where:
  - *x̂* is a reconstruction of the corrupted version of *x*, *C*(*x*), with *x* sampled from the monolingual dataset *Dl*;
  - Δ is a measure of discrepancy between the two sequences: the sum of token-level cross-entropy losses.

- *C*(*x*) is a randomly sampled noisy version of sentence *x*. **Two different types of noise** are added:

- Every **word** in the input sentence is **dropped** with a **probability** *pwd*.
- The input sentence is slightly **shuffled by a random permutation** *σ*.

- (Please feel free to read the paper for more details about the noise model.)
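The two noise operations can be sketched in a few lines of Python. The local shuffle follows the paper's construction of sorting positions perturbed by uniform noise, which keeps each word within *k* positions of its origin; the function name and default values here are illustrative choices.

```python
import random

def corrupt(sentence, p_wd=0.1, k=3, rng=None):
    """Noise model C(x): drop each word with probability p_wd, then apply
    a slight shuffle in which no word moves more than k positions, by
    sorting indices perturbed with uniform noise in [0, k + 1)."""
    rng = rng or random.Random(0)
    kept = [w for w in sentence if rng.random() >= p_wd]
    keys = [i + rng.uniform(0, k + 1) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept))]

noisy = corrupt("the cat sat on the mat".split(), p_wd=0.1, k=3)
```

Because a word at index *j* ≥ *i* + *k* + 1 always receives a larger sort key than the word at index *i*, the permutation induced by the sort can never move a word more than *k* positions.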

## 2.3. Cross Domain Training

- The **second objective** is to constrain the model to be able to **map an input sentence from the source/target domain** *l1* **to the target/source domain** *l2*.
- **A sentence** *x* ∈ *Dl1* is sampled, and **a corrupted translation of this sentence in** *l2*, *C*(*y*) with *y* = *M*(*x*), is generated.
- The objective is thus to **learn the encoder and the decoder** such that they can **reconstruct** *x* from *C*(*y*).
- The **cross-domain loss** can be written as:

Lcd(θenc, θdec, Z, l1, l2) = E_{x~Dl1, x̂~d(e(C(M(x)), l2), l1)} [Δ(x̂, x)]

- where **Δ** is again **the sum of token-level cross-entropy losses**.
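Δ, the sum of token-level cross-entropy losses used in both reconstruction objectives, can be computed as in this minimal sketch. The tiny vocabulary and the probability values are made up for illustration.

```python
import math

def delta(probs, target_ids):
    """Δ: sum of token-level cross-entropy losses. `probs` holds one
    probability distribution over the vocabulary per output position;
    `target_ids` holds the index of the reference token per position."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids))

# toy vocabulary of size 3, a reference sequence of two tokens
probs = [[0.7, 0.2, 0.1],   # distribution at output position 1
         [0.1, 0.8, 0.1]]   # distribution at output position 2
loss = delta(probs, [0, 1])  # -(log 0.7 + log 0.8) ≈ 0.58
```

In a real implementation the distributions come from the decoder's softmax; this is the same quantity frameworks expose as a summed (rather than averaged) cross-entropy loss.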

## 2.4. Adversarial Training

- Similar to a GAN, the **discriminator** is trained to **classify between the encoding of source sentences and the encoding of target sentences**:

LD(θD | θ, Z) = −E_{(xi, li)} [log pD(li | e(xi, li))]

- On the other hand, the **encoder** is trained instead to **fool the discriminator**:

Ladv(θenc, Z | θD) = −E_{(xi, li)} [log pD(lj | e(xi, li))], with *lj* = *l1* if *li* = *l2*, and vice versa.
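A minimal numeric sketch of the two adversarial objectives, assuming the discriminator outputs a single probability `p_src` that a latent vector came from the source language (the binary formulation and the function names are mine):

```python
import math

def d_loss(p_src, true_lang):
    """Discriminator loss: -log p_D(true language | latent)."""
    p = p_src if true_lang == "src" else 1.0 - p_src
    return -math.log(p)

def adv_loss(p_src, true_lang):
    """Encoder's adversarial loss: -log p_D(OTHER language | latent),
    so the encoder is rewarded when the discriminator is fooled."""
    p = 1.0 - p_src if true_lang == "src" else p_src
    return -math.log(p)

# a source-sentence latent that the discriminator identifies with 90% confidence:
ld = d_loss(0.9, "src")      # small: the discriminator is doing well
ladv = adv_loss(0.9, "src")  # large: the encoder is penalized for being detectable
```

The two losses pull in opposite directions on the same probability, which is exactly the GAN-style game: at the equilibrium the encodings of the two languages become indistinguishable.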

## 2.5. Final Objective

- The **final objective function at one iteration** of the proposed learning algorithm is thus:

L(θenc, θdec, Z) = λauto [Lauto(src) + Lauto(tgt)] + λcd [Lcd(src, tgt) + Lcd(tgt, src)] + λadv Ladv(θenc, Z | θD)

- where *λauto*, *λcd* and *λadv* are hyper-parameters weighting the auto-encoding, cross-domain and adversarial terms.

- In parallel, the discriminator loss *LD* is minimized to update the discriminator.

## 2.6. Iterative Training

- The model relies on an iterative algorithm which starts from **an initial translation model** *M*(1) (line 3). This is used to **translate the available monolingual data**, as needed by the cross-domain loss function of Equation 2.
- **At each iteration**, **a new encoder and decoder are trained** by minimizing the loss of Equation 4 (line 7 of the algorithm). Then, **a new translation model** *M*(*t*+1) is created by composing the resulting encoder and decoder, and **the process repeats**.
- To jump-start the process, *M*(1) simply makes a **word-by-word translation** of each sentence **using a parallel dictionary learned with the unsupervised method proposed in CSLS**.
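The loop can be sketched as a Python skeleton, where `word_by_word` plays the role of *M*(1) and `train_step` stands in for minimizing Equation 4 on the back-translated data (both callables are hypothetical stand-ins, not the paper's API):

```python
def iterative_training(d_src, d_tgt, word_by_word, train_step, n_iters=3):
    """Iterative UNMT sketch: M(1) is a word-by-word translator; each
    iteration back-translates the monolingual data with the current model
    and trains a new encoder/decoder pair, whose composition is M(t+1)."""
    model = word_by_word                                  # M(1)
    for t in range(n_iters):
        # back-translate the monolingual corpora with the current model
        pseudo_tgt = [model(x, "src->tgt") for x in d_src]
        pseudo_src = [model(y, "tgt->src") for y in d_tgt]
        # train a new model on (monolingual + back-translated) data
        model = train_step(d_src, d_tgt, pseudo_src, pseudo_tgt)
    return model

# toy demonstration: identity "word-by-word" start, a fake train_step that
# always returns a string-reversing "model"
wbw = lambda s, direction: s
ts = lambda src, tgt, ps, pt: (lambda s, direction: s[::-1])
m = iterative_training(["ab"], ["cd"], wbw, ts, n_iters=2)
```

The key property the skeleton preserves is that the pseudo-parallel data for iteration *t*+1 is always produced by the model from iteration *t*, never by any ground-truth alignment.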

## 2.7. Model Selection Criteria

- However, there are no parallel sentences to judge how well the model translates, not even at validation time.
- Instead, a sentence is translated into the other language and then translated back into the original one. The **quality of the model** is then **evaluated by computing the BLEU score over the original inputs and their reconstructions** via this two-step translation process. The performance is then **averaged over the two directions**, and the selected model is the one with the highest average score.

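The round-trip selection criterion can be sketched as follows; a trivial exact-match score stands in for BLEU here purely to keep the example self-contained.

```python
def round_trip_score(model, d_src, d_tgt, score):
    """Model selection: translate src->tgt->src and tgt->src->tgt, score
    each reconstruction against its original input, and average the two
    directions. `score` stands in for BLEU in this sketch."""
    s2s = sum(score(x, model(model(x, "src->tgt"), "tgt->src"))
              for x in d_src) / len(d_src)
    t2t = sum(score(y, model(model(y, "tgt->src"), "src->tgt"))
              for y in d_tgt) / len(d_tgt)
    return (s2s + t2t) / 2.0

identity = lambda s, direction: s              # toy "perfect" round-trip model
exact = lambda ref, hyp: 1.0 if ref == hyp else 0.0  # stand-in for BLEU
best = round_trip_score(identity, ["a b c"], ["d e f"], exact)
```

The model checkpoint with the highest `round_trip_score` over the validation monolingual data is the one kept, which requires no parallel sentences at all.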

# 3. Experimental Results

## 3.1. Datasets

## 3.2. Results

- **After just one iteration**, **BLEU scores of 27.48 and 12.10** are obtained for the **en-fr** task on **Multi30k-Task1** and **WMT** respectively.
- **After a few iterations**, the model obtains **BLEU of 32.76 and 15.05** on the **Multi30k-Task1** and **WMT** datasets for the **en-fr** task, which is rather remarkable.

Supervised NMT obtains the highest BLEU, as it is trained on parallel data, but UNMT already achieves an impressive result without any.

Left: Subsequent iterations yield significant gains, although with diminishing returns. At iteration 3, the performance gains are marginal, showing that the approach quickly converges.

Right: The unsupervised method, which leverages about 15 million monolingual sentences in each language, obtains the same performance as a supervised NMT model trained on about 100,000 parallel sentences, which is impressive.

The quality of the translations increases at every iteration.

## Reference

[2018 ICLR] [UNMT]

Unsupervised Machine Translation Using Monolingual Corpora Only
