Review — Word Translation Without Parallel Data

Adversarial Training with Cross-Domain Similarity Local Scaling (CSLS) for NMT model

Word Translation Without Parallel Data, by Facebook AI Research and Sorbonne Universités
2018 ICLR, Over 1300 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT

  • A bilingual dictionary is built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.


  1. Overall Method
  2. Adversarial Training
  3. Experimental Results

1. Overall Method

Toy illustration of the method
  • (A): There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y, which we want to align/translate.
  • The size of the dot is proportional to the frequency of the words.
  • (B): Using adversarial learning, a rotation matrix W is learnt, which roughly aligns the two distributions.
  • Green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution.
  • (C): The mapping W is further refined via Procrustes.
  • (D): Finally, the mapping W is used to translate, together with a proposed distance metric, CSLS, which expands the space in regions with a high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors.

2. Adversarial Training

  • Let X={x1, …, xn} and Y={y1, …, ym} be two sets of n and m word embeddings coming from a source and a target language respectively.
  • A discriminator is trained to discriminate between elements randomly sampled from WX={Wx1, …, Wxn} and Y.
  • W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player game.

2.1. Discriminator Objective

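  • Consider the probability PθD(source=1|z) that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:

$$L_D(\theta_D \mid W) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{source}=1 \mid W x_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{source}=0 \mid y_i)$$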

2.2. Mapping Objective

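  • In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins:

$$L_W(W \mid \theta_D) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{source}=0 \mid W x_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{source}=1 \mid y_i)$$
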
  • For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates.
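
The two objectives in 2.1 and 2.2 are the same cross-entropy with the labels swapped. Below is a minimal NumPy sketch of just the loss values, assuming the discriminator outputs are already available as probabilities. The function and argument names are illustrative, not taken from the paper's code.

```python
import numpy as np

def _clip(p, eps=1e-12):
    # Keep probabilities away from 0 and 1 so the logs stay finite.
    return np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)

def discriminator_loss(p_src_on_mapped, p_src_on_target):
    """Loss minimised by the discriminator.

    p_src_on_mapped: P(source=1 | W x_i) for the mapped source vectors.
    p_src_on_target: P(source=1 | y_j) for the target vectors.
    Mapped source vectors should be labelled 1, target vectors 0.
    """
    p_s, p_t = _clip(p_src_on_mapped), _clip(p_src_on_target)
    return -np.mean(np.log(p_s)) - np.mean(np.log(1.0 - p_t))

def mapping_loss(p_src_on_mapped, p_src_on_target):
    """Loss minimised by W: same inputs, but with the labels flipped,
    so W is rewarded when the discriminator is fooled."""
    p_s, p_t = _clip(p_src_on_mapped), _clip(p_src_on_target)
    return -np.mean(np.log(1.0 - p_s)) - np.mean(np.log(p_t))
```

When the discriminator is confident and correct, `discriminator_loss` is small and `mapping_loss` is large. In training, one gradient step on each would be taken in turn.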

2.3. Refinement

  • To refine the mapping, we build a synthetic parallel vocabulary using the W just learned with adversarial training, and the Procrustes solution is then applied to this generated dictionary.
  • The refined W can in turn be used to generate a more accurate dictionary, so the method can be applied iteratively. In practice, however, going beyond one iteration brings only a small further improvement.
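
The refinement step can be sketched in NumPy as follows. The Procrustes part is the standard closed-form SVD solution. The mutual-nearest-neighbour rule for building the synthetic dictionary uses plain cosine similarity here and is an assumption made for illustration, as are the function names and the row-wise (one embedding per row) layout.

```python
import numpy as np

def mutual_nn_pairs(WX, Y):
    """Return index arrays (src, tgt) of mutual nearest neighbours
    between mapped source rows WX and target rows Y (cosine similarity)."""
    a = WX / np.linalg.norm(WX, axis=1, keepdims=True)
    b = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sim = a @ b.T
    fwd = sim.argmax(axis=1)          # best target for each source
    bwd = sim.argmax(axis=0)          # best source for each target
    src = np.arange(sim.shape[0])
    keep = bwd[fwd] == src            # keep pairs that choose each other
    return src[keep], fwd[keep]

def procrustes(X_pairs, Y_pairs):
    """Orthogonal W such that W @ x_i is as close as possible to y_i
    (Frobenius norm), where row i of X_pairs and Y_pairs is a dictionary pair."""
    # With one embedding per row, Y^T X plays the role of Y X^T in column notation.
    U, _, Vt = np.linalg.svd(Y_pairs.T @ X_pairs)
    return U @ Vt
```

One refinement iteration would then be: map the source vectors with the current W, call `mutual_nn_pairs` to pick anchor pairs, and pass the selected rows to `procrustes` to obtain the new W.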

2.4. Cross-Domain Similarity Local Scaling (CSLS)

  • A bi-partite neighborhood graph is considered, in which each word of a given dictionary is connected to its K nearest neighbors in the other language.
  • NT(Wxs) denotes the neighborhood on this bi-partite graph, associated with a mapped source word embedding Wxs. All K elements of NT(Wxs) are words from the target language. Similarly, NS(yt) denotes the neighborhood associated with a word t of the target language.
  • The mean similarity of a source embedding xs to its target neighborhood is defined as:

$$r_T(W x_s) = \frac{1}{K} \sum_{y_t \in N_T(W x_s)} \cos(W x_s, y_t)$$

  • where cos( , ) is the cosine similarity.
  • Likewise, rS(yt) denotes the mean similarity of a target word yt to its neighborhood.
  • These quantities define a similarity measure CSLS( , ) between mapped source words and target words:

$$\text{CSLS}(W x_s, y_t) = 2\cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t)$$

Intuitively, this update increases the similarity associated with isolated word vectors. Conversely, it decreases the similarity of vectors lying in dense areas. The experiments show that CSLS significantly increases the accuracy of word translation retrieval, while not requiring any parameter tuning.
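
A dense NumPy sketch of the CSLS score matrix is given below, assuming the mapped source vectors and the target vectors are stored one per row. The function name is illustrative, and the full similarity matrix is materialised for clarity, which would not scale to a 200k-word vocabulary without batching.

```python
import numpy as np

def csls_scores(WX, Y, k=10):
    """CSLS similarity between every mapped source row of WX and every target row of Y.

    Returns an (n, m) matrix whose entry (s, t) is
    2 * cos(Wx_s, y_t) - r_T(Wx_s) - r_S(y_t).
    """
    a = WX / np.linalg.norm(WX, axis=1, keepdims=True)
    b = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = a @ b.T                                   # (n, m) cosine similarities
    k_t = min(k, cos.shape[1])                      # guard for tiny vocabularies
    k_s = min(k, cos.shape[0])
    # r_T: mean cosine of each mapped source word to its k nearest target words.
    r_T = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # r_S: mean cosine of each target word to its k nearest mapped source words.
    r_S = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2.0 * cos - r_T[:, None] - r_S[None, :]
```

The translation of source word s is then the target index `csls_scores(WX, Y)[s].argmax()`. A word sitting in a dense region has a large r value, which lowers its score against every candidate and so penalises hubs.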

2.5. Other Details

  • Unsupervised word vectors trained with fastText are used.
  • Discriminator: A multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. Only the 50,000 most frequent words are fed into the discriminator.
  • A simple update step ensures that the matrix W stays close to an orthogonal matrix during training:

$$W \leftarrow (1 + \beta) W - \beta (W W^{T}) W$$
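
The orthogonality update can be sketched in a few lines of NumPy. The default β = 0.01 is the value reported in the paper. The function name is illustrative.

```python
import numpy as np

def orthogonalize_step(W, beta=0.01):
    """One step of W <- (1 + beta) * W - beta * (W W^T) W.

    An exactly orthogonal W is a fixed point (W W^T = I gives back W),
    and a W near the orthogonal manifold is pulled slightly toward it.
    """
    return (1.0 + beta) * W - beta * (W @ W.T) @ W
```

In training, this step would be applied to W after each gradient update of the mapping, alternating with the adversarial updates.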

3. Experimental Results

Word translation retrieval P@1 for our released vocabularies in various language pairs

Adv-Refine-CSLS obtains the best performance on en-es, es-en, en-fr, en-de and eo-en, even though it uses no cross-lingual supervision.

English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words

Adv-Refine-CSLS obtains the best performance on English-Italian and Italian-English, even though it uses no cross-lingual supervision.
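
For reference, precision at k (P@1, P@5, P@10) for word translation retrieval can be computed as in the sketch below. It assumes a query-by-target score matrix (for example CSLS scores) and, for each query, a set of acceptable target indices. The function name and the input layout are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def precision_at_k(scores, gold, k=1):
    """Fraction of queries whose top-k retrieved targets contain a correct translation.

    scores: (n_queries, n_targets) similarity matrix, higher is better.
    gold:   sequence of sets, gold[i] holds the acceptable target indices for query i.
    """
    scores = np.asarray(scores, dtype=float)
    topk = np.argsort(-scores, axis=1)[:, :k]       # k best targets per query
    hits = [bool(set(row.tolist()) & g) for row, g in zip(topk, gold)]
    return float(np.mean(hits))
```

A query counts as a hit if any of its k best-scoring targets is in its gold set, so P@k can only increase with k.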

Adversarial training can be used to train an NMT model when there is no parallel data between the source and target languages.


[2018 ICLR] [CSLS]
Word Translation Without Parallel Data
