Review — Word Translation Without Parallel Data

Adversarial Training with Cross-Domain Similarity Local Scaling (CSLS) for NMT model

4 min read · Sep 17, 2022

Word Translation Without Parallel Data
CSLS, by Facebook AI Research, and Sorbonne Universités
2018 ICLR, Over 1300 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT

A bilingual dictionary is built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.

Outline

Overall Method
Adversarial Training
Experimental Results

1. Overall Method

(A): There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y, which we want to align/translate.
The size of the dot is proportional to the frequency of the words.
(B): Using adversarial learning, a rotation matrix W is learnt, which roughly aligns the two distributions.
Green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution.
(C): The mapping W is further refined via Procrustes.
(D): Finally, the mapping W is used to translate. A similarity metric, CSLS, is proposed that expands the space in regions with a high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors.

2. Adversarial Training

Let X={x1, …, xn} and Y={y1, …, ym} be two sets of n and m word embeddings coming from a source and a target language respectively.
A discriminator is trained to discriminate between elements randomly sampled from WX={Wx1, …, Wxn} and Y.
W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player game.

2.1. Discriminator Objective

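Let PθD(source=1|z) be the probability, according to the discriminator with parameters θD, that a vector z is the mapping of a source embedding (as opposed to a target embedding). The discriminator loss can be written as:

$$L_D(\theta_D \mid W) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{source}=1 \mid W x_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{source}=0 \mid y_i)$$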

2.2. Mapping Objective

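In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins:

$$L_W(W \mid \theta_D) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{source}=0 \mid W x_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{source}=1 \mid y_i)$$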

For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates.
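To make the two objectives concrete, here is a minimal NumPy sketch that evaluates both losses on a batch. It assumes a toy logistic-regression discriminator with hypothetical parameters `disc_w` and `disc_b` (the paper uses an MLP), and it only computes the losses; the gradient updates are left out.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def adversarial_losses(W, X, Y, disc_w, disc_b, eps=1e-12):
    """Return (L_D, L_W) for a toy logistic discriminator.

    X: (n, d) source embeddings, Y: (m, d) target embeddings, W: (d, d).
    disc_w, disc_b: hypothetical linear discriminator parameters whose
    output is read as P(source=1 | z). This is an illustrative stand-in.
    """
    p_mapped = sigmoid(X @ W.T @ disc_w + disc_b)  # P(source=1 | W x_i)
    p_target = sigmoid(Y @ disc_w + disc_b)        # P(source=1 | y_i)

    # Discriminator: mapped source vectors get label 1, target vectors label 0.
    loss_d = -np.mean(np.log(p_mapped + eps)) - np.mean(np.log(1 - p_target + eps))
    # Mapping: the same terms with the labels flipped, so W tries to fool it.
    loss_w = -np.mean(np.log(1 - p_mapped + eps)) - np.mean(np.log(p_target + eps))
    return loss_d, loss_w
```

When the discriminator separates the two sets well, `loss_d` is small and `loss_w` is large, which is the signal that pushes W to move the mapped source vectors toward the target distribution.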

2.3. Refinement

To refine the mapping, a synthetic parallel vocabulary is built using the W just learned with adversarial training, and the Procrustes solution is applied to this generated dictionary.
The refined W can in turn generate a more accurate dictionary, so the method can be applied iteratively. However, more than one iteration is found to bring little further improvement.
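The Procrustes step mentioned in (C) has a closed-form solution via SVD. The sketch below assumes the synthetic dictionary has already been built and is given as two row-aligned arrays (row i of `X` should translate to row i of `Y`). The function name and array layout are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimising ||X W^T - Y||_F for row-wise paired embeddings.

    X, Y: (k, d) arrays holding k dictionary pairs of d-dimensional vectors.
    """
    # With row vectors, the closed form is W = U V^T where U S V^T = SVD(Y^T X).
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt
```

Because W is constrained to be orthogonal, the refined mapping preserves dot products and distances within the source embedding space.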

2.4. Cross-Domain Similarity Local Scaling (CSLS)

A bi-partite neighborhood graph is considered, in which each word of a given dictionary is connected to its K nearest neighbors in the other language.
NT(Wxs) denotes the neighborhood on this bi-partite graph, associated with a mapped source word embedding Wxs. All K elements of NT(Wxs) are words from the target language. Similarly, NS(yt) denotes the neighborhood associated with a word t of the target language.
The mean similarity of a mapped source embedding Wxs to its target neighborhood is defined as:

$$r_T(W x_s) = \frac{1}{K}\sum_{y_t \in N_T(W x_s)} \cos(W x_s, y_t)$$

where cos( , ) is the cosine similarity.
Likewise, rS(yt) denotes the mean similarity of a target word yt to its neighborhood.
The similarity measure CSLS( , ) between mapped source words and target words is then defined as:

$$\text{CSLS}(W x_s, y_t) = 2\cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t)$$

Intuitively, this update increases the similarity associated with isolated word vectors and, conversely, decreases the similarity of vectors lying in dense areas. Experiments show that CSLS significantly increases the accuracy of word translation retrieval, while not requiring any parameter tuning.
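As a rough illustration, CSLS can be computed for all source/target pairs at once: twice the cosine similarity, minus the mean similarity of each word to its K nearest neighbors in the other language. This is a dense NumPy sketch for small vocabularies. The function name, the default `k=10`, and the clipping of `k` to the set size are assumptions made for the example.

```python
import numpy as np

def csls_scores(mapped_src, tgt, k=10):
    """CSLS matrix of shape (n_src, n_tgt) between mapped source and target vectors."""
    # Normalise rows so that dot products are cosine similarities.
    a = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = a @ b.T

    k_t = min(k, cos.shape[1])  # neighbours available among target words
    k_s = min(k, cos.shape[0])  # neighbours available among source words
    # r_T: mean cosine of each mapped source word to its k nearest target words.
    r_t = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # r_S: mean cosine of each target word to its k nearest mapped source words.
    r_s = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)

    return 2 * cos - r_t[:, None] - r_s[None, :]
```

A translation for source word s is then `csls_scores(X @ W.T, Y)[s].argmax()`. A hub target word has a large r_S value, so its score is penalised for every source word.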

2.5. Other Details

Unsupervised word vectors trained with fastText are used.
Discriminator: A multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. Only the 50,000 most frequent words are fed into the discriminator.
A simple update step is used to ensure that the matrix W stays close to an orthogonal matrix during training:

$$W \leftarrow (1 + \beta) W - \beta (W W^\top) W$$

where β = 0.01 is found to work well.
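The orthogonality update from the paper, W ← (1 + β)W − β(WWᵀ)W, is a one-liner in NumPy. The sketch below is only the update itself, with a hypothetical function name; in training it would be applied after each gradient step on W.

```python
import numpy as np

def orthogonalize_step(W, beta=0.01):
    """One update W <- (1 + beta) W - beta (W W^T) W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W
```

An exactly orthogonal W is a fixed point of this update, since WWᵀ = I makes the two β terms cancel. For a W that has drifted slightly, each step pulls its singular values back toward 1.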