Brief Review — Exploiting Similarities among Languages for Machine Translation

Early Paper for Word Translation Using Translation Matrix (TM)

Oct 2, 2022

Exploiting Similarities among Languages for Machine Translation,
Translation Matrix (TM), by Google Inc.,
2013 arXiv, Over 1500 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT

By training a translation matrix (TM) that maps word vectors in the source language to word vectors in the target language, words and phrases that are missing from existing dictionaries can also be translated.

Outline

CBOW & Skip-Gram
Translation Matrix (TM)
Datasets & Results

1. CBOW & Skip-Gram

(If you know CBOW and Skip-Gram models in Word2Vec, please skip this section.)

**Graphical representation of the CBOW model and Skip-gram model**

Given a sequence of training words w1, w2, w3, …, wT, the objective of the Skip-gram model is to maximize the average log probability:

$$\frac{1}{T}\sum_{t=k}^{T-k}\;\sum_{\substack{j=-k \\ j \neq 0}}^{k}\log p(w_{t+j}\mid w_t)$$

where k is the size of the training window.
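The probability of correctly predicting the word wi given the word wj is defined as:

$$p(w_i \mid w_j) = \frac{\exp\left(u_{w_i}^{\top} v_{w_j}\right)}{\sum_{l=1}^{V}\exp\left(u_{l}^{\top} v_{w_j}\right)}$$

where V is the number of words in the vocabulary, and u_w and v_w are the two learnable vectors associated with each word w.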
The CBOW model is defined analogously, except that it predicts the current word from its surrounding context words.
(Please feel free to read Word2Vec for CBOW and Skip-Gram model for word prediction/language modeling.)
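
As a minimal sketch of the prediction probability described above, assuming the standard full-softmax form used in Word2Vec, the snippet below computes p(wi | wj) from two per-word vector tables. The names `skipgram_prob`, `U` and `V_in` and the random toy vectors are illustrative assumptions, not the authors' code. Real implementations typically avoid the full sum over the vocabulary, for example with hierarchical softmax.

```python
import numpy as np

def skipgram_prob(U, V_in, i, j):
    """Softmax probability of word i given word j.

    U    : (vocab_size, dim) vectors used for the predicted word (assumed layout)
    V_in : (vocab_size, dim) vectors used for the conditioning word (assumed layout)
    """
    scores = U @ V_in[j]            # one dot product per vocabulary word
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[i] / exp_scores.sum()

# Toy vocabulary of 6 words with 4-dimensional random vectors (illustrative only).
rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
U = rng.normal(size=(vocab_size, dim))
V_in = rng.normal(size=(vocab_size, dim))

# Probabilities of every word given word 2; they form a distribution.
probs = np.array([skipgram_prob(U, V_in, i, 2) for i in range(vocab_size)])
```

The denominator runs over all V words, which is why the cost of the exact softmax grows with the vocabulary size.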

2. Translation Matrix (TM)

2.1. Motivation

**Distributed word vector representations of numbers and animals in English (left) and Spanish (right)**

The word vectors are visualized using PCA.

The key observation is that the vector representations of similar words in different languages are approximately related by a linear transformation.

For example, the word vectors for the English numbers one to five and the corresponding Spanish words uno to cinco have similar geometric arrangements.
Thus, if we know the translations of one and four from English to Spanish, we can learn a transformation matrix that helps us translate the other numbers to Spanish as well.

2.2. Translation Matrix (TM)

Suppose we are given a set of word pairs and their associated vector representations {xi, zi} where i is from 1 to n.
xi is the vector from source language.
zi is the vector from target language.
The goal is to find a transformation matrix W such that Wxi approximates zi:

$$\min_{W}\sum_{i=1}^{n}\left\lVert W x_i - z_i \right\rVert^{2}$$

which is solved with stochastic gradient descent (SGD).
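
The training step can be sketched as follows: a minimal NumPy example that fits W by plain SGD on the squared error between Wxi and zi. The function name `train_translation_matrix`, the learning rate, the number of epochs and the synthetic data are all assumptions made for illustration; the source does not give the authors' hyperparameters.

```python
import numpy as np

def train_translation_matrix(X, Z, lr=0.01, epochs=200, seed=0):
    """Fit W so that W @ x_i approximates z_i, using per-example SGD.

    X : (n, d_src) source-language word vectors
    Z : (n, d_tgt) target-language word vectors of the translations
    """
    rng = np.random.default_rng(seed)
    n, d_src = X.shape
    d_tgt = Z.shape[1]
    W = np.zeros((d_tgt, d_src))
    for _ in range(epochs):
        for i in rng.permutation(n):        # visit the pairs in random order
            residual = W @ X[i] - Z[i]      # error for this word pair
            # Gradient of ||W x - z||^2 with respect to W is 2 * residual * x^T.
            W -= lr * 2.0 * np.outer(residual, X[i])
    return W

# Synthetic check: generate targets from a known linear map and recover it.
rng = np.random.default_rng(1)
W_true = rng.normal(size=(3, 4))
X = rng.normal(size=(50, 4))
Z = X @ W_true.T
W = train_translation_matrix(X, Z)
max_err = np.abs(W - W_true).max()
```

Note that the source and target vectors need not have the same dimensionality, since W is a rectangular matrix of shape (d_tgt, d_src).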

2.3. Inference

At prediction time, any new word with continuous vector representation x can be mapped to the other language space by computing z=Wx.
Then, the word whose representation is closest to z in the target language space, using cosine similarity as the distance metric, is taken as the translation.
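
The inference step can be sketched in a few lines of NumPy. The helper name `translate`, the `top_k` parameter and the 2-D toy vectors are hypothetical and chosen only to make the example easy to follow by hand; here W is a 90-degree rotation.

```python
import numpy as np

def translate(x, W, target_vectors, target_words, top_k=1):
    """Map x into the target space and return the nearest target words."""
    z = W @ x
    # Normalize both sides so that the dot product equals cosine similarity.
    target_unit = target_vectors / np.linalg.norm(target_vectors, axis=1, keepdims=True)
    sims = target_unit @ (z / np.linalg.norm(z))
    best = np.argsort(-sims)[:top_k]        # indices of the most similar words
    return [target_words[i] for i in best]

# Toy setup (illustrative values, not real word vectors).
W = np.array([[0.0, -1.0],
              [1.0,  0.0]])
target_words = ["uno", "dos", "tres"]
target_vectors = np.array([[0.0, 1.0],
                           [-1.0, 0.0],
                           [1.0, 1.0]])
x = np.array([2.0, 0.1])                    # vector of some source word

result = translate(x, W, target_vectors, target_words)
```

Setting `top_k` above 1 returns a ranked list of candidates, which is convenient when evaluating with more than one guess per word.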

3. Datasets & Results

3.1. WMT11 Datasets & Google Translate (GT)

**The sizes of the monolingual training datasets from WMT11**

Monolingual datasets for English, Spanish and Czech are used.
To obtain dictionaries between languages, the most frequent words from the monolingual source datasets are translated using online Google Translate (GT).
To measure the accuracy, the most frequent 5K words from the source language and their translations given by GT are used as the training data for learning the Translation Matrix.
The subsequent 1K words in the source language and their translations are used as the test set.
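
The split described above can be sketched as follows, assuming the dictionary is already a list of (source, translation) pairs sorted by source-word frequency. The helper names `split_dictionary` and `precision_at_k`, and the tiny word list, are hypothetical. Scoring with precision at k (counting a hit when the reference translation is among the top k guesses) is an assumption about how accuracy is measured.

```python
def split_dictionary(pairs_by_frequency, n_train=5000, n_test=1000):
    """Most frequent n_train pairs for training, the next n_test for testing."""
    train = pairs_by_frequency[:n_train]
    test = pairs_by_frequency[n_train:n_train + n_test]
    return train, test

def precision_at_k(ranked_predictions, gold_translations, k=1):
    """Fraction of test words whose reference translation is in the top k guesses."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_predictions, gold_translations))
    return hits / len(gold_translations)

# Tiny illustrative dictionary, sorted by (pretend) frequency.
pairs = [("the", "el"), ("of", "de"), ("and", "y"),
         ("house", "casa"), ("dog", "perro"), ("cat", "gato")]
train, test = split_dictionary(pairs, n_train=4, n_test=2)

gold = [target for _, target in test]
ranked = [["perro", "lobo"], ["tigre", "gato"]]   # made-up ranked guesses
p_at_1 = precision_at_k(ranked, gold, k=1)
p_at_2 = precision_at_k(ranked, gold, k=2)
```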

3.2. Results

The authors mention that CBOW is usually faster, and thus CBOW is used in the experiments. But in the results section, the authors mention that they train the Skip-Gram model. (So I can't tell from the paper which model they actually use. Please tell me if you know.)

**Accuracy of the word translation methods using the WMT11 datasets**

Edit Distance (ED): uses the morphological structure of words to find the translation.
Word Co-occurrence: a count-based method that uses the similarity of the contexts in which words appear.
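
To make the ED baseline concrete, here is a minimal sketch that picks the target word with the smallest Levenshtein distance to the source word. Whether the paper uses exactly this variant of edit distance is an assumption, and the helper names and the toy vocabulary are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def translate_by_edit_distance(word, target_vocab):
    """Return the target word that is closest to `word` in edit distance."""
    return min(target_vocab, key=lambda candidate: edit_distance(word, candidate))

guess = translate_by_edit_distance("family", ["familia", "casa", "perro"])
```

Such a baseline only works for word pairs that look alike across languages, which is why it is a useful contrast to a method based on word vectors.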