# Brief Review — Exploiting Similarities among Languages for Machine Translation

## Early Paper for Word Translation Using Translation Matrix (TM)

Exploiting Similarities among Languages for Machine Translation, by Google Inc.

Translation Matrix (TM), 2013 arXiv, Over 1500 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Neural Machine Translation, NMT

- By training a **translation matrix (TM)** to **match the words in the source language to those in the target language**, word and phrase entries that are missing from a dictionary can also be translated.

# Outline

1. **CBOW & Skip-Gram**
2. **Translation Matrix (TM)**
3. **Datasets & Results**

# 1. CBOW & Skip-Gram

- (If you know the CBOW and Skip-Gram models in **Word2Vec**, please skip this section.)

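- Given a sequence of training words *w*1, *w*2, *w*3, …, *wT*, the objective of the **Skip-gram** model is to **maximize the average log probability**:

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-k \le j \le k,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

- where *k* is the size of the training window. **The probability of correctly predicting the word *wi* given the word *wj*** is defined as:

$$p(w_i \mid w_j) = \frac{\exp(u_{w_i}^{\top} v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top} v_{w_j})}$$

- where *V* is the **number of words in the vocabulary**, and *u* and *v* denote the output and input vector representations of the words.
- A similar formulation applies to the CBOW model.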
- (Please feel free to read Word2Vec for CBOW and Skip-Gram model for word prediction/language modeling.)
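
As a rough illustration of the softmax probability described above, here is a minimal sketch in plain Python. The function name `skipgram_prob` and the toy vectors are made up for this example; real Word2Vec training uses approximations such as hierarchical softmax or negative sampling rather than this full sum over the vocabulary.

```python
import math

def skipgram_prob(center_vec, output_vecs, target_index):
    """Return p(w_target | w_center) using a full softmax over the vocabulary.

    center_vec  : input vector v of the given (center) word
    output_vecs : list of output vectors u, one per vocabulary word
    """
    # Dot product u_l . v for every word l in the vocabulary
    scores = [sum(u_k * v_k for u_k, v_k in zip(u, center_vec))
              for u in output_vecs]
    # Subtract the max score for numerical stability (does not change the result)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return exps[target_index] / sum(exps)

# Toy vocabulary of V = 3 words with 2-dimensional vectors (hypothetical values)
output_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
center_vec = [2.0, 0.0]
probs = [skipgram_prob(center_vec, output_vecs, i) for i in range(3)]
print(probs)
```

The three probabilities sum to 1, and the word whose output vector points in the same direction as the center vector gets the largest share.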

# 2. Translation Matrix (TM)

## 2.1. Motivation

- The word vectors are visualized using PCA.

The vector representations of **similar words in different languages** were related by a **linear transformation**.

- e.g.: The word vectors for **English numbers one to five** and the corresponding **Spanish words uno to cinco** have **similar geometric arrangements.**
- Thus, **if we know** the translation of **one and four from English to Spanish**, we can **learn the transformation matrix** that **can help us translate the other numbers to Spanish as well.**

## 2.2. Translation Matrix (TM)

- Suppose we are given a set of word pairs and their associated vector representations {*xi*, *zi*}, where *i* is from 1 to *n*.
- *xi* is the vector from the **source** language. *zi* is the vector from the **target** language.
- The **goal** is to **find a transformation matrix** *W* such that *Wxi* approximates *zi*:

$$\min_{W} \sum_{i=1}^{n} \lVert W x_i - z_i \rVert^{2}$$

- which is solved with stochastic gradient descent (SGD).
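
To make the least-squares objective concrete, here is a minimal sketch of learning *W* with per-sample SGD in plain Python. The function name `train_translation_matrix`, the learning rate, the epoch count, and the toy vectors are all assumptions for illustration; the paper's actual training setup is not reproduced here.

```python
import random

def train_translation_matrix(X, Z, lr=0.05, epochs=200, seed=0):
    """Fit W so that W x_i approximates z_i, by SGD on sum_i ||W x_i - z_i||^2."""
    rng = random.Random(seed)
    d_src, d_tgt = len(X[0]), len(Z[0])
    W = [[0.0] * d_src for _ in range(d_tgt)]  # start from the zero matrix
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the word pairs in random order
        for i in order:
            x, z = X[i], Z[i]
            # Residual r = W x - z for this word pair
            r = [sum(W[a][b] * x[b] for b in range(d_src)) - z[a]
                 for a in range(d_tgt)]
            # Gradient of ||W x - z||^2 with respect to W is 2 * r * x^T
            for a in range(d_tgt):
                for b in range(d_src):
                    W[a][b] -= lr * 2.0 * r[a] * x[b]
    return W

# Toy data (made up): target vectors are source vectors rotated by 90 degrees
W_true = [[0.0, -1.0], [1.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
Z = [[sum(W_true[a][b] * x[b] for b in range(2)) for a in range(2)] for x in X]

W = train_translation_matrix(X, Z)
print(W)
```

On this noiseless toy problem the learned matrix should land very close to `W_true`. With real word vectors the fit is only approximate, which is why inference relies on a nearest-neighbor search rather than an exact match.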

## 2.3. Inference

- At prediction time, **for any given new word and its continuous vector representation *x***, it can be mapped to the other language space by **computing *z* = *Wx***.
- Then, the word whose representation is **closest to *z*** in the target language space, using **cosine similarity** as the distance metric, **is the translated word**.
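
The two inference steps above can be sketched as follows. The helper names (`matvec`, `cosine`, `translate`), the matrix `W`, and the tiny Spanish vocabulary with its 2-dimensional vectors are hypothetical values chosen only to show the mechanics.

```python
import math

def matvec(W, x):
    """Compute z = W x for a matrix stored as a list of rows."""
    return [sum(w_ab * x_b for w_ab, x_b in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(p * q for p, q in zip(a, b)) / (norm_a * norm_b)

def translate(x, W, target_vocab):
    """Map x into the target space and return the most similar target word."""
    z = matvec(W, x)
    return max(target_vocab, key=lambda word: cosine(z, target_vocab[word]))

# Hypothetical translation matrix and target-language word vectors
W = [[0.0, -1.0], [1.0, 0.0]]
target_vocab = {"uno": [0.0, 1.0], "dos": [-1.0, 0.0], "tres": [-1.0, 1.0]}

# A source vector that is assumed to belong to a new, unseen source word
x_new = [1.0, 0.1]
print(translate(x_new, W, target_vocab))
```

A linear scan over the vocabulary is fine for a sketch; with a large target vocabulary one would normally normalize the vectors once and use a matrix product or an approximate nearest-neighbor index instead.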

# 3. Results

## 3.1. WMT11 Datasets & Google Translate (GT)

- Monolingual datasets for English, Spanish, and Czech are used.
- To obtain dictionaries between languages, **the most frequent words from the monolingual source datasets** are used, and these words are **translated using** online **Google Translate (GT)**.
- To measure the accuracy, **the most frequent 5K words** from the source language and their translations given by GT are used as the **training data** for **learning the Translation Matrix**.
- **The subsequent 1K words** in the source language and their translations are used as a **test set**.
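
The frequency-based split described above might look like the sketch below. The function names `frequency_split` and `top1_accuracy` and the toy token list are assumptions for illustration; in the setup described here the reference translations would come from Google Translate, which is not called in this example.

```python
from collections import Counter

def frequency_split(tokens, n_train=5000, n_test=1000):
    """Rank words by frequency; the top n_train form the training dictionary
    and the next n_test form the test set."""
    ranked = [word for word, _ in Counter(tokens).most_common(n_train + n_test)]
    return ranked[:n_train], ranked[n_train:n_train + n_test]

def top1_accuracy(predicted, reference):
    """Fraction of test words whose predicted translation matches the reference."""
    hits = sum(1 for word in reference if predicted.get(word) == reference[word])
    return hits / len(reference)

# Toy corpus (made up); a real run would use n_train=5000 and n_test=1000
tokens = "the the the the cat cat cat dog dog bird".split()
train_words, test_words = frequency_split(tokens, n_train=2, n_test=1)
print(train_words, test_words)

# Hypothetical reference and predicted translations for two test words
reference = {"dog": "perro", "bird": "pájaro"}
predicted = {"dog": "perro", "bird": "gato"}
print(top1_accuracy(predicted, reference))
```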

## 3.2. Results

- The authors state that, in terms of speed, CBOW is usually faster, and thus CBOW is used in the experiments. But in the results section, they mention that they train the Skip-Gram model. (So I cannot tell from the paper which model they used. Please tell me if you know.)

- **Edit Distance (ED)**: uses the morphological structure of words to find the translation.
- **Word Co-occurrence**: a count-based method that uses the similarity of the contexts in which words appear.
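
To give a feel for the Edit Distance idea, here is a minimal sketch that picks the target word with the smallest Levenshtein distance to the source word. This is an assumed, simplified reading of the approach, not the paper's exact implementation, and the function names and toy vocabulary are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]

def translate_by_edit_distance(word, target_vocab):
    """Return the target word that is closest to `word` in edit distance."""
    return min(target_vocab, key=lambda candidate: edit_distance(word, candidate))

# Toy Spanish vocabulary (hypothetical)
target_vocab = ["problema", "perro", "casa"]
print(translate_by_edit_distance("problem", target_vocab))
```

This kind of approach only works for word pairs that look alike in both languages (such as cognates), which helps explain why it trails a method based on learned word vectors.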

**Translation Matrix (TM)** has much higher accuracy.

The above figure shows how the **performance improves as the amount of monolingual data increases.**

The above table shows some of the **translation examples.**

(I came across this paper by accident while reading a language model paper.)

## Reference

[2013 arXiv] [Translation Matrix (TM)]

Exploiting Similarities among Languages for Machine Translation
