# Review: Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE)

## Using Noise-Contrastive Estimation (NCE) for Efficient Learning

In this story, **Learning Word Embeddings Efficiently with Noise-Contrastive Estimation** (NCE), by DeepMind, is briefly reviewed.

- Continuous-valued **word embeddings** learned by neural language models have recently been shown to **capture semantic and syntactic information**. However, a large amount of data is needed for training, which makes the **training time long**.

In this paper:

- A **simple** and **scalable** training method based on **Noise-Contrastive Estimation** is used to shorten the training time.

This is a paper in **2013 NeurIPS** with over **600 citations**. (Sik-Ho Tsang @ Medium) **Noise-Contrastive Estimation (NCE)** is another basic concept of **Contrastive Learning** in **self-supervised learning**.

# Outline

1. **Softmax in Word2Vec**
2. **Noise-Contrastive Estimation (NCE)**
3. **Experimental Results**

# 1. Softmax in Word2Vec

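- In Word2Vec, **Softmax** is used at the end for CBOW and Skip-gram models. Writing $s_\theta(w, h)$ for the score of word $w$ in context $h$, it has the form:

$$P_\theta^h(w) = \frac{\exp(s_\theta(w, h))}{\sum_{w'} \exp(s_\theta(w', h))}$$

- However, the **denominator** needs to **sum up the exponential values for all words in the vocabulary**. When there are millions of words, the process is **slow**.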
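The cost of that normalizing sum can be seen in a minimal sketch. The five-word vocabulary and the score values below are made up for illustration; the point is only that computing a single probability touches one exponential per vocabulary word.

```python
import math

def softmax_prob(scores, target):
    """Normalized probability of `target`, given one unnormalized score per vocabulary word."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]   # one exp per word in the vocabulary
    return exps[target] / sum(exps)            # the denominator sums over the whole vocabulary

# Hypothetical scores s(w, h) for a toy 5-word vocabulary and one fixed context h.
scores = [2.0, 1.0, 0.5, -1.0, 0.0]
probs = [softmax_prob(scores, w) for w in range(len(scores))]
```

With a vocabulary of millions of words, the list comprehension inside `softmax_prob` is the part that dominates training time.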

# 2. Noise-Contrastive Estimation (NCE)

## 2.1. NCE

NCE is based on the reduction of density estimation to probabilistic binary classification. The basic idea is to train a logistic regression classifier to discriminate between samples from the data distribution and samples from some “noise” distribution, based on the ratio of probabilities of the sample under the model and the noise distribution.

- We would like to learn the distribution of words for some specific context $h$, denoted by $P^h(w)$. **A binary classifier is learnt** to distinguish the **positive samples**, i.e. the **training samples**, from the **negative examples**, i.e. the **samples from a noise distribution** $P_n(w)$.
- Assuming that **noise samples are $k$ times more frequent than data samples**, the probability that the given sample came from the data (correct label $D=1$) is:

$$P^h(D=1 \mid w, \theta) = \frac{P_\theta^h(w)}{P_\theta^h(w) + k P_n(w)} = \sigma(\Delta s_\theta(w, h))$$

- where $\sigma(x)$ is the logistic function and $\Delta s_\theta(w, h) = s_\theta(w, h) - \log(k P_n(w))$ is the **difference** in the **scores of word $w$** under the model and **the (scaled) noise distribution**.
- And $s_\theta(w, h)$ is the score given by the **network model** used in this paper.
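As a quick numerical check of the posterior described above, the following sketch computes it in two ways: through the logistic function of the score difference, and through the ratio of the unnormalized model probability exp(s(w, h)) to itself plus k times the noise probability. The numbers are hypothetical and the function names are mine, not the paper's.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nce_posterior(score, noise_prob, k):
    """P(D=1 | w, h) = sigmoid(s(w, h) - log(k * Pn(w)))."""
    delta = score - math.log(k * noise_prob)
    return sigmoid(delta)

# Hypothetical values for a single word w in a context h.
score = -2.0        # s(w, h); the unnormalized model probability is exp(score)
noise_prob = 0.01   # Pn(w)
k = 5               # noise samples per data sample

posterior = nce_posterior(score, noise_prob, k)
ratio_form = math.exp(score) / (math.exp(score) + k * noise_prob)
```

Both forms agree up to floating-point error, which is why the classifier only ever needs the score and the noise probability of the word at hand, never a sum over the vocabulary.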

By using NCE, the summation in the denominator of the Softmax function can be skipped, because NCE can train an unnormalized model.

NCE training time is linear in the number of noise samples and independent of the vocabulary size. As the number of noise samples $k$ increases, the NCE gradient estimate approaches the likelihood gradient of the normalized model.

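- The model is optimized by maximizing the log-posterior probability of the correct labels $D$, averaged over the data and noise samples:

$$J^h(\theta) = E_{P_d^h}\left[\log \sigma(\Delta s_\theta(w, h))\right] + k\, E_{P_n}\left[\log\left(1 - \sigma(\Delta s_\theta(w, h))\right)\right]$$

- where $P_d^h$ is the data distribution for context $h$ and the expectation over the noise distribution is approximated by sampling. For an observed word $w$ in context $h$ with noise samples $x_1, \dots, x_k$, the gradient is:

$$\frac{\partial}{\partial \theta} J^{h,w}(\theta) = \left(1 - \sigma(\Delta s_\theta(w, h))\right) \frac{\partial}{\partial \theta} \log P_\theta^h(w) - \sum_{i=1}^{k} \sigma(\Delta s_\theta(x_i, h)) \frac{\partial}{\partial \theta} \log P_\theta^h(x_i)$$

- where the gradient involves **a sum over $k$ noise samples instead of a sum over the entire vocabulary**, making the NCE training time linear in the number of noise samples and **independent of the vocabulary size**.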
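The per-example objective can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' implementation: `toy_score`, `toy_scores`, and `noise_probs` are hypothetical stand-ins for the network score and the noise distribution, and the context is represented by a string key.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_objective(score_fn, context, target, noise_probs, k, rng):
    """NCE objective for one observed word plus k sampled noise words."""
    def delta(w):
        # Score difference between the model and the scaled noise distribution.
        return score_fn(w, context) - math.log(k * noise_probs[w])

    vocab = range(len(noise_probs))
    noise_words = rng.choices(vocab, weights=noise_probs, k=k)
    objective = log_sigmoid(delta(target))     # log P(D=1 | observed word)
    for x in noise_words:
        objective += log_sigmoid(-delta(x))    # log(1 - sigmoid(d)) equals log(sigmoid(-d))
    return objective, noise_words

# Hypothetical 4-word vocabulary: scores for one context and a noise distribution.
toy_scores = {"the cat": [0.5, 2.0, -1.0, 0.0]}
noise_probs = [0.4, 0.3, 0.2, 0.1]

def toy_score(w, h):
    return toy_scores[h][w]

rng = random.Random(0)
objective, noise_words = nce_objective(toy_score, "the cat", 1, noise_probs, k=5, rng=rng)
```

The loop runs over the `k` sampled words only, so the work per training example does not grow with the vocabulary size.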

## 2.2. Log-Bilinear Language (LBL) Models

- (The models here are not the main point for learning NCE.)
- In this paper, the **log-bilinear language model (LBL)**, which was proposed by Prof. Hinton in 2007, is used as the baseline for development.
- It is a very simple neural language model. The LBL model performs **linear prediction in the word feature vector space** and has neither non-linearities nor hidden layers.
- **Vector LBL (vLBL)** is the model that predicts a word from its surrounding words.
- **ivLBL** is the **inverse of vLBL**, which predicts the words surrounding the current word. It is **similar to the Skip-gram model in Word2Vec but without position-dependent weights**.
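To make "linear prediction in the word feature vector space" concrete, here is a rough sketch of how such scores could be computed without position-dependent weights. It reflects my reading of the model (average the context word vectors, then take a dot product with the target word vector and add a bias) and should be treated as an assumption rather than the paper's exact formulation. The embedding tables `R` and `Q`, the bias vector `b`, and their sizes are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vlbl_score(context_ids, target_id, R, Q, b):
    """Score a target word from the average of its context word vectors (no hidden layer)."""
    dim = len(R[0])
    q_hat = [sum(R[c][d] for c in context_ids) / len(context_ids) for d in range(dim)]
    return dot(q_hat, Q[target_id]) + b[target_id]

def ivlbl_score(center_id, context_id, R, Q, b):
    """Inverse direction: score one surrounding word from the current word's vector."""
    return dot(R[center_id], Q[context_id]) + b[context_id]

# Hypothetical tiny model: 4 words, 3-dimensional embeddings, random parameters.
rng = random.Random(0)
vocab_size, dim = 4, 3
R = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]  # context vectors
Q = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]  # target vectors
b = [rng.uniform(-1, 1) for _ in range(vocab_size)]                        # per-word biases

s_vlbl = vlbl_score([0, 2], 1, R, Q, b)   # predict word 1 from context words 0 and 2
s_ivlbl = ivlbl_score(1, 0, R, Q, b)      # predict surrounding word 0 from current word 1
```

Either score can play the role of s(w, h) in the NCE objective, since both are plain unnormalized real numbers.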

# 3. Experimental Results

- **Two analogy-based word similarity tasks** released by **Google** and **Microsoft Research (MSR)** are tested. (Please feel free to read the dataset details and experimental setup in the paper.)
- **NCE$k$** denotes **NCE training using $k$ noise samples**.
- **All NCE-trained models outperformed the Skip-gram in Word2Vec**. Accuracy steadily increased with the number of noise samples used, as did the training time.
- The **best compromise** between running time and performance seems to be achieved with **5 or 10 noise samples**.

- The **300D ivLBL model trained for just over a day** achieves accuracy scores **3–9 percentage points better than the 300D Skip-gram trained for almost twice as long**.
- **The same model trained for four days** achieves accuracy scores that are **only 2–4 percentage points lower than those of the 1000D Skip-gram trained on four times as much data using 75 times as many CPU cycles**.

- Surprisingly, the results show that representations learned with position-independent weights, designated with (I), tend to perform better than the ones learned with position-dependent weights.

- Even the model with the lowest embedding **dimensionality of 100** achieves **51.0% correct**, compared to **48.0% correct** reported in Word2Vec for the **Skip-gram** model with **640D embeddings**.
- The **55.5% correct** achieved by the model with **600D embeddings** is also **better than the best single-model** score on this dataset in the literature (54.7% in [14]).

Another work, “Distributed Representations of Words and Phrases and their Compositionality”, also in 2013 NeurIPS, proposes **negative sampling**. The idea is very similar to NCE and will be talked about later.

## Reference

[2013 NeurIPS] [NCE]

Learning Word Embeddings Efficiently with Noise-Contrastive Estimation

**Language Model:** **2007** [Bengio TNN’07] **2013** [Word2Vec] [NCE]

**Machine Translation:** **2014** [Seq2Seq] [RNN Encoder-Decoder] **2015** [Attention Decoder/RNNSearch]

**Image Captioning:** **2015** [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]