Review: Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE)

Using Noise-Contrastive Estimation (NCE) for Efficient Learning

Sik-Ho Tsang
5 min read · Oct 30, 2021

In this story, Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE), by DeepMind, is briefly reviewed.

  • Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information. However, a large amount of data is needed for training, which makes the training time long.

In this paper:

  • A simple and scalable training approach based on Noise-Contrastive Estimation (NCE) is used to shorten the training time.

This is a 2013 NeurIPS paper with over 600 citations. (Sik-Ho Tsang @ Medium) Noise-Contrastive Estimation (NCE) is also a basic concept behind contrastive learning in self-supervised learning.


  1. Softmax in Word2Vec
  2. Noise-Contrastive Estimation (NCE)
  3. Experimental Results

1. Softmax in Word2Vec

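  • In Word2Vec, Softmax is used at the output of the CBOW and Skip-gram models. Writing s(w, h) for the score of word w in context h, it has the form $P(w \mid h) = \exp(s(w, h)) / \sum_{w'} \exp(s(w', h))$, where the sum runs over every word w' in the vocabulary.
  • However, the denominator sums the exponentiated scores of all words in the vocabulary. When there are millions of words, this is slow.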
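
To make the cost concrete, here is a minimal Python sketch of a full softmax over a toy vocabulary. The function name `softmax_prob` and the toy scores are assumptions made for illustration, not code from the paper or from Word2Vec. The point is only that the normalizer has to visit every word in the vocabulary.

```python
import math

def softmax_prob(scores, target):
    """Probability of `target` under a softmax over all vocabulary scores.

    `scores` maps each word to its unnormalized score for one fixed
    context (toy values, assumed for illustration).
    """
    # Subtract the max score for numerical stability.
    m = max(scores.values())
    # The normalizer loops over the whole vocabulary: O(|V|) per prediction.
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[target] - m) / z

toy_scores = {"cat": 2.0, "dog": 1.0, "car": -1.0}
p_cat = softmax_prob(toy_scores, "cat")
```

With three words the loop is trivial, but the same loop over millions of words runs once per training example, which is the bottleneck NCE avoids.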

2. Noise-Contrastive Estimation (NCE)

2.1. NCE

NCE is based on the reduction of density estimation to probabilistic binary classification. The basic idea is to train a logistic regression classifier to discriminate between samples from the data distribution and samples from some “noise” distribution, based on the ratio of probabilities of the sample under the model and the noise distribution.

  • To learn the distribution of words for a specific context h, denoted by Ph(w), a binary classifier is trained to distinguish positive samples, i.e. the training samples, from negative samples, i.e. samples drawn from a noise distribution Pn(w).
  • Assuming that noise samples are k times more frequent than data samples, the probability that a given sample came from the data (correct label D=1) is $P(D=1 \mid w, h) = \sigma(\Delta(w, h))$,
  • where σ(x) is the logistic function and Δ(w, h) = s(w, h) - log(k Pn(w)) is the difference in the scores of word w under the model and the (scaled) noise distribution.
  • Here, s(w, h) is the score that the network model used in this paper assigns to word w in context h.
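
As a small worked example of the posterior described above, the following Python sketch evaluates σ(s(w, h) - log(k Pn(w))) for one word. The helper names `sigmoid` and `prob_from_data` and all numeric values are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def prob_from_data(score, noise_prob, k):
    """P(D=1 | w, h) = sigmoid(s(w, h) - log(k * Pn(w)))."""
    delta = score - math.log(k * noise_prob)
    return sigmoid(delta)

# Toy values (assumed): score s(w, h) = log 0.2, so exp(s) = 0.2,
# noise probability Pn(w) = 0.05, and k = 4 noise samples per data sample.
score = math.log(0.2)
p = prob_from_data(score, noise_prob=0.05, k=4)

# The same quantity written as a ratio of (unnormalized) probabilities.
ratio = math.exp(score) / (math.exp(score) + 4 * 0.05)
```

Here exp(s) equals k Pn(w), so the classifier is undecided and both expressions evaluate to about 0.5. Raising the score pushes the posterior toward 1, and raising k or Pn(w) pushes it toward 0.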

With NCE, the summation in the denominator of the Softmax function can be skipped, because NCE can train an unnormalized model.

NCE training time is linear in the number of noise samples and independent of the vocabulary size.

As the number of noise samples k increases, the NCE gradient approaches the maximum-likelihood gradient of the normalized model.
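
The limiting behaviour can be checked numerically. Differentiating the expected NCE objective for a fixed context shows that each word's score gradient is weighted by k Pn(w) / (P(w) + k Pn(w)) · (Pd(w) - P(w)), where Pd is the data distribution and P the model distribution, while maximum likelihood uses the weight Pd(w) - P(w). The sketch below compares the two on a toy three-word vocabulary. The function names and all probabilities are assumptions for illustration.

```python
def nce_gradient_weights(p_data, p_model, p_noise, k):
    """Per-word weights of the expected NCE gradient for one context."""
    return [
        (k * pn) / (pm + k * pn) * (pd - pm)
        for pd, pm, pn in zip(p_data, p_model, p_noise)
    ]

def ml_gradient_weights(p_data, p_model):
    """Per-word weights of the maximum-likelihood gradient."""
    return [pd - pm for pd, pm in zip(p_data, p_model)]

# Toy distributions over a 3-word vocabulary (assumed values).
p_data = [0.7, 0.2, 0.1]
p_model = [0.4, 0.4, 0.2]
p_noise = [0.5, 0.3, 0.2]

ml = ml_gradient_weights(p_data, p_model)
gaps = {}
for k in (1, 10, 100, 1000):
    nce = nce_gradient_weights(p_data, p_model, p_noise, k)
    # Largest per-word difference between the NCE and ML weights.
    gaps[k] = max(abs(a - b) for a, b in zip(nce, ml))
```

The entries of `gaps` shrink as k grows, which is the sense in which the NCE gradient approaches the maximum-likelihood gradient.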

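  • The model is optimized by maximizing the log-posterior probability of the correct labels D averaged over the data and noise samples: $J^h = \mathbb{E}_{w \sim P^h}\left[\log \sigma(\Delta(w, h))\right] + k\, \mathbb{E}_{w \sim P_n}\left[\log\left(1 - \sigma(\Delta(w, h))\right)\right]$,
  • where the expectation over the noise distribution is approximated by sampling. For one data word w and k noise samples $x_1, \dots, x_k$, the gradient with respect to the model parameters θ is $\left(1 - \sigma(\Delta(w, h))\right) \frac{\partial}{\partial \theta} s(w, h) - \sum_{i=1}^{k} \sigma(\Delta(x_i, h)) \frac{\partial}{\partial \theta} s(x_i, h)$.
  • The gradient thus involves a sum over k noise samples instead of a sum over the entire vocabulary, making the NCE training time linear in the number of noise samples and independent of the vocabulary size.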
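
The objective described above can be sketched for a single (context, word) pair as follows. This is a minimal illustration, not the paper's implementation: `nce_objective`, the dictionary-based scoring function, the toy noise distribution, and the fixed random seed are all assumptions.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_objective(score_fn, noise_probs, context, target, k, rng):
    """Sampled NCE objective for one data word and k noise words."""
    words = list(noise_probs)
    weights = [noise_probs[w] for w in words]
    # Draw k noise words from Pn.
    noise_words = rng.choices(words, weights=weights, k=k)

    def delta(w):
        # Delta(w, h) = s(w, h) - log(k * Pn(w))
        return score_fn(w, context) - math.log(k * noise_probs[w])

    # Data term: log P(D=1 | target, h).
    obj = log_sigmoid(delta(target))
    # Noise terms: log P(D=0 | x, h) = log(1 - sigmoid(delta)) = log_sigmoid(-delta).
    for x in noise_words:
        obj += log_sigmoid(-delta(x))
    return obj, noise_words

# Toy scores and noise distribution (assumed values).
toy_scores = {"cat": 1.5, "dog": 0.5, "car": -2.0}
noise = {"cat": 0.3, "dog": 0.3, "car": 0.4}
obj, sampled = nce_objective(
    lambda w, h: toy_scores[w], noise, ("the",), "cat", 5, random.Random(0)
)
```

Note that the scoring function is called only k + 1 times per training pair, regardless of how many words the vocabulary contains.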

2.2. Log-Bilinear Language (LBL) Models

  • (The models here are not essential for understanding NCE.)
  • In this paper, the log-bilinear language model (LBL), which was proposed by Mnih and Hinton in 2007, is used as the baseline for development.
  • It is a very simple neural language model. The LBL model performs linear prediction in the word feature vector space, with no non-linearities and no hidden layers.
  • Vector LBL (vLBL) is the model that predicts a word from its surrounding words.
  • ivLBL is the inverse vLBL, which predicts the words surrounding the current word. It is similar to Skip-gram model in Word2Vec but without position-dependent weights.
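
To illustrate what "linear prediction in the word feature vector space" can look like, here is a hedged Python sketch of a position-independent vLBL-style score: the context embeddings are averaged, and the result is compared with the target embedding by a dot product plus a bias. The function name `vlbl_score`, the use of plain lists, and the toy vectors are assumptions for illustration. Please refer to the paper for the exact parameterization.

```python
def vlbl_score(context_vecs, target_vec, bias):
    """Score of a target word given context word vectors (no non-linearity)."""
    n = len(context_vecs)
    dim = len(target_vec)
    # Predicted representation: plain average of the context embeddings.
    predicted = [sum(vec[d] for vec in context_vecs) / n for d in range(dim)]
    # Score: dot product with the target embedding, plus a per-word bias.
    return sum(p * t for p, t in zip(predicted, target_vec)) + bias

context = [[1.0, 0.0], [0.0, 1.0]]  # two context-word embeddings (toy values)
target = [2.0, 4.0]                 # target-word embedding (toy value)
score = vlbl_score(context, target, bias=0.5)
```

A score of this kind plays the role of s(w, h) in the NCE classifier, so no normalization over the vocabulary is needed during training.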

3. Experimental Results

  • Two analogy-based word similarity tasks, released by Google and Microsoft Research (MSR), are used for evaluation. (Please feel free to read the dataset details and experimental setup in the paper.)
[Table: Accuracy in percent on word similarity tasks]
  • NCEk denotes NCE training using k noise samples.
  • All NCE-trained models outperformed the Skip-gram in Word2Vec. Accuracy steadily increased with the number of noise samples used, as did the training time.
  • The best compromise between running time and performance seems to be achieved with 5 or 10 noise samples.
[Table: Accuracy in percent on word similarity tasks for large models]
  • The 300D ivLBL model, trained for just over a day, achieves accuracy scores 3–9 percentage points better than the 300D Skip-gram trained for almost twice as long.
  • The same model trained for four days achieves accuracy scores that are only 2–4 percentage points lower than those of the 1000D Skip-gram trained on four times as much data using 75 times as many CPU cycles.
[Table: Results for various models trained for 20 epochs on the 47M-word Gutenberg dataset using NCE5 with AdaGrad. (D) and (I) denote models with and without position-dependent weights, respectively.]
  • Surprisingly, the results show that representations learned with position-independent weights, designated with (I), tend to perform better than the ones learned with position-dependent weights.
[Table: Accuracy on the MSR Sentence Completion Challenge dataset]
  • Even the model with the lowest embedding dimensionality, 100, achieves 51.0% correct, compared to 48.0% correct reported in Word2Vec for the Skip-gram model with 640D embeddings.
  • The 55.5% correct achieved by the model with 600D embeddings is also better than the best single-model score on this dataset in the literature (54.7% in [14]).

Another 2013 NeurIPS paper, “Distributed Representations of Words and Phrases and their Compositionality”, proposes negative sampling. The idea is very similar to NCE and will be discussed later.


