Review: Distributed Representations of Words and Phrases and their Compositionality (Negative Sampling)

Negative Sampling to Speed Up Skip-gram Model Learning

Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities
  • By subsampling the frequent words, a significant speedup is obtained and more regular word representations are learned.

Outline

  1. Skip-Gram Model in Word2Vec
  2. Negative Sampling
  3. Subsampling of Frequent Words
  4. Experimental Results

1. Skip-Gram Model in Word2Vec

  • In Word2Vec, the Skip-gram model aims to find word representations that are useful for predicting the surrounding words in a sentence or a document.
  • More formally, given a sequence of training words w_1, w_2, w_3, …, w_T, the objective of the Skip-gram model is to maximize the average log probability:
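\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)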
  • where c is the size of the training context (which can be a function of the center word w_t).
  • Larger c results in more training examples and thus can lead to a higher accuracy, at the expense of the training time.
  • The basic Skip-gram formulation defines p(w_{t+j} | w_t) using the softmax function:
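p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}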
  • where v_w and v′_w are the “input” and “output” vector representations of w, and W is the number of words in the vocabulary.
  • This formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10⁵–10⁷ terms).
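As a minimal sketch of why this is costly (the names `softmax_prob`, `v_in`, and `v_out` are my own, not from the paper), computing a single probability requires a dot product against every output vector in the vocabulary:

```python
import numpy as np

def softmax_prob(w_in, w_out, v_in, v_out):
    """Full softmax p(w_out | w_in); the cost of this (and of its gradient)
    grows linearly with the vocabulary size W."""
    scores = v_out @ v_in[w_in]            # W dot products, one per vocabulary word
    scores -= scores.max()                 # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_out]

# v_in, v_out: (W, d) arrays of input/output vectors; w_in, w_out: integer word indices
```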

2. Negative Sampling

  • There are two approaches proposed in this paper.
  • The first one is Hierarchical Softmax, which is a computationally efficient approximation of the full softmax that uses a binary Huffman tree. (I will not focus on Hierarchical Softmax in this story.)
  • The second one is negative sampling.
  • In this paper, Noise Contrastive Estimation (NCE) is simplified into Negative Sampling (NEG), which is defined by the objective:
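\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]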
  • which is used to replace every log p(w_O | w_I) term in the Skip-gram objective.
  • The experiments in this paper indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets k can be as small as 2–5.
  • A number of choices can be made for Pn(w), and it is found that the unigram distribution U(w) raised to the 3/4 power (i.e., U(w)^(3/4)/Z) significantly outperformed both the unigram and the uniform distributions.
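A minimal NumPy sketch of the NEG objective and the U(w)^(3/4) noise distribution (the function and variable names are my own, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_loss(v_wI, vprime_wO, vprime_neg):
    """Negative-sampling loss for one (input, output) pair.
    v_wI: (d,) input vector; vprime_wO: (d,) output vector of the true context word;
    vprime_neg: (k, d) output vectors of k noise words drawn from Pn(w)."""
    pos = np.log(sigmoid(vprime_wO @ v_wI))
    neg = np.log(sigmoid(-vprime_neg @ v_wI)).sum()
    return -(pos + neg)                    # minimise the negated objective

def noise_distribution(word_counts):
    """Pn(w) proportional to U(w)^(3/4), the choice reported to work best."""
    p = np.asarray(word_counts, dtype=float) ** 0.75
    return p / p.sum()

# Example: draw k = 5 noise words for one positive pair
# Pn = noise_distribution(counts)
# noise_ids = np.random.choice(len(Pn), size=5, p=Pn)
```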

3. Subsampling of Frequent Words

  • In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information value than the rare words.
  • To counter the imbalance between the rare and frequent words, a simple subsampling approach is used: each word w_i in the training set is discarded with probability computed by the formula:
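P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}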
  • where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10⁻⁵. It aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies.
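A minimal sketch of this discard rule (the function name and the example numbers are my own illustration, not figures from the paper):

```python
import numpy as np

def discard_prob(freq, t=1e-5):
    """P(discard w_i) = 1 - sqrt(t / f(w_i)); only words with f(w_i) > t are affected."""
    return max(0.0, 1.0 - np.sqrt(t / freq))

# A word that makes up 1% of the corpus is dropped about 96.8% of the time,
# while a word with frequency below t is never dropped.
# discard_prob(0.01)  -> ~0.968
# discard_prob(1e-6)  -> 0.0
```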

4. Experimental Results

4.1. NEG vs HS vs NCE

Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task
  • The above table shows that Negative Sampling (NEG) outperforms Hierarchical Softmax (HS) on the analogical reasoning task, and even performs slightly better than Noise Contrastive Estimation (NCE).
  • The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.

4.2. Learning Phrases

  • The original Word2Vec does not handle phrases (tokens made up of more than one word).
  • To learn vector representations for phrases, words that appear frequently together but infrequently in other contexts are first identified. For example, “New York Times” and “Toronto Maple Leafs” are replaced by unique tokens in the training data, while a bigram like “this is” remains unchanged.
  • A simple data-driven approach is used, where phrases are formed based on the unigram and bigram counts, using:
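\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}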
  • The δ is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words from being formed. Thus, many reasonable phrases are formed without greatly increasing the size of the vocabulary.
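A minimal sketch of this scoring rule (the function name, the count inputs, and the example values are my own illustration):

```python
def phrase_score(count_bigram, count_wi, count_wj, delta):
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj));
    bigrams whose score exceeds a chosen threshold are merged into a single token."""
    return (count_bigram - delta) / (count_wi * count_wj)

# Example with made-up counts: merge the bigram into one token if it scores high enough
# if phrase_score(count_bigram=5000, count_wi=80000, count_wj=60000, delta=5) > threshold:
#     token = "new_york"
```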
Examples of the analogical reasoning task for phrases
  • The above table shows examples of the five categories of analogies used in this task.
Accuracies of the Skip-gram models on the phrase analogy dataset
  • The models were trained on approximately one billion words from the news dataset.
  • The results show that while Negative Sampling achieves a respectable accuracy even with k=5, using k=15 achieves considerably better performance.
  • Surprisingly, while Hierarchical Softmax (HS) achieves lower performance when trained without subsampling, it becomes the best performing method when the frequent words are downsampled.
  • This shows that the subsampling can result in faster training and can also improve accuracy, at least in some cases.
  • When the amount of training data is increased to a dataset with about 33 billion words, HS reaches an accuracy of 72%.
  • When the training dataset is reduced to 6B words, HS achieves a lower accuracy of 66%.
Examples of the closest entities to the given short phrases, using two different models

4.3. Additive Compositionality

Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model
  • The Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations.
  • Thus, if “Volga River” appears frequently in the same sentence together with the words “Russian” and “river”, the sum of these two word vectors results in a feature vector that is close to the vector of “Volga River”.
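A minimal sketch of this element-wise addition, assuming a `vocab` list, a matching `(V, d)` matrix `vectors`, and an `idx` lookup dict (all hypothetical names, not part of the paper's code):

```python
import numpy as np

def closest_tokens(query_vec, vocab, vectors, topn=4):
    """Return the topn tokens whose vectors are most cosine-similar to query_vec."""
    sims = vectors @ query_vec / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec))
    return [vocab[i] for i in np.argsort(-sims)[:topn]]

# composed = vectors[idx["Russian"]] + vectors[idx["river"]]
# closest_tokens(composed, vocab, vectors)   # "Volga River" is expected to rank near the top
```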

4.4. SOTA Comparison

Examples of the closest tokens given various well known models and the Skip-gram model trained on phrases using over 30 billion training words
  • The Skip-gram model achieves the best performance by a huge margin. (No quantitative numbers for this comparison are provided in the paper.)
  • The big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations.
  • This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work.
  • Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of that required by the previous model architectures.

Reference

[2013 NIPS] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J., “Distributed Representations of Words and Phrases and their Compositionality”
