Review: Distributed Representations of Words and Phrases and their Compositionality (Negative Sampling)
Negative Sampling to Speed Up Language Model Learning
In this story, Distributed Representations of Words and Phrases and their Compositionality (Negative Sampling), by Google, is reviewed. In this paper:
- Negative sampling is proposed as a simplified, computationally cheap alternative to the full softmax, which speeds up training.
- By subsampling the frequent words, a further significant speedup is obtained and more regular word representations are learned.
This is a 2013 NeurIPS paper with over 30,000 citations. (Sik-Ho Tsang @ Medium) It is the follow-up work to Word2Vec, published in 2013 ICLR. The idea is similar to that of the Noise Contrastive Estimation (NCE) paper.
Outline
- Skip-Gram Model in Word2Vec
- Negative Sampling
- Subsampling of Frequent Words
- Experimental Results
1. Skip-Gram Model in Word2Vec
- In Word2Vec, the skip-gram model is used to find word representations that are useful for predicting the surrounding words in a sentence or a document.
- More formally, given a sequence of training words w1, w2, w3, …, wT, the objective of the Skip-gram model is to maximize the average log probability:
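In the paper's notation, this objective is:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$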
- where c is the size of the training context (which can be a function of the center word wt).
- Larger c results in more training examples and thus can lead to a higher accuracy, at the expense of the training time.
- The basic Skip-gram formulation defines p(wt+j | wt) using the softmax function:
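With wI the input (center) word and wO the predicted (context) word, the softmax is:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$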
- where vw and v′w are the “input” and “output” vector representations of w, and W is the number of words in the vocabulary.
- This formulation is impractical because the cost of computing ∇log p(wO|wI) is proportional to W, which is often large (10⁵–10⁷ terms).
That is, the denominator of the softmax needs to sum the exponential values over all words in the vocabulary. When there are millions of words, this computation is slow.
2. Negative Sampling
- There are two approaches proposed in this paper.
- The first one is Hierarchical Softmax, which is a computationally efficient approximation of the full softmax, with the use of a binary Huffman tree. (But I will not focus on Hierarchical Softmax in this story.)
- The second one is negative sampling.
- In this paper, Noise Contrastive Estimation (NCE) is simplified as Negative sampling (NEG), which is defined by the objective:
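With σ the sigmoid function, the NEG objective is:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$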
- which is used to replace every log P(wO|wI) term in the Skip-gram objective.
Thus the task is to distinguish the target word wO from draws from the noise distribution Pn(w) using logistic regression, where there are k negative samples for each data sample.
- The experiments in this paper indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets k can be as small as 2–5.
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.
And while NCE approximately maximizes the log probability of the softmax, this property is not important for the application in this paper.
- A number of choices can be made for Pn(w), and it is found that the unigram distribution U(w) raised to the 3/4 power (i.e., U(w)^(3/4)/Z) significantly outperformed both the unigram and the uniform distributions.
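As an illustration of how NEG and the U(w)^(3/4) noise distribution fit together, here is a minimal NumPy sketch of one training step; the vocabulary size, counts, and learning rate are made-up values for the example, not the paper's settings (the original implementation is the word2vec C code).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: vocabulary size, embedding dimension, negatives per sample.
V, D, k = 1000, 100, 5
counts = rng.integers(1, 1000, size=V).astype(np.float64)  # made-up unigram counts

# Noise distribution Pn(w) proportional to U(w)^(3/4), as recommended in the paper.
noise_probs = counts ** 0.75
noise_probs /= noise_probs.sum()

# "Input" and "output" vectors (v_w and v'_w in the paper's notation).
W_in = rng.normal(scale=0.01, size=(V, D))
W_out = rng.normal(scale=0.01, size=(V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(w_I, w_O, lr=0.025):
    """One SGD step on the NEG objective for a (center word, context word) pair."""
    negatives = rng.choice(V, size=k, p=noise_probs)      # k words drawn from Pn(w)
    targets = np.concatenate(([w_O], negatives))          # 1 positive + k negatives
    labels = np.array([1.0] + [0.0] * k)                  # logistic-regression labels

    v_in = W_in[w_I]                                      # v_{w_I}
    v_out = W_out[targets]                                # v'_w for the positive and negatives
    scores = sigmoid(v_out @ v_in)                        # sigma(v'_w . v_{w_I})

    grad = (scores - labels)[:, None]                     # gradient of the negated objective
    W_out[targets] -= lr * grad * v_in                    # update output vectors
    W_in[w_I] -= lr * (grad * v_out).sum(axis=0)          # update the input vector

# Example usage: treat word 3 as the center word and word 7 as one observed context word.
neg_sampling_step(w_I=3, w_O=7)
```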
3. Subsampling of Frequent Words
- In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information value than the rare words.
- To counter the imbalance between the rare and frequent words, a simple subsampling approach is used: each word wi in the training set is discarded with probability computed by the formula:
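The discard probability is:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$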
- where f(wi) is the frequency of word wi and t is a chosen threshold, typically around 10^(−5). It aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies.
It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words.
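A minimal sketch of this subsampling step, assuming f(w) is the relative frequency of w in the corpus; the toy corpus and the threshold t used here are just for illustration (the paper uses t around 10^(−5) on billion-word corpora).

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Discard each occurrence of word w with probability P(w) = 1 - sqrt(t / f(w))."""
    random.seed(seed)
    counts = Counter(tokens)
    total = len(tokens)
    freq = {w: c / total for w, c in counts.items()}          # f(w_i)
    kept = []
    for w in tokens:
        p_discard = max(0.0, 1.0 - (t / freq[w]) ** 0.5)      # only words with f(w) > t are affected
        if random.random() >= p_discard:
            kept.append(w)
    return kept

# Toy example with a high threshold: "the" dominates the corpus and is mostly discarded,
# while the rarer content words are kept.
corpus = ["the"] * 1000 + ["cat", "sat", "mat"] * 10
print(Counter(subsample(corpus, t=0.01)))
```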
4. Experimental Results
4.1. NEG vs HS vs NCE
- The above table shows that Negative Sampling (NEG) outperforms the Hierarchical Softmax (HS) on the analogical reasoning task, and even performs slightly better than Noise Contrastive Estimation (NCE).
- The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.
4.2. Learning Phrase
- In Word2Vec, phrases (tokens made up of more than one word) are not handled.
- To learn vector representations for phrases, words that appear frequently together, and infrequently in other contexts, are first found. For example, “New York Times” and “Toronto Maple Leafs” are replaced by unique tokens in the training data, while a bigram such as “this is” remains unchanged.
- A simple data-driven approach is used, where phrases are formed based on the unigram and bigram counts, using:
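The bigram score is:

$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$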
- δ is used as a discounting coefficient that prevents too many phrases consisting of very infrequent words from being formed. Thus, many reasonable phrases are formed without greatly increasing the size of the vocabulary.
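A minimal sketch of one pass of this phrase-building step; the threshold value, the δ value, and joining merged tokens with “_” are assumptions for the example, not the paper's exact settings.

```python
from collections import Counter

def build_phrases(tokens, delta=5, threshold=1e-4):
    """Merge adjacent words whose bigram score exceeds a (corpus-dependent) threshold."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # score = (count(ab) - delta) / (count(a) * count(b))
            score = (bigram[(a, b)] - delta) / (unigram[a] * unigram[b])
            if score > threshold:
                out.append(a + "_" + b)   # e.g. "new york" becomes the single token "new_york"
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out

# Toy usage: "new york" co-occurs repeatedly, so it is merged into one token.
toks = "new york is big and new york is busy and the cat sat".split()
print(build_phrases(toks, delta=1, threshold=0.05))
```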
- The above table shows examples of the five categories of analogies used in this task.
- The models were trained on approximately one billion words from a news dataset.
- The results show that while Negative Sampling achieves a respectable accuracy even with k=5, using k=15 achieves considerably better performance.
- Surprisingly, while Hierarchical Softmax (HS) achieves lower performance when trained without subsampling, it becomes the best performing method when the frequent words are downsampled.
- This shows that the subsampling can result in faster training and can also improve accuracy, at least in some cases.
- When the amount of training data is increased to about 33 billion words, HS reaches an accuracy of 72%.
- HS achieves a lower accuracy of 66% when the size of the training dataset is reduced to 6B words.
4.3. Additive Compositionality
- The Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations.
- Thus, if “Volga River” appears frequently in the same sentences as the words “Russian” and “river”, the sum of these two word vectors results in a feature vector that is close to the vector of “Volga River”.
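As a quick illustration of this additive compositionality, a sketch using gensim and the publicly released GoogleNews Skip-gram vectors is shown below; the local file path is an assumption, and the printed neighbours depend on the pretrained vectors used.

```python
from gensim.models import KeyedVectors

# Assumes the pretrained GoogleNews word2vec vectors (Skip-gram + negative sampling)
# have already been downloaded; the path below is just a placeholder.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# most_similar() averages the unit-normalized vectors of the words in `positive`,
# which for cosine similarity is equivalent (up to scale) to element-wise addition,
# and then returns the nearest words in the vector space.
print(kv.most_similar(positive=["Russian", "river"], topn=5))
```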
4.4. SOTA Comparison
- The Skip-gram models achieved the best performance by a huge margin. (No figures/numbers are provided in the paper for this comparison.)
- The big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations.
- This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work.
- Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.
There is also another paper explaining Skip-gram and Negative Sampling, “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method”, which has over 1,400 citations.
Reference
[2013 NeurIPS] [Negative Sampling]
Distributed Representations of Words and Phrases and their Compositionality
Natural Language Processing
Language Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]