Review — fastText: Enriching Word Vectors with Subword Information

fastText, Using Subword-Based Bag-of-Words, Outperforms CBOW in Word2Vec

Sik-Ho Tsang
5 min read · Feb 2, 2022
fastText (https://fasttext.cc/)

Enriching Word Vectors with Subword Information
fastText, by Facebook AI Research (FAIR)
2017 TACL, Over 7000 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing (NLP)

  • Popular models that learn word representations, such as Word2Vec, assign a distinct vector to each word and ignore word morphology. This is a limitation, especially for languages with large vocabularies and many rare words.
  • fastText considers subword units: each word is represented by the sum of the vectors of its character n-grams. Models can be trained on large corpora quickly, and word representations can be computed even for words that did not appear in the training data.

Outline

  1. Subword Model
  2. Experimental Results

1. Subword Model

In the earlier Word2Vec, the Skip-gram and CBOW models are based on whole words.

Now, in fastText, a Skip-gram model is built on subwords.

1.1. Subword

  • Each word w is represented as a bag of character n-grams. Special boundary symbols < and > are added at the beginning and end of each word, which makes it possible to distinguish prefixes and suffixes from other sequences.
  • The word w itself is also included in the set of its n-grams, so that a representation is learned for each word (in addition to its character n-grams).
  • Taking the word where and n = 3 as an example, it is represented by the character n-grams: <wh, whe, her, ere, re>
  • and the special sequence: <where>
  • In practice, all n-grams with n between 3 and 6 (inclusive) are extracted, as sketched below.
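For clarity, the extraction step can be sketched in a few lines of Python. This is only a minimal illustration, not the released fastText code; the function name char_ngrams is made up for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, plus the special
    sequence for the whole word (Section 1.1)."""
    wrapped = "<" + word + ">"          # add boundary symbols
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    ngrams.add(wrapped)                 # the special sequence <word> itself
    return ngrams

# For "where" with n = 3 only, this yields:
# <wh, whe, her, ere, re> and the special sequence <where>
print(char_ngrams("where", 3, 3))
```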

1.2. Scoring Function

  • Suppose that a dictionary of n-grams of size G is given.
  • Given a word w, let us denote by Gw ⊂ {1, …,G} the set of n-grams appearing in w.
  • A vector representation zg is associated with each n-gram g. A word is represented by the sum of the vector representations of its n-grams.
  • Thus, the scoring function between a word w and a context word c (with vector vc) is obtained: s(w, c) = Σ_{g ∈ Gw} zg⊤ vc

This simple model allows representations to be shared across words, thus making it possible to learn reliable representations for rare words.
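A minimal numpy sketch of this scoring function is given below. Variable names are ours, and the negative-sampling objective used for training is omitted here.

```python
import numpy as np

def score(word_ngram_ids, z, v_c):
    """s(w, c): dot product between the context vector v_c and
    the word vector, which is the sum of its n-gram vectors z_g."""
    u_w = z[word_ngram_ids].sum(axis=0)   # word vector = sum of its n-gram vectors
    return float(u_w @ v_c)               # dot product with the context vector

# Toy example: an embedding table of 10 n-grams in 5 dimensions.
rng = np.random.default_rng(0)
z = rng.normal(size=(10, 5))     # n-gram vectors z_g
v_c = rng.normal(size=5)         # context word vector v_c
print(score([1, 4, 7], z, v_c))  # score for a word made of n-grams 1, 4, 7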

1.3. Hashing Function

  • In order to bound the memory requirements of the model, a hashing function is used to map n-grams to integers in 1 to K.
  • Character sequences are hashed using the Fowler–Noll–Vo hashing function (specifically the FNV-1a variant), with K = 2×10⁶ in the experiments below.

Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
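A small sketch of such a hashing step is shown below (32-bit FNV-1a). The exact bucket handling in the released code may differ slightly from this simplification.

```python
def fnv1a(s, K=2_000_000):
    """Map a character n-gram to one of K buckets with the 32-bit FNV-1a hash."""
    h = 2166136261                       # FNV-1a offset basis (32-bit)
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h % K                         # bucket index in [0, K)

print(fnv1a("<wh"), fnv1a("whe"), fnv1a("<where>"))
```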

1.4. Some Details

  • Wikipedia dataset is used.
  • The word vectors have dimension 300.
  • 5 negatives are sampled at random.
  • A context window of size c is used, and the size c is uniformly sampled between 1 and 5.
  • In order to subsample the most frequent words, a rejection threshold of 10^(−4) is used. When building the word dictionary, the words that appear at least 5 times in the training set are kept. These hyperparameters are gathered in the sketch after this list.
  • (Please read Word2Vec if interested.)
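As a rough idea of how these settings map onto the released fastText library, here is a sketch using its Python bindings. data.txt is a placeholder corpus path, and this is only an approximation of the paper's training setup.

```python
import fasttext  # the released fastText Python bindings

# Skip-gram with subword information, roughly matching the settings listed above:
# 300-d vectors, 5 negatives, context window up to 5, subsampling threshold 1e-4,
# minimum word count 5, character n-grams of length 3 to 6, 2M hash buckets.
model = fasttext.train_unsupervised(
    "data.txt",        # placeholder path to a plain-text training corpus
    model="skipgram",
    dim=300,
    neg=5,
    ws=5,
    t=1e-4,
    minCount=5,
    minn=3,
    maxn=6,
    bucket=2_000_000,
)
print(model.get_word_vector("where")[:5])  # subword-based vector; also works for OOV words
```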

2. Experimental Results

2.1. Human Similarity Judgement

Correlation between human judgement and similarity scores on word similarity datasets.
  • sg and cbow: Skip Gram and CBOW in Word2Vec.
  • sisg: Subword Information Skip Gram
  • sisg-: sisg with out-of-vocabulary words represented by null vectors.

sisg, which uses subword information, outperforms the baselines on all datasets except the English WS353 dataset.

  • Moreover, computing vectors for out-of-vocabulary words (sisg) is always at least as good as not doing so (sisg-).
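For reference, the evaluation protocol behind these numbers is simple: each word pair is scored by the cosine similarity of its vectors, and Spearman's rank correlation with the human judgements is reported. Below is a small sketch, assuming a vec function mapping a word to a numpy vector and a list of (word1, word2, human_score) triples.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_correlation(pairs, vec):
    """pairs: list of (word1, word2, human_score); vec: word -> np.ndarray."""
    model_scores = [cosine(vec(w1), vec(w2)) for w1, w2, _ in pairs]
    human_scores = [score for _, _, score in pairs]
    return spearmanr(model_scores, human_scores).correlation
```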

2.2. Word Analogy Tasks

Accuracy of the proposed model and baselines on word analogy tasks for Czech, German, English and Italian.
  • Word analogy questions are of the form "A is to B as C is to D", where D must be predicted by the models.

It is observed that morphological information significantly improves the syntactic tasks; sisg outperforms the baselines.
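For context, the standard way to answer such questions with word vectors is vector arithmetic: D is taken to be the vocabulary word whose vector is closest to vB − vA + vC. A sketch is given below, assuming a matrix E of L2-normalized word vectors aligned with a vocab list; the function and variable names are ours.

```python
import numpy as np

def solve_analogy(a, b, c, vocab, E):
    """Answer "a is to b as c is to ?" with the word whose (L2-normalized)
    vector is closest in cosine similarity to v_b - v_a + v_c."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    target /= np.linalg.norm(target)
    sims = E @ target                   # cosine similarities (rows of E are unit norm)
    for w in (a, b, c):                 # never return one of the query words
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]
```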

2.3. Comparison with Morphological Representations

Spearman’s rank correlation coefficient between human judgement and model scores for different methods using morphology to learn word representations.
  • The proposed simple approach performs well relative to techniques based on subword information obtained from morphological segmentors.

2.4. Effect of the Size of the Training Data

Influence of size of the training data on performance.
  • Models are trained on the first 1, 2, 5, 10, 20, and 50 percent of the Wikipedia corpus.

For all datasets, and all sizes, the proposed approach (sisg) performs better than the baseline.

2.5. Language Modeling

Test perplexity on the language modeling task, for 5 different languages.
  • The language model is a recurrent neural network (RNN) with 650 LSTM units, regularized with dropout (with probability 0.5) and weight decay (regularization parameter 10^(−5)). A batch size of 20 is used.
  • Test perplexities are reported without pre-trained word vectors (LSTM), with word vectors pre-trained without subword information (sg), and with the proposed vectors (sisg).

Initializing the lookup table of the language model with pre-trained word representations improves the test perplexity over the baseline LSTM.
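A minimal PyTorch sketch of this kind of initialization is shown below. This is not the paper's exact language model; pretrained here is a random stand-in for the sg or sisg vectors, assumed to be aligned with the LM vocabulary.

```python
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 300
pretrained = torch.randn(vocab_size, dim)        # stand-in for sg / sisg vectors

# Lookup table initialized from the pre-trained vectors, then fine-tuned.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
dropout = nn.Dropout(p=0.5)
lstm = nn.LSTM(input_size=dim, hidden_size=650, batch_first=True)

tokens = torch.randint(0, vocab_size, (20, 35))  # batch of 20 sequences of length 35
outputs, _ = lstm(dropout(embedding(tokens)))    # (20, 35, 650) hidden states
```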

2.6. Word Similarity for OOV Words

Nearest neighbors of rare words using the proposed representations and skipgram.

sisg produces more reasonable nearest neighbors.

Reference

[2017 TACL] [fastText]
Enriching Word Vectors with Subword Information

Natural Language Processing (NLP)

Language/Sequence Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] 2017 [TagLM] [CoVe] [MoE] [fastText] 2018 [GLUE] [T-DMCA] [GPT] [ELMo] 2019 [T64] [Transformer-XL] [BERT]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer] [MoE]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]

My Other Previous Paper Readings
