Review — context2vec: Learning Generic Context Embedding with Bidirectional LSTM

Using Bidirectional LSTM Instead of Averaging in Word2Vec

A 2D illustration of context2vec’s embedded space and similarity metrics. Triangles and circles denote sentential context embeddings and target word embeddings, respectively

In this story, context2vec: Learning Generic Context Embedding with Bidirectional LSTM (context2vec), by Bar-Ilan University, is briefly reviewed. In this paper:

  • A bidirectional LSTM is proposed for efficiently learning a generic context embedding function from large corpora.

This is a 2016 CoNLL paper with over 400 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. CBOW in Word2Vec
  2. Bidirectional LSTM in context2vec
  3. Experimental Results

1. CBOW in Word2Vec

CBOW in Word2Vec
  • In brief, CBOW represents the context around a target word as a simple average of the embeddings of the context words in a window around it.
  • The context window can be larger, e.g. extended to [-5, 5] to cover the 5 previous words and the 5 following words.
  • Averaging is not good enough: every context word gets the same weight and word order is ignored, whereas the weighting should depend on the context around the word.
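To make the averaging concrete, here is a minimal NumPy sketch of a CBOW-style context vector. The function name `cbow_context`, the toy vocabulary and the 2-dimensional embeddings are assumptions made for illustration only; they are not taken from Word2Vec's code.

```python
import numpy as np

def cbow_context(embeddings, token_ids, position, window=2):
    """Average the embeddings of the words within `window` of `position`,
    leaving out the target word itself."""
    lo = max(0, position - window)
    hi = min(len(token_ids), position + window + 1)
    context_ids = [token_ids[j] for j in range(lo, hi) if j != position]
    if not context_ids:
        raise ValueError("the target word has no context words")
    return embeddings[context_ids].mean(axis=0)

# Toy vocabulary and embeddings (hypothetical values).
vocab = {"john": 0, "submitted": 1, "a": 2, "paper": 3}
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]])

sentence = [vocab[w] for w in ["john", "submitted", "a", "paper"]]
shuffled = [vocab[w] for w in ["paper", "submitted", "john", "a"]]

context_vec = cbow_context(embeddings, sentence, position=1)
shuffled_vec = cbow_context(embeddings, shuffled, position=1)
```

Both calls return the same vector, because the average does not change when the context words are reordered. This is the limitation that motivates a sequence model.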

2. Bidirectional LSTM in context2vec

context2vec architecture
  • context2vec uses a bidirectional LSTM recurrent neural network: one LSTM network is fed the sentence words from left to right, and another is fed them from right to left.
  • The parameters of these two networks are completely separate, including two separate sets of left-to-right and right-to-left context word embeddings.
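  • The LSTM output vector representing the left-to-right context (“John”) is concatenated with the one representing the right-to-left context (“a paper”). With this, the relevant information in the sentential context can be captured:
    $\text{biLS}(w_{1:n}, i) = \text{lLS}(l_{1:i-1}) \oplus \text{rLS}(r_{n:i+1})$
  • Next, this concatenated vector is fed into a multi-layer perceptron (MLP), which can represent non-trivial dependencies between the two sides of the context:
    $\vec{c} = \text{MLP}(\text{biLS}(w_{1:n}, i))$
  • where MLP is a two-layer MLP, ReLU is the rectified linear unit, and $L_j(x) = W_j x + b_j$ is a fully connected linear layer:
    $\text{MLP}(x) = L_2(\text{ReLU}(L_1(x)))$
  • The network is trained with the Word2Vec Negative Sampling objective:
    $S = \sum_{t,c} \left( \log \sigma(\vec{t} \cdot \vec{c}) + \sum_{i=1}^{k} \log \sigma(-\vec{t}_i \cdot \vec{c}) \right)$
  • where the summation goes over each word token t in the training corpus and its corresponding (single) sentential context c, and σ is the sigmoid function. t1, …, tk are the negative samples, independently sampled from a smoothed version of the target words' unigram distribution:
    $p_\alpha(t) \propto (\#t)^{\alpha}$
  • where #t is the number of occurrences of t in the corpus and 0⩽α<1 is a smoothing factor, which increases the probability of rare words for Negative Sampling.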
  • Some details in the network:
context2vec hyperparameters
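The encoder and the objective described above can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: all names (`run_lstm`, `context_embedding`, `pair_objective`, the `model` dictionary), the toy layer sizes, the random initialization and the zero initial LSTM state are hypothetical, and no gradient update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
D_WORD, D_HID, D_MLP, D_OUT, VOCAB = 4, 6, 8, 5, 10  # toy sizes (assumptions)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(d_in, d_hid):
    # One stacked weight matrix for the input, forget, output and candidate gates.
    return {"W": rng.normal(0.0, 0.1, (4 * d_hid, d_in + d_hid)),
            "b": np.zeros(4 * d_hid)}

def run_lstm(params, inputs):
    """Feed the input vectors in order and return the final hidden state."""
    d_hid = params["b"].size // 4
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    for x in inputs:
        i, f, o, g = np.split(params["W"] @ np.concatenate([x, h]) + params["b"], 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

model = {
    # Separate word embeddings and LSTM parameters for each direction.
    "E_left": rng.normal(0.0, 0.1, (VOCAB, D_WORD)),
    "E_right": rng.normal(0.0, 0.1, (VOCAB, D_WORD)),
    "lstm_left": init_lstm(D_WORD, D_HID),
    "lstm_right": init_lstm(D_WORD, D_HID),
    "W1": rng.normal(0.0, 0.1, (D_MLP, 2 * D_HID)), "b1": np.zeros(D_MLP),
    "W2": rng.normal(0.0, 0.1, (D_OUT, D_MLP)), "b2": np.zeros(D_OUT),
    "targets": rng.normal(0.0, 0.1, (VOCAB, D_OUT)),  # target word embeddings
}

def context_embedding(model, token_ids, i):
    """Embed the sentential context of position i; the target word is left out."""
    left = [model["E_left"][t] for t in token_ids[:i]]              # left to right
    right = [model["E_right"][t] for t in token_ids[i + 1:][::-1]]  # right to left
    both = np.concatenate([run_lstm(model["lstm_left"], left),
                           run_lstm(model["lstm_right"], right)])
    hidden = np.maximum(0.0, model["W1"] @ both + model["b1"])      # ReLU layer
    return model["W2"] @ hidden + model["b2"]

def smoothed_unigram(counts, alpha=0.75):
    """Unigram counts raised to the power alpha, then renormalized."""
    p = np.asarray(counts, dtype=float) ** alpha
    return p / p.sum()

def pair_objective(model, token_ids, i, counts, k=3, alpha=0.75):
    """Negative-sampling term for one (target, context) pair; training would
    maximize the sum of these terms over the corpus."""
    c = context_embedding(model, token_ids, i)
    negatives = rng.choice(len(counts), size=k, p=smoothed_unigram(counts, alpha))
    positive = np.log(sigmoid(model["targets"][token_ids[i]] @ c))
    negative = np.log(sigmoid(-(model["targets"][negatives] @ c))).sum()
    return positive + negative

sentence = [3, 7, 1, 5]                 # toy token ids
counts = np.arange(1, VOCAB + 1) * 10   # toy corpus counts
c_vec = context_embedding(model, sentence, 1)
score = pair_objective(model, sentence, 1, counts)
```

With α < 1 the sampling distribution is flatter than the raw unigram distribution, so rare words are drawn as negatives more often. For simplicity, this sketch does not prevent the target word itself from being drawn as a negative sample.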

3. Experimental Results

3.1. MSCC Corpus Development Set

Development set results (iters+ denotes the best model found when running more training iterations with α = 0.75)
  • context2vec outperforms AWE (averaged word embeddings), a commonly used context representation.
  • When the proposed models are trained for more iterations, with 3 iterations over the ukWaC corpus and 10 iterations over the MSCC corpus, some further improvement is observed.
Test set results (c2v is context2vec)
  • As shown above, context2vec substantially outperforms AWE across all benchmarks.
  • S-1/S-2 stand for the best/second-best prior result reported for the benchmark. context2vec either surpasses or almost reaches the state-of-the-art on all benchmarks.

3.2. Others

Top-5 closest target words to a few given target words
Closest target words to various sentential contexts
  • The above table illustrates context2vec’s sensitivity to long-range dependencies and to both sides of the target word.
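Because sentential contexts and target words are embedded in the same space, a "closest target words" query reduces to a nearest-neighbor search. Below is a minimal sketch, assuming cosine similarity as the metric; the function name `closest_targets`, the word list and the 2-dimensional vectors are made up for illustration.

```python
import numpy as np

def closest_targets(context_vec, target_matrix, words, k=5):
    """Return the k target words whose embeddings have the highest cosine
    similarity to the given context embedding."""
    targets = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    context = context_vec / np.linalg.norm(context_vec)
    scores = targets @ context
    order = np.argsort(-scores)[:k]
    return [(words[j], float(scores[j])) for j in order]

# Toy target embeddings (hypothetical values).
words = ["paper", "article", "banana"]
target_matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
context_vec = np.array([1.0, 0.0])

top2 = closest_targets(context_vec, target_matrix, words, k=2)
```

The same function could be applied with a target word embedding as the query to get a table of closest target words to a given target word.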

Reference

[2016 CoNLL] [context2vec]
context2vec: Learning Generic Context Embedding with Bidirectional LSTM

Natural Language Processing (NLP)

Language Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]

PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG
