Review — context2vec: Learning Generic Context Embedding with Bidirectional LSTM

Using Bidirectional LSTM Instead of Averaging in CBOW

Sik-Ho Tsang
4 min read · Dec 4, 2021


A 2D illustration of context2vec’s embedded space and similarity metrics. Triangles and circles denote sentential context embeddings and target word embeddings, respectively

In this story, context2vec: Learning Generic Context Embedding with Bidirectional LSTM (context2vec), by Bar-Ilan University, is briefly reviewed. In this paper:

  • A bidirectional LSTM is proposed for efficiently learning a generic context embedding function from large corpora.

This is a paper in 2016 CoNLL with over 400 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. CBOW in word2vec
  2. Bidirectional LSTM in context2vec
  3. Experimental Results

1. CBOW in word2vec

CBOW in word2vec
  • In brief, CBOW represents the context around a target word as a simple average of the embeddings of the context words in a window around it.
  • The context window can be larger, e.g., extended to [-5, 5] to include the 5 previous and 5 future words.
  • However, simple averaging is not expressive enough; the weighting should depend on the context around the word (a minimal sketch of CBOW-style averaging is shown right after this list).
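As a concrete illustration of the averaging, here is a minimal sketch of a CBOW-style context representation. The function name, the dict of pre-trained vectors, and the default window size are illustrative assumptions, not code from word2vec or the paper:

```python
import numpy as np

def cbow_context(word_vecs, sentence, pos, window=5):
    """Average the embeddings of up to `window` words on each side of position `pos`."""
    lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
    context = [sentence[i] for i in range(lo, hi) if i != pos]
    # Every context word contributes equally, regardless of its position or relevance.
    return np.mean([word_vecs[w] for w in context], axis=0)

# e.g., with 300-dimensional vectors (hypothetical toy data):
# word_vecs = {w: np.random.rand(300) for w in ["John", "submitted", "a", "paper"]}
# cbow_context(word_vecs, ["John", "submitted", "a", "paper"], pos=1)
```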

2. Bidirectional LSTM in context2vec

context2vec architecture
  • context2vec uses a bidirectional LSTM recurrent neural network, feeding one LSTM network with the sentence words from left to right, and another from right to left.
  • The parameters of these two networks are completely separate, including two separate sets of left-to-right and right-to-left context word embeddings.
  • The LSTM output vector representing the left-to-right context (“John”) is concatenated with the one representing the right-to-left context (“a paper”). With this, the relevant information on both sides of the target word in the sentential context can be captured.
  • Next, this concatenated vector is fed into a multi-layer perceptron (MLP), a two-layer network, so that non-trivial dependencies between the two sides of the context can be represented.
  • The model is trained with a word2vec-style negative sampling objective, where the summation goes over each word token t in the training corpus and its corresponding (single) sentential context c, and σ is the sigmoid function. t1, …, tk are the negative samples, independently sampled from a smoothed version of the target words’ unigram distribution.
  • Here, 0 ⩽ α < 1 is a smoothing factor, which increases the probability of sampling rare words. (A sketch of the encoder and this objective is given after the hyperparameter table below.)
  • Some details of the network:
context2vec hyperparameters
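Written out, the word2vec-style negative sampling objective described above takes roughly the following form. This is a hedged reconstruction based on the standard word2vec objective; the notation (t for the target embedding, c for the sentential context embedding) follows the bullets above rather than the paper's exact typesetting:

```latex
S = \sum_{t,c} \Big[ \log \sigma(\vec{t} \cdot \vec{c})
    + \sum_{i=1}^{k} \log \sigma(-\vec{t_i} \cdot \vec{c}) \Big],
\qquad
p_\alpha(t) \propto p_{\text{unigram}}(t)^{\alpha}, \quad 0 \leqslant \alpha < 1
```

To make the architecture concrete, below is a minimal PyTorch-style sketch of a context2vec-like encoder with this loss. The class name, the dimensions, the ReLU non-linearity, and the handling of sentence boundaries are assumptions for illustration, not the authors' implementation; negatives are assumed to be pre-sampled from the smoothed unigram distribution above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Context2VecSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=600, ctx_dim=600):
        super().__init__()
        # Two completely separate sets of context word embeddings.
        self.emb_l2r = nn.Embedding(vocab_size, emb_dim)
        self.emb_r2l = nn.Embedding(vocab_size, emb_dim)
        # Two independent LSTMs: one reads left-to-right, the other right-to-left.
        self.lstm_l2r = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.lstm_r2l = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        # Two-layer MLP mixing the two sides of the context.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hid_dim, ctx_dim), nn.ReLU(),
            nn.Linear(ctx_dim, ctx_dim),
        )
        # Target word embeddings live in the same space as the context vectors.
        self.target_emb = nn.Embedding(vocab_size, ctx_dim)

    def context_vector(self, sent, pos):
        # sent: (1, seq_len) word ids; pos: index of the target word.
        l2r, _ = self.lstm_l2r(self.emb_l2r(sent))                 # left-to-right pass
        r2l, _ = self.lstm_r2l(self.emb_r2l(sent.flip(dims=[1])))  # right-to-left pass
        r2l = r2l.flip(dims=[1])                                   # re-align to original word order
        left = l2r[:, pos - 1] if pos > 0 else torch.zeros_like(l2r[:, 0])
        right = r2l[:, pos + 1] if pos + 1 < sent.size(1) else torch.zeros_like(r2l[:, 0])
        # Concatenate both sides and mix them with the MLP.
        return self.mlp(torch.cat([left, right], dim=-1))

    def loss(self, sent, pos, target, negatives):
        # Negative sampling: pull the true target towards the context vector,
        # push the k sampled targets away from it.
        c = self.context_vector(sent, pos)   # (1, ctx_dim)
        t = self.target_emb(target)          # (1, ctx_dim)
        neg = self.target_emb(negatives)     # (1, k, ctx_dim)
        pos_term = F.logsigmoid((t * c).sum(-1))
        neg_term = F.logsigmoid(-(neg * c.unsqueeze(1)).sum(-1)).sum(-1)
        return -(pos_term + neg_term).mean()
```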

3. Experimental Results

3.1. MSCC Corpus Development Set

Development set results (iters+ denotes the best model found when running more training iterations with α = 0.75)
  • context2vec outperforms AWE (averaged word embeddings), the baseline context representation.
  • Training the proposed models with more iterations, it is found that with 3 iterations over the ukWaC corpus and 10 iterations over the MSCC corpus, some further improvement can be observed.
Test set results (c2v is context2vec)
  • As shown above, context2vec substantially outperforms AWE across all benchmarks.
  • S-1/S-2 stand for the best/second-best prior result reported for the benchmark. context2vec either surpasses or almost reaches the state-of-the-art on all benchmarks.

3.2. Others

Top-5 closest target words to a few given target words
Closest target words to various sentential contexts
  • The above tables illustrate context2vec’s sensitivity to long-range dependencies and to both sides of the target word.

Reference

[2016 CoNLL] [context2vec] context2vec: Learning Generic Context Embedding with Bidirectional LSTM

