Review: Skip-Thought Vectors

Learnt Generic Skip-Thought Vectors for 8 Downstream Tasks

9 min read · Nov 13, 2021

**Skip-thought Model Overview** (Figure from Emotion Detection from Text Using Skip-thought Vectors)

In this story, Skip-Thought Vectors, by the University of Toronto, the Canadian Institute for Advanced Research, and the Massachusetts Institute of Technology, is reviewed. The name, Skip-Thought, was suggested by Prof. Hinton. The paper considers one question:

Is there a task and a corresponding loss that will allow us to learn highly generic sentence representations?

In this paper:

A generic sentence embedding/representation is learnt by predicting the previous and the next sentences using the current sentence.
A simple vocabulary expansion method is proposed to encode words that were not seen during training, allowing the vocabulary to be expanded to a million words.
After training, the vectors are extracted and evaluated with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets.

This is a paper in 2015 NeurIPS with over 2300 citations. (Sik-Ho Tsang @ Medium) It is a kind of self-supervised learning: the pretext task is to predict the previous and the next sentences from the current one. Each downstream task is then evaluated by training a linear classifier on top of the trained encoder, which is kept fixed without fine-tuning.

Outline

Skip-Thought Model
Encoder, Decoder, & Objective Function
Vocabulary Expansion
Evaluation & Training Details
Experimental Results

1. Skip-Thought Model

1.1. Goal

Given a tuple ($s_{i-1}$, $s_i$, $s_{i+1}$) of contiguous sentences, with $s_i$ the i-th sentence of a book, the sentence $s_i$ is encoded and used to reconstruct the previous sentence $s_{i-1}$ and the next sentence $s_{i+1}$.
In the above example, the input is the sentence triplet (I got back home. I could see the cat on the steps. This was strange.)
Unattached arrows are connected to the encoder output. Colors indicate which components share parameters. <eos> is the end of sentence token.
Let $w^t_i$ denote the t-th word of sentence $s_i$ and let $x^t_i$ denote its word embedding.

1.2. Encoder-Decoder Model Framework

The skip-thought model follows the encoder-decoder framework and is described in three parts: the encoder, the decoder, and the objective function.
That is, an encoder maps words to a sentence vector and a decoder is used to generate the surrounding sentences.
The encoder is an RNN with GRU activations.
The decoder is an RNN with a conditional GRU.
This model combination is nearly identical to the RNN Encoder-Decoder used in neural machine translation.

2. Encoder, Decoder, & Objective Function

2.1. Encoder

The encoder is an RNN with GRU activations.
Let $w^1_i, …, w^N_i$ be the words in sentence $s_i$, where N is the number of words in the sentence. At each time step, the encoder produces a hidden state $h^t_i$, which can be interpreted as the representation of the sequence $w^1_i, …, w^t_i$.
The hidden state $h^N_i$ thus represents the full sentence.
To encode a sentence, we iterate the following sequence of equations (dropping the subscript i):

where $\bar{h}^t$ is the proposed state update at time t, $z^t$ is the update gate, $r^t$ is the reset gate, and ⊙ denotes a component-wise product. Both gates take values between zero and one.
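
The recurrence described above is the standard GRU update. A sketch in the notation of this section follows, with the subscript $i$ dropped. The weight-matrix names $W_r, U_r, W_z, U_z, W, U$ and the logistic sigmoid $\sigma$ are labels assumed here for illustration, and bias terms are omitted.

$$
\begin{aligned}
r^t &= \sigma(W_r x^t + U_r h^{t-1}) \\
z^t &= \sigma(W_z x^t + U_z h^{t-1}) \\
\bar{h}^t &= \tanh(W x^t + U (r^t \odot h^{t-1})) \\
h^t &= (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t
\end{aligned}
$$

When $z^t$ is close to zero, the previous state is carried over almost unchanged. When it is close to one, the state is replaced by the proposed update.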

2.2. Decoder

The decoder is an RNN with a conditional GRU.
It is a neural language model that conditions on the encoder output $h_i$.
The computation is similar to that of the encoder.
One decoder is used for the next sentence $s_{i+1}$ while a second decoder is used for the previous sentence $s_{i-1}$.
Separate parameters are used for each decoder with the exception of the vocabulary matrix V.
For example, using the decoder for the next sentence $s_{i+1}$, let $h^t_{i+1}$ denote the hidden state of the decoder at time t. Decoding involves iterating through the following sequence of equations (dropping the subscript i+1):

Given $h^t_{i+1}$, the probability of word $w^t_{i+1}$ given the previous t-1 words and the encoder vector is:

where $v_{w^t_{i+1}}$ denotes the row of V corresponding to the word $w^t_{i+1}$.
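
The decoder recurrence mirrors the encoder's, with extra terms that inject the encoder output $h_i$ at every step. A sketch for the next-sentence decoder follows (subscript $i+1$ dropped on the gates). The matrix names $W^d, U^d, C$ and their gate-specific variants are labels assumed here for illustration, $x^{t-1}$ is the embedding of the previous word, and bias terms are omitted.

$$
\begin{aligned}
r^t &= \sigma(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i) \\
z^t &= \sigma(W_z^d x^{t-1} + U_z^d h^{t-1} + C_z h_i) \\
\bar{h}^t &= \tanh(W^d x^{t-1} + U^d (r^t \odot h^{t-1}) + C h_i) \\
h^t_{i+1} &= (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \\
P(w^t_{i+1} \mid w^{<t}_{i+1}, h_i) &\propto \exp(v_{w^t_{i+1}} h^t_{i+1})
\end{aligned}
$$

The previous-sentence decoder has the same form with its own parameters, sharing only $V$.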

2.3. Objective Function

Given a tuple ($s_{i-1}$, $s_i$, $s_{i+1}$), the objective optimized is the sum of the log-probabilities for the forward and backward sentences conditioned on the encoder representation:

One term covers the previous sentence $s_{i-1}$ and the other covers the next sentence $s_{i+1}$.
The total objective is the above summed over all such training tuples.
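
Written out, with the previous-sentence term placed first, the per-tuple objective described above can be sketched as follows. Here $w^{<t}$ denotes the words before position $t$ in the corresponding sentence.

$$
\sum_t \log P(w^t_{i-1} \mid w^{<t}_{i-1}, h_i) \; + \; \sum_t \log P(w^t_{i+1} \mid w^{<t}_{i+1}, h_i)
$$

Both sums condition on the same encoder vector $h_i$, so a single sentence representation has to carry enough information to predict both neighbours.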

3. Vocabulary Expansion

To expand the encoder's vocabulary to words it has not seen during training, a pre-trained Word2Vec model that induces word representations is used.
Let $V_{w2v}$ denote the Word2Vec word embedding space, and let $V_{rnn}$ denote the RNN word embedding space.
It is assumed that the vocabulary of $V_{w2v}$ is much larger than that of $V_{rnn}$. The goal is to construct a mapping f: $V_{w2v}$ → $V_{rnn}$ parameterized by a matrix W such that v' = Wv for v ∈ $V_{w2v}$ and v' ∈ $V_{rnn}$.
Inspired by earlier work that learned linear mappings between translation word spaces, W is obtained by solving an un-regularized L2 linear regression problem.
Thus, any word from $V_{w2v}$ can now be mapped into $V_{rnn}$ for encoding sentences.
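
The mapping step above can be sketched with an ordinary least-squares solve. This is a minimal illustration rather than the authors' code: it assumes W is fitted on words present in both vocabularies, and the function names, toy dimensions, and the example word are hypothetical.

```python
import numpy as np

def fit_vocab_mapping(w2v_vecs, rnn_vecs):
    """Un-regularized least squares: find W minimizing ||w2v_vecs @ W.T - rnn_vecs||^2.

    w2v_vecs: (n_shared, d_w2v) Word2Vec vectors of words in both vocabularies.
    rnn_vecs: (n_shared, d_rnn) RNN embeddings of the same words, same order.
    Returns W with shape (d_rnn, d_w2v), so that v_rnn ~= W @ v_w2v.
    """
    W_T, *_ = np.linalg.lstsq(w2v_vecs, rnn_vecs, rcond=None)
    return W_T.T

def expand_vocab(W, w2v_table):
    """Map every Word2Vec vector in a {word: vector} table into the RNN space."""
    return {word: W @ vec for word, vec in w2v_table.items()}

# Toy data (assumed sizes): 50 shared words, 5-d Word2Vec space, 3-d RNN space.
rng = np.random.default_rng(0)
d_w2v, d_rnn, n_shared = 5, 3, 50
true_W = rng.normal(size=(d_rnn, d_w2v))
shared_w2v = rng.normal(size=(n_shared, d_w2v))
shared_rnn = shared_w2v @ true_W.T  # noiseless targets for the toy example

W = fit_vocab_mapping(shared_w2v, shared_rnn)

# A word the RNN never saw, but Word2Vec has a vector for.
unseen = {"zeppelin": rng.normal(size=d_w2v)}
mapped = expand_vocab(W, unseen)
```

With real embeddings the fit is not exact, so the mapped vectors are only approximations of what the RNN embedding would have been.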

4. Evaluation & Training Details

4.1. Evaluation

The capability of the encoder as a generic feature extractor is evaluated after training on the BookCorpus dataset.
Using the learned encoder as a feature extractor, extract skip-thought vectors for all sentences. If the task involves computing scores between pairs of sentences, compute component-wise features between pairs.
Train a linear classifier on top of the extracted features, with no additional fine-tuning or backpropagation through the skip-thoughts model.
Thus, the representation quality of the computed vectors can be directly evaluated.
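
For tasks that score sentence pairs, the component-wise features mentioned above can be sketched as below. The choice of the component-wise product and the absolute difference follows the features used for the relatedness and paraphrase tasks; the function name and toy vectors are assumptions for illustration.

```python
import numpy as np

def pair_features(u, v):
    """Combine two batches of sentence vectors into pairwise features.

    u, v: arrays of shape (n_pairs, dim) holding fixed skip-thought vectors.
    Returns shape (n_pairs, 2 * dim): [u * v, |u - v|] concatenated.
    """
    return np.concatenate([u * v, np.abs(u - v)], axis=-1)

# Toy 3-d "sentence vectors" standing in for real encoder outputs.
u = np.array([[1.0, -2.0, 0.5]])
v = np.array([[3.0, 1.0, 0.5]])
feats = pair_features(u, v)
```

The resulting features are then fed to a linear classifier, and no gradient flows back into the encoder.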

4.2. Models

Two separate models, uni-skip and bi-skip, are trained on the BookCorpus dataset, and a third, combine-skip, is formed by concatenating their outputs.

uni-skip: A unidirectional encoder with 2400 dimensions.
bi-skip: A bidirectional model with forward and backward encoders of 1200 dimensions each. The outputs are then concatenated to form a 2400 dimensional vector.
combine-skip: A combined model, consisting of the concatenation of the vectors from uni-skip and bi-skip, resulting in a 4800 dimensional vector.

After the models are trained, vocabulary expansion is employed to map word embeddings into the RNN encoder space.
The skip-thought models are trained with a vocabulary size of 20,000 words. After removing multiple word examples from the Word2Vec CBOW model, this results in a vocabulary size of 930,911 words.
Thus even though the skip-thoughts model was trained with only 20,000 words, after vocabulary expansion we can now successfully encode 930,911 possible words.

5. Experimental Results

5.1. Semantic Relatedness

The experiment is on the SemEval 2014 Task 1: semantic relatedness SICK dataset [30]. Given two sentences, the goal is to produce a score of how semantically related these sentences are, based on human-generated scores between 1 and 5 (with 5 being highly related).
The dataset comes with a predefined split of 4500 training pairs, 500 development pairs and 4927 testing pairs.

**Test set results on the SICK semantic relatedness subtask**

The evaluation metrics are Pearson’s r, Spearman’s ρ, and mean squared error MSE.
Though the dependency tree-LSTM obtains the best result, it relies on parsers whose training data is very expensive to collect and does not exist for all languages.

Although the vectors are learnt with the self-supervised skip-thoughts model and only a linear classifier is placed on top of the trained encoder, the proposed uni-skip, bi-skip, and combine-skip already give very good results.

Further, using features learned from an image-sentence embedding model on COCO gives an additional performance boost, resulting in a model that performs on par with the dependency tree-LSTM.

**Example predictions from the SICK test set**

The proposed model is able to accurately predict relatedness on many challenging cases.
On some examples, it fails to pick up on small distinctions that drastically change a sentence meaning, such as tricks on a motorcycle versus tricking a person on a motorcycle.

5.2. Paraphrase Detection

On this task, two sentences are given and one must predict whether or not they are paraphrases, on the Microsoft Research Paraphrase Corpus.
The training set consists of 4076 sentence pairs (2753 of which are positive) and the test set has 1725 pairs (1147 of which are positive).

**Test set results on the Microsoft Paraphrase Corpus.**

Skip-thoughts alone outperform recursive nets with dynamic pooling when no hand-crafted features are used.
When other features are used, recursive nets with dynamic pooling work better.

When skip-thoughts are combined with some basic pairwise statistics, they become competitive with the state-of-the-art methods, which incorporate much more complicated features and hand-engineering.

5.3. Image-Sentence Ranking

The Microsoft COCO dataset, a dataset of images with high-quality sentence descriptions, is used. Each image is annotated with 5 captions.
For image annotation, an image is presented and sentences are ranked based on how well they describe the query image.
The image search task is the reverse: given a caption, we retrieve images that are a good fit to the query.
The training set comes with over 80,000 images each with 5 captions. The development and test sets each contain 1000 images and 5000 captions.
Evaluation is performed using Recall@K, namely the fraction of images for which the correct caption is ranked within the top-K retrieved results, and likewise for caption queries in image search.
Images are represented using 4096-dimensional VGGNet features from their 19-layer model.
For sentences, skip-thought vectors are extracted for each caption.
The training objective is a pairwise ranking loss that has been previously used by many other methods. The only difference is the scores are computed using only linear transformations of image and sentence inputs.

**COCO test-set results for image-sentence retrieval experiments**

The proposed model's performance is on par with both [32] and [33] except for R@1 on image annotation, where other methods perform much better.

The proposed model’s results indicate that skip-thought vectors are representative enough to capture image descriptions without having to learn their representations from scratch.

5.4. Classification Benchmark

5 datasets are used: movie review sentiment (MR), customer product reviews (CR), subjectivity/objectivity classification (SUBJ), opinion polarity (MPQA) and question-type classification (TREC).
On all datasets, skip-thought vectors are simply extracted and a logistic regression classifier is trained on top. 10-fold cross-validation is used for evaluation on the first 4 datasets, while TREC has a pre-defined train/test split.
The L2 penalty is tuned using cross-validation (and thus use a nested cross-validation for the first 4 datasets).
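
The nested cross-validation described above can be sketched as two loops: an outer loop that measures accuracy and an inner loop that picks the L2 penalty using training data only. The callback `fit_and_score`, the inner fold count, and the toy scoring function are hypothetical placeholders; a real run would train the logistic regression classifier inside the callback.

```python
import numpy as np

def nested_cv(n, penalties, fit_and_score, outer_k=10, inner_k=5, seed=0):
    """Nested k-fold cross-validation over n examples.

    fit_and_score(train_idx, test_idx, penalty) -> accuracy (user-supplied).
    Returns the mean outer-fold accuracy using the penalty chosen per fold.
    """
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(n), outer_k)
    outer_scores = []
    for i, test_idx in enumerate(outer_folds):
        train_idx = np.concatenate(
            [fold for j, fold in enumerate(outer_folds) if j != i]
        )
        inner_folds = np.array_split(train_idx, inner_k)

        def inner_score(penalty):
            # Mean validation accuracy for this penalty, never touching test_idx.
            vals = []
            for a, val_idx in enumerate(inner_folds):
                fit_idx = np.concatenate(
                    [fold for b, fold in enumerate(inner_folds) if b != a]
                )
                vals.append(fit_and_score(fit_idx, val_idx, penalty))
            return np.mean(vals)

        best_penalty = max(penalties, key=inner_score)
        outer_scores.append(fit_and_score(train_idx, test_idx, best_penalty))
    return float(np.mean(outer_scores))

def toy_fit_and_score(train_idx, test_idx, penalty):
    """Stand-in scorer: peaks at penalty 1.0 and checks the splits are disjoint."""
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    return 1.0 / (1.0 + abs(penalty - 1.0))

cv_score = nested_cv(100, [0.01, 0.1, 1.0, 10.0], toy_fit_and_score)
```

Because the penalty is selected inside each outer fold, the reported accuracy is not biased by tuning on the held-out data.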

**Classification accuracies on several standard benchmarks**

On most tasks, skip-thoughts performs about as well as the bag-of-words baselines but fails to improve over methods whose sentence representations are learned directly for the task at hand.

The skip-thoughts-NB (Naïve Bayes) combination is effective, particularly on MR. This results in a very strong new baseline for text classification.

5.5. Visualizing Skip-Thoughts

**t-SNE embeddings** **of skip-thought vectors on different datasets.**

Even without the use of relatedness labels, skip-thought vectors learn to accurately capture this property.

Reference

[2015 NeurIPS] [Skip-Thought]
Skip-Thought Vectors