Brief Review — Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
InferSent Sentence Embedding
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
InferSent, by Facebook AI Research and Université Le Mans
2017 EMNLP, Over 2500 Citations (Sik-Ho Tsang @ Medium)
Sentence Embedding / Dense Text Retrieval
2019 [Sentence-BERT (SBERT)] 2020 [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)]
==== My Other Paper Readings Are Also Over Here ====
- In this paper, the authors investigate which encoder architecture and training setup yield universal sentence representations when trained on the supervised data of the Stanford Natural Language Inference (SNLI) dataset, and how well these representations transfer to a wide range of tasks.
- This is much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks.
Outline
- InferSent Models & Pretraining
- Transferring & Results
1. InferSent
1.1. The Natural Language Inference (NLI) Task
- The SNLI dataset consists of 570k human-generated English sentence pairs, manually labeled with 1 of 3 categories: entailment, contradiction and neutral.
The semantic nature of NLI makes it a good candidate for learning universal sentence embeddings in a supervised way.
- Models can be trained on SNLI in two different ways: (i) sentence encoding-based models that explicitly separate the encoding of the individual sentences and (ii) joint methods that use the encodings of both sentences together (e.g., cross-features or attention from one sentence to the other). The first setting is adopted here.
- As shown above, 3 matching methods are applied to extract relations between u and v: (i) concatenation of the two representations (u, v); (ii) element-wise product u✱v; and (iii) absolute element-wise difference |u−v|.
- The resulting vector, which captures information from both the premise and the hypothesis, is fed into a 3-class classifier consisting of multiple fully-connected layers culminating in a softmax layer.
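For concreteness, here is a minimal PyTorch sketch of this matching-and-classification step; the layer sizes, names, and the single hidden layer are illustrative assumptions rather than the released InferSent code:

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    """Combine premise/hypothesis embeddings u, v and classify into 3 NLI labels."""
    def __init__(self, sent_dim=4096, hidden_dim=512, n_classes=3):
        super().__init__()
        # Input is the concatenation [u, v, |u - v|, u * v]
        self.mlp = nn.Sequential(
            nn.Linear(4 * sent_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_classes),  # logits for entailment/contradiction/neutral
        )

    def forward(self, u, v):
        # u, v: (batch, sent_dim) sentence embeddings of premise and hypothesis
        features = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.mlp(features)  # softmax / cross-entropy is applied outside
```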
1.2. Models
- 7 different architectures are evaluated: standard recurrent encoders with either Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU), the concatenation of the last hidden states of a forward and a backward GRU, Bi-directional LSTMs (BiLSTM) with either mean or max pooling, a self-attentive network, and a hierarchical convolutional network.
- LSTM and GRU: A sentence is represented by the last hidden vector, hT.
- BiGRU-last: concatenates the last hidden state of a forward GRU and the last hidden state of a backward GRU, so as to have the same architecture as SkipThought vectors.
- BiLSTM with mean/max pooling: Each word is represented by the concatenation of the hidden states of a forward LSTM and a backward LSTM that read the sentence in the two opposite directions.
- Two ways of combining the varying number of hidden states {ht} into a fixed-size vector are considered: either selecting the maximum value over each dimension of the hidden units (max pooling) or averaging the representations (mean pooling).
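Below is a minimal PyTorch sketch of the BiLSTM encoder with max pooling (the BiLSTM-Max variant); the hidden size and variable names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """BiLSTM sentence encoder with max pooling over time (illustrative sketch)."""
    def __init__(self, word_dim=300, hidden_dim=2048):
        super().__init__()
        # Bidirectional LSTM: output at each step is [forward h_t ; backward h_t]
        self.lstm = nn.LSTM(word_dim, hidden_dim, num_layers=1,
                            bidirectional=True, batch_first=True)

    def forward(self, word_embeddings):
        # word_embeddings: (batch, seq_len, word_dim), e.g. fixed GloVe vectors
        hidden_states, _ = self.lstm(word_embeddings)   # (batch, seq_len, 2*hidden_dim)
        # Max pooling over the time dimension -> fixed-size sentence vector u
        u, _ = hidden_states.max(dim=1)                 # (batch, 2*hidden_dim), e.g. 4096-d
        return u
```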
- Self-attentive network: The self-attentive sentence encoder uses an attention mechanism over the hidden states of a BiLSTM to generate a representation u of an input sentence: attention weights over the hidden states are computed from a learned context vector, and u is the attention-weighted sum of the hidden states.
- The final model is a self-attentive network with multiple views of the input sentence, so that the model can learn which parts of the sentence are important for the given task.
- Concretely, 4 context vectors u_w^1, u_w^2, u_w^3, u_w^4 are used, each generating one view of the sentence; the 4 views are then concatenated to obtain the sentence representation u.
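A hedged sketch of one attention view in PyTorch; the attention dimension, parameter names, and the softmax-weighted-sum formulation are assumptions consistent with standard self-attentive sentence encoders, not the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Attention pooling over BiLSTM hidden states (illustrative sketch, one view)."""
    def __init__(self, hidden_dim=4096, attn_dim=512):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)          # key for each hidden state
        self.context = nn.Parameter(torch.randn(attn_dim))   # learned context vector u_w

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_dim) from a BiLSTM
        keys = torch.tanh(self.proj(hidden_states))          # (batch, seq_len, attn_dim)
        scores = keys.matmul(self.context)                   # (batch, seq_len)
        alphas = F.softmax(scores, dim=1).unsqueeze(-1)      # attention weights
        # Sentence vector = attention-weighted sum of hidden states
        return (alphas * hidden_states).sum(dim=1)           # (batch, hidden_dim)
```

For the multi-view variant, 4 such modules with separate context vectors would be run in parallel and their outputs concatenated.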
- Hierarchical ConvNet: AdaSent (Zhao et al., 2015) concatenates different representations of the sentence at different levels of abstraction. Inspired by this architecture, a faster version consisting of 4 convolutional layers is introduced.
- At every layer, a representation ui is computed by a max-pooling operation over the feature maps (see Figure 4). The final representation u = [u1, u2, u3, u4] concatenates representations at different levels of the input sentence. The model thus captures hierarchical abstractions of an input sentence in a fixed-size representation.
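A rough PyTorch sketch of the hierarchical ConvNet idea; the channel counts, kernel size, and padding are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HierarchicalConvNet(nn.Module):
    """4 stacked conv layers; max-pool each layer's feature maps and concatenate
    (illustrative sketch of the hierarchical ConvNet encoder)."""
    def __init__(self, word_dim=300, channels=512, kernel_size=3):
        super().__init__()
        dims = [word_dim] + [channels] * 4
        self.convs = nn.ModuleList([
            nn.Conv1d(dims[i], dims[i + 1], kernel_size, padding=kernel_size // 2)
            for i in range(4)
        ])

    def forward(self, word_embeddings):
        # word_embeddings: (batch, seq_len, word_dim)
        x = word_embeddings.transpose(1, 2)          # Conv1d expects (batch, dim, seq_len)
        pooled = []
        for conv in self.convs:
            x = torch.relu(conv(x))                  # feature maps at this level
            pooled.append(x.max(dim=2).values)       # max pooling over time -> u_i
        return torch.cat(pooled, dim=1)              # u = [u1, u2, u3, u4]
```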
1.3. Pretraining
- For the classifier, a multi-layer perceptron with 1 hidden layer of 512 hidden units is used.
- GloVe vectors trained on Common Crawl 840B with 300 dimensions are used as fixed word embeddings.
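As a sketch of how fixed GloVe embeddings can be wired in (the file name, vocabulary handling, and helper name are assumptions, not the authors' preprocessing code):

```python
import numpy as np
import torch
import torch.nn as nn

def build_frozen_glove_embedding(vocab, glove_path="glove.840B.300d.txt", dim=300):
    """Load GloVe vectors for a vocabulary (dict: word -> row index) and return a
    frozen embedding layer (illustrative sketch)."""
    vectors = np.zeros((len(vocab), dim), dtype=np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                vectors[vocab[word]] = np.asarray(values, dtype=np.float32)
    # freeze=True keeps the word embeddings fixed during NLI training
    return nn.Embedding.from_pretrained(torch.from_numpy(vectors), freeze=True)
```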
2. Transferring & Results
2.1. 12 Downstream Tasks
- Binary and multi-class classification: including sentiment analysis (MR, SST), question type (TREC), product reviews (CR), subjectivity/objectivity (SUBJ) and opinion polarity (MPQA). A logistic regression is trained on top of the frozen sentence embeddings (a sketch of this protocol follows the task list).
- Entailment and semantic relatedness: SICK dataset for both entailment (SICK-E) and semantic relatedness (SICK-R). A logistic regression is trained on top.
- STS14 — Semantic Textual Similarity: The cosine similarity between the two sentence embeddings is correlated with the human-labeled similarity score, measured with Pearson and Spearman correlations.
- Paraphrase Detection: Sentence pairs from the Microsoft Research Paraphrase Corpus (MRPC) have been human-annotated according to whether they capture a paraphrase/semantic equivalence relationship. A 2-class classifier is trained on top, similar to SICK-E.
- Caption-Image retrieval: The caption-image retrieval task evaluates joint image and language feature models. A pairwise ranking loss L_cir(x, y) between matched and mismatched caption-image pairs is minimized (a sketch follows the task list).
- After training, Recall@K is computed.
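For the caption-image retrieval task, a hedged sketch of a pairwise ranking loss over cosine scores with in-batch negatives is given below; the margin value and the use of already-projected joint-space image/caption vectors are assumptions, not necessarily the paper's exact L_cir:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(img, cap, margin=0.2):
    """Contrastive ranking loss over a batch of matched image/caption embeddings
    (illustrative sketch; margin and in-batch negative sampling are assumptions)."""
    img = F.normalize(img, dim=-1)               # (batch, d) joint-space image vectors
    cap = F.normalize(cap, dim=-1)               # (batch, d) joint-space caption vectors
    scores = img @ cap.t()                       # cosine similarities s(x, y)
    positives = scores.diag().unsqueeze(1)       # matched pairs on the diagonal
    # Penalize any mismatched caption/image that comes within `margin` of the positive.
    cost_cap = (margin - positives + scores).clamp(min=0)       # rank captions per image
    cost_img = (margin - positives.t() + scores).clamp(min=0)   # rank images per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_cap.sum() + cost_img.sum()

# Example: random 8-pair batch of 512-d joint-space embeddings
loss = pairwise_ranking_loss(torch.randn(8, 512), torch.randn(8, 512))
```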
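For the classification-style transfer tasks above (e.g., MR, CR, SUBJ, MPQA, SST, TREC, SICK-E, MRPC), the protocol is to freeze the sentence encoder and train a simple classifier on top of the embeddings. A minimal scikit-learn sketch with random stand-in features (in practice these would be the 4096-d encoder outputs for each sentence):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in for embeddings produced by the frozen sentence encoder.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 4096)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(50, 4096)), rng.integers(0, 2, size=50)

clf = LogisticRegression(max_iter=1000)   # simple classifier on top of frozen features
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```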
2.2. Results
- The BiLSTM-4096 with the max-pooling operation performs best on both SNLI and transfer tasks.
Increasing the embedding size leads to increased performance for almost all models.
BiLSTM-Max consistently outperforms SkipThought vectors.
Pre-trained representations, such as ResNet image features and the proposed sentence embeddings, achieve results competitive with features learned directly on the objective task.