Brief Review — An Unsupervised Sentence Embedding Method by Mutual Information Maximization
IS-BERT, Unsupervised Approach Using Mutual Information (MI)
An Unsupervised Sentence Embedding Method by Mutual Information Maximization
IS-BERT, by Singapore University of Technology and Design; DAMO Academy, Alibaba Group; and ZJU-UIUC Institute
2020 EMNLP, Over 180 Citations (Sik-Ho Tsang @ Medium)
Sentence Embedding / Dense Text Retrieval
2017 [InferSent] 2018 [Universal Sentence Encoder (USE)] 2019 [Sentence-BERT (SBERT)] 2020 [Multilingual Sentence-BERT] [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)] 2024 [Multilingual E5]
==== My Other Paper Readings Are Also Over Here ====
- Sentence-BERT (SBERT) is good at obtaining sentence embeddings. However, it needs to be trained on a corpus with high-quality labeled sentence pairs.
- In this paper, Info-Sentence BERT (IS-BERT) is proposed, which uses a novel self-supervised learning objective based on mutual information (MI) maximization to derive meaningful sentence embeddings in an unsupervised manner.
Outline
- IS-BERT
- Results
1. IS-BERT
1.1. Model
- BERT is used to encode an input sentence x to a length-l sequence of token embeddings h1, h2, …, hl.
- 1-D convolutional neural network (CNN) layers with different window (kernel) sizes are applied on top of these token embeddings to capture the n-gram local contextual dependencies of the input sentence.
- Formally, an n-gram embedding ci generated by a CNN with window size k is computed as ci = f(w · hi:i+k-1 + b),
- where hi:i+k-1 is the concatenation of the token embeddings within a window, w and b are the convolution parameters, and f is the ReLU activation.
The final local representation of a token is the concatenation of its representations obtained with different window sizes.
- Fθ is the encoding function consisting of BERT and the CNNs, with trainable parameters θ.
The global sentence representation of x, denoted Eθ(x), is computed by applying a mean-over-time pooling layer on the token representations Fθ(x).
Both the sentence representation and the token representations are then fed into a discriminator network to produce scores for MI estimation, as sketched below.
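To make the pipeline above concrete, here is a minimal PyTorch sketch of such an encoder, assuming Hugging Face transformers. The window sizes, channel count, and padding choices are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ISBertEncoder(nn.Module):
    """Minimal sketch: BERT token embeddings -> parallel 1-D CNNs with
    different window sizes -> concatenation (local token representations
    F_theta(x)) -> mean-over-time pooling (global representation E_theta(x))."""

    def __init__(self, model_name="bert-base-uncased",
                 window_sizes=(1, 3, 5), out_channels=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # One Conv1d per window size; odd kernels with k//2 padding keep the sequence length.
        self.convs = nn.ModuleList([
            nn.Conv1d(hidden, out_channels, kernel_size=k, padding=k // 2)
            for k in window_sizes
        ])
        self.relu = nn.ReLU()

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden): token embeddings h_1 ... h_l from BERT
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)  # (batch, hidden, seq_len) for Conv1d
        # c_i = ReLU(w . h_{i:i+k-1} + b) for each window size k, then concatenate
        local = torch.cat([self.relu(conv(h)) for conv in self.convs], dim=1)
        local = local.transpose(1, 2)  # (batch, seq_len, len(window_sizes) * out_channels)
        # Mean-over-time pooling over non-padding tokens -> global sentence embedding
        mask = attention_mask.unsqueeze(-1).float()
        global_repr = (local * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return local, global_repr
```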
1.2. MI Maximization Learning
The learning objective is to maximize the mutual information (MI) I(Eθ(x); F(i)θ(x)) between the global sentence representation Eθ(x) and each of its local token representations F(i)θ(x).
- The Jensen-Shannon (JS) MI estimator is used in this paper:
I_JSD(F(i)θ(x); Eθ(x)) = E_P[-sp(-Tω(F(i)θ(x), Eθ(x)))] - E_P×P̃[sp(Tω(F(i)θ(x'), Eθ(x)))]
It takes all pairs of a global sentence embedding and local token embeddings as input, where the discriminator Tω generates the corresponding scores to estimate the MI above.
- x' is a negative sample drawn from the distribution P̃, and sp is the softplus activation.
The overall learning objective over the whole dataset X is to maximize this estimator, averaged over all token positions of all sentences in X.
- In practice, given a batch of sentences, each sentence and its local context representations are treated as positive examples, and all the local context representations from other sentences in this batch are treated as negative examples.
- By maximizing this MI, the encoder is pushed to capture information that is shared across all local segments of the input sentence while being distinct from other sentences, which leads to expressive sentence representations. (A sketch of this objective is given below.)
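As a rough illustration, the sketch below computes this batch-wise JSD objective in PyTorch. The discriminator argument stands in for the paper's scoring network (its architecture is not reproduced here) and is assumed to score every (global sentence embedding, local token embedding) pair across the batch; tensor shapes follow the encoder sketch above.

```python
import torch
import torch.nn.functional as F

def jsd_mi_loss(local, global_repr, attention_mask, discriminator):
    """Jensen-Shannon MI estimator as a training loss (minimal sketch).

    local        : (batch, seq_len, dim)  local token representations
    global_repr  : (batch, dim)           global sentence representations
    discriminator: assumed to return scores of shape (batch, batch, seq_len),
                   where scores[i, j, t] pairs sentence i's global embedding
                   with token t of sentence j.
    Positive pairs: i == j (a sentence with its own tokens).
    Negative pairs: i != j (tokens of other sentences in the batch)."""
    batch = local.size(0)
    mask = attention_mask.float()                     # (batch, seq_len), 1 for real tokens

    scores = discriminator(global_repr, local)        # (batch, batch, seq_len)

    eye = torch.eye(batch, device=local.device).unsqueeze(-1)   # (batch, batch, 1)
    pos_mask = eye * mask.unsqueeze(0)                # same-sentence, non-pad pairs
    neg_mask = (1.0 - eye) * mask.unsqueeze(0)        # cross-sentence, non-pad pairs

    # JSD lower bound: E_pos[-softplus(-T)] - E_neg[softplus(T)], to be maximized
    e_pos = (-F.softplus(-scores) * pos_mask).sum() / pos_mask.sum().clamp(min=1.0)
    e_neg = (F.softplus(scores) * neg_mask).sum() / neg_mask.sum().clamp(min=1.0)

    return -(e_pos - e_neg)   # minimize the negative bound
```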
2. Results
- Both the [CLS] token embedding and the average of BERT token embeddings perform worse than averaged GloVe embeddings.
- Second, all supervised methods outperform the unsupervised baselines, which suggests that the knowledge obtained from supervised learning on NLI can be transferred well to these STS tasks.
On the other hand, on average, IS-BERT-NLI significantly outperforms other unsupervised baselines. It even outperforms InferSent trained on labeled SNLI and MultiNLI datasets in 5 out of 7 tasks.
- As expected, IS-BERT-NLI is in general inferior to these two supervised baselines, but it achieves performance comparable to them in certain scenarios.
- The STS tasks in Table 1 are not task-specific, while the AFS dataset is more task-specific. Here, models are trained without task- or domain-specific labeled data.
IS-BERT-AFS clearly outperforms other models in this setting.
- Overall, supervised methods outperform the unsupervised baselines. This indicates that pretraining the sentence encoder with high-quality labeled data such as NLI is helpful in a supervised transfer learning setting.
- IS-BERT-task: IS-BERT is trained on each task-specific dataset (without labels) to produce sentence embeddings, which are then used to train downstream classifiers.
IS-BERT-task outperforms the other unsupervised baselines on 6 out of 7 tasks, and it is on par with InferSent and USE, which are strong supervised baselines trained on the NLI task. This demonstrates the effectiveness of the proposed model in learning domain-specific sentence embeddings.
- IS-BERT-STSb (ssl+ft): IS-BERT is first trained on the training set without labels using the self-supervised learning objective. Then, it is fine-tuned on the labeled data with a regression objective (see the sketch below).
BERT and SBERT perform similarly on this task. IS-BERT-STSb (ssl+ft) outperforms both baselines.
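For illustration only, a hypothetical two-stage loop for this ssl+ft setup could look as follows, reusing the encoder and jsd_mi_loss sketches above. The SBERT-style cosine-similarity regression loss, the optimizer, and all hyperparameters are assumptions; the paper's exact fine-tuning recipe is not reproduced here.

```python
import torch
import torch.nn.functional as F

def train_ssl_then_ft(encoder, discriminator, unlabeled_loader, stsb_loader,
                      ssl_epochs=1, ft_epochs=4, lr=2e-5):
    """Hypothetical two-stage training for IS-BERT-STSb (ssl+ft)."""
    # Stage 1: self-supervised MI maximization on the (unlabeled) training sentences.
    opt = torch.optim.AdamW(
        list(encoder.parameters()) + list(discriminator.parameters()), lr=lr)
    for _ in range(ssl_epochs):
        for input_ids, attention_mask in unlabeled_loader:
            local, global_repr = encoder(input_ids, attention_mask)
            loss = jsd_mi_loss(local, global_repr, attention_mask, discriminator)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: supervised fine-tuning on labeled STSb pairs with a regression
    # objective (cosine similarity vs. gold score -- an assumed formulation).
    opt = torch.optim.AdamW(encoder.parameters(), lr=lr)
    for _ in range(ft_epochs):
        for ids_a, mask_a, ids_b, mask_b, gold in stsb_loader:
            _, emb_a = encoder(ids_a, mask_a)
            _, emb_b = encoder(ids_b, mask_b)
            pred = F.cosine_similarity(emb_a, emb_b)   # (batch,)
            loss = F.mse_loss(pred, gold)              # gold assumed pre-scaled to [0, 1]
            opt.zero_grad()
            loss.backward()
            opt.step()
```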