Brief Review — SimCSE: Simple Contrastive Learning of Sentence Embeddings

Dropout Approach for Contrastive Learning of Sentence Embeddings

Sik-Ho Tsang

SimCSE: Simple Contrastive Learning of Sentence Embeddings
SimCSE, by Princeton University and Tsinghua University
2021 EMNLP, Over 2800 Citations (Sik-Ho Tsang @ Medium)

Sentence Embedding / Dense Text Retrieval
2017 [InferSent] 2018 [Universal Sentence Encoder (USE)] 2019 [Sentence-BERT (SBERT)] 2020 [Multilingual Sentence-BERT] [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)] 2024 [Multilingual E5]
==== My Other Paper Readings Are Also Over Here ====

  • The Simple Contrastive Learning of Sentence Embeddings (SimCSE) framework is proposed. In its unsupervised form, it takes an input sentence and predicts itself with a contrastive objective, using only standard Dropout as noise.
  • This simple method works surprisingly well: Dropout acts as minimal data augmentation, and removing it leads to representation collapse.

Outline

  1. SimCSE
  2. Results

1. SimCSE

1.1. Preliminaries: Contrastive Learning

  • Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors.
  • Let hi and h+i denote the representations of xi and x+i. The training objective for (xi, x+i) with a mini-batch of N pairs is a cross-entropy loss with in-batch negatives, written out after this list.
  • (It is assumed that contrastive learning is well studied before reading this story.)
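Concretely, with τ a temperature hyperparameter and sim(h1, h2) the cosine similarity between two embeddings, the loss for the i-th pair is:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i,\,h_j^{+})/\tau}}$$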

1.2. Unsupervised SimCSE

Unsupervised SimCSE and Supervised SimCSE
  • We have a collection of sentences {xi}, where i ranges from 1 to m, and use x+i = xi.

To be clear, x+i is shorthand for x^(+)_i, as in the equation above.

  • The key ingredient to get this to work with identical positive pairs is the use of independently sampled Dropout masks for xi and x+i.
  • In standard training of Transformers, Dropout masks are placed on the fully-connected layers as well as on the attention probabilities.
  • Let z denote a random Dropout mask.

In SimCSE, the same input is fed to the encoder twice to get two embeddings with different Dropout masks z and z’, and the training objective is:
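Writing hi^z for the embedding of xi produced by the encoder under Dropout mask z, the loss for the i-th sentence is:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i},\,h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i},\,h_j^{z_j'})/\tau}}$$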

  • where N is the mini-batch size.
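A minimal sketch of this "encode the same batch twice" trick is given below. This is not the authors' released code: the base checkpoint, temperature value, and example sentences are illustrative assumptions.

```python
# Minimal sketch of the unsupervised SimCSE objective, assuming a Hugging Face
# BERT encoder. Checkpoint name, temperature, and batch contents are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"               # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.train()                                # keep Dropout active: it is the only "noise"

sentences = ["A man is playing a guitar.", "The weather is nice today."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

# Encode the same batch twice; independently sampled Dropout masks (z, z')
# yield two different embeddings for the same sentence.
h1 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings under mask z
h2 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings under mask z'

tau = 0.05                                     # temperature (illustrative value)
sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / tau  # (N, N)

# Cross-entropy with in-batch negatives: the diagonal entries are the positive pairs.
labels = torch.arange(sim.size(0))
loss = F.cross_entropy(sim, labels)
loss.backward()
```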

1.3. Supervised SimCSE

  • In this setting, supervised datasets are leveraged to provide better training signals.
  • The entailment pairs from the NLI datasets (SNLI + MNLI) are used, where the relation between two sentences can be entailment, neutral, or contradiction.
  • In NLI datasets, given one premise, annotators are required to manually write one sentence that is absolutely true (entailment), one that might be true (neutral), and one that is definitely false (contradiction). Therefore, for each premise and its entailment hypothesis, there is an accompanying contradiction hypothesis.

Thus, the contradiction hypotheses can be added as hard negatives, which extends the above objective to the one below:
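With (xi, x+i, x−i) denoting a (premise, entailment hypothesis, contradiction hypothesis) triplet, the supervised objective becomes:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i,\,h_j^{+})/\tau} + e^{\mathrm{sim}(h_i,\,h_j^{-})/\tau} \right)}$$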

2. Results

SOTA Comparisons on 7 STS Tasks
  • Pre-trained checkpoints of BERT and RoBERTa are used.
  • [CLS] representation is used as the sentence embedding.
  • Unsupervised SimCSE is trained on 10⁶ randomly sampled sentences from English Wikipedia.
  • Supervised SimCSE is trained on the combination of MNLI and SNLI datasets (314k).

SimCSE substantially improves results on all the datasets, with or without extra NLI supervision, greatly outperforming previous state-of-the-art models such as GloVe embeddings, the BERT baselines, and Sentence-BERT (SBERT).
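As an illustration of how such an encoder is used at inference time, here is a minimal sketch that embeds two sentences with the [CLS] representation and compares them by cosine similarity. The checkpoint name is an assumed Hugging Face Hub identifier for a released supervised SimCSE model; any SimCSE-style encoder would be used the same way.

```python
# Minimal usage sketch: sentence similarity with a SimCSE-style encoder.
# The checkpoint name below is an assumed Hub identifier.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "princeton-nlp/sup-simcse-bert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.eval()                                 # no Dropout at inference time

sentences = ["A man is playing music.", "A man plays a guitar."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    cls = encoder(**batch).last_hidden_state[:, 0]   # [CLS] sentence embeddings

score = F.cosine_similarity(cls[0], cls[1], dim=0)
print(f"cosine similarity: {score.item():.3f}")
```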

Top-3 examples


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.