Brief Review — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Sentence-BERT (SBERT) for Sentence Similarity Search

Sik-Ho Tsang
Dec 18, 2023

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT (SBERT), by Technische Universität Darmstadt
2019 EMNLP, Over 7900 Citations (Sik-Ho Tsang @ Medium)

Dense Text Retrieval
2020 [Retrieval-Augmented Generation (RAG)]

  • Originally, with BERT, finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours).
  • In this paper, Sentence-BERT (SBERT) is proposed, a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
  • SBERT reduces the similarity search time to about 5 seconds.
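The "about 50 million" figure follows from simple combinatorics: a cross-encoder such as BERT has to score every one of the n(n-1)/2 unordered sentence pairs, while an SBERT-style encoder embeds each of the n sentences once and then compares vectors. A minimal sketch of that arithmetic (the function names are illustrative, not from the paper):

```python
def cross_encoder_inferences(n: int) -> int:
    """Unordered sentence pairs, each needing one forward pass through BERT."""
    return n * (n - 1) // 2


def bi_encoder_inferences(n: int) -> int:
    """Forward passes needed when every sentence is embedded exactly once."""
    return n


n = 10_000
print(cross_encoder_inferences(n))  # 49995000, i.e. about 50 million
print(bi_encoder_inferences(n))     # 10000
```

The remaining pairwise work in the second case is cosine similarity between fixed-size vectors, which is far cheaper than a transformer forward pass.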


  1. Sentence-BERT (SBERT)
  2. Results

1. Sentence-BERT (SBERT)

  • SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-size sentence embedding.
  • 3 pooling strategies: Using the output of the CLS-token, computing the mean of all output vectors (MEAN-strategy), and computing a max-over-time of the output vectors (MAX-strategy). The default configuration is MEAN.
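The three pooling strategies can be sketched in plain Python as follows. This is only an illustration on a toy list of token vectors: the function names are made up here, the values are invented, and a real implementation would also have to ignore padding tokens when averaging or taking the maximum.

```python
from typing import List

Vector = List[float]


def cls_pooling(token_vectors: List[Vector]) -> Vector:
    """Use the output vector of the first ([CLS]) token."""
    return list(token_vectors[0])


def mean_pooling(token_vectors: List[Vector]) -> Vector:
    """Average each dimension over all token output vectors (MEAN-strategy)."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def max_pooling(token_vectors: List[Vector]) -> Vector:
    """Take the per-dimension maximum over all token vectors (MAX-strategy)."""
    return [max(dim) for dim in zip(*token_vectors)]


# Toy "BERT output": 3 tokens with 2 dimensions each (made-up values).
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(cls_pooling(tokens))   # [1.0, 4.0]
print(mean_pooling(tokens))  # [2.0, 2.0]
print(max_pooling(tokens))   # [3.0, 4.0]
```

Whatever the sentence length, each strategy returns a vector with the same dimensionality as a single token output, which is what makes the embedding fixed-size.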

1.1. Classification Objective Function (Figure 1)

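  • The sentence embeddings u and v are concatenated with the element-wise difference |u-v|, multiplied by the trainable weight W_t, and passed through a softmax: o = softmax(W_t(u, v, |u-v|)), where W_t ∈ R^(3n×k), n is the embedding dimension, and k is the number of labels.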
  • Cross-entropy loss is used.
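A rough sketch of this classification head, assuming plain Python lists instead of tensors. The helper names are hypothetical, and the weight matrix is stored here as k rows of length 3n (one row per label), which is just a layout choice for this example:

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(u: List[float], v: List[float], W_t: List[List[float]]) -> List[float]:
    """Softmax over W_t applied to the concatenation (u, v, |u - v|)."""
    features = u + v + [abs(a - b) for a, b in zip(u, v)]
    logits = [sum(w * f for w, f in zip(row, features)) for row in W_t]
    return softmax(logits)


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-probability assigned to the gold label."""
    return -math.log(probs[label])


# Toy example: n = 2 dimensions, k = 3 labels, untrained all-zero weights.
u, v = [1.0, 0.0], [0.0, 1.0]
W_t = [[0.0] * 6 for _ in range(3)]
probs = classify(u, v, W_t)
print(probs)                    # uniform over the 3 labels, since all weights are zero
print(cross_entropy(probs, 0))  # equals log(3) for a uniform prediction
```

During training, the gradient of the cross-entropy loss would flow back through W_t and, via u and v, into the shared encoder; that part is omitted here.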

1.2. Regression Objective Function (Figure 2)

  • The cosine similarity between the two sentence embeddings u and v is computed. Mean-squared-error loss is used as the objective function.
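As a small illustration of the regression objective, the sketch below computes cosine similarity for two toy sentence pairs and the mean-squared error against gold scores. The embeddings and gold scores are made up, and the gold scores are assumed to be already scaled to the range of the cosine similarity:

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse_loss(predictions: List[float], targets: List[float]) -> float:
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Two toy sentence pairs (u, v) with made-up gold similarity scores.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
gold = [1.0, 0.5]

predicted = [cosine_similarity(u, v) for u, v in pairs]
print(predicted)                  # [1.0, 0.0]
print(mse_loss(predicted, gold))  # 0.125
```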

1.3. Triplet Objective Function

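  • Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. The loss to minimize is max(||s_a - s_p|| - ||s_a - s_n|| + ε, 0), where s_x is the sentence embedding of x and ε is the margin.
  • ||.|| is a distance metric. In this paper, Euclidean distance is used, with ε = 1.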
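A minimal sketch of a hinge-style triplet loss with Euclidean distance, on toy 2-dimensional embeddings. The function names are illustrative, and the default margin of 1.0 follows the paper's setting of ε = 1:

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def triplet_loss(s_a, s_p, s_n, margin: float = 1.0) -> float:
    """Zero once the positive is at least `margin` closer to the anchor than the negative."""
    return max(euclidean(s_a, s_p) - euclidean(s_a, s_n) + margin, 0.0)


anchor = [0.0, 0.0]
near = [3.0, 4.0]   # distance 5 from the anchor
far = [6.0, 8.0]    # distance 10 from the anchor

print(triplet_loss(anchor, near, far))  # 0.0: positive is already much closer
print(triplet_loss(anchor, far, near))  # 6.0: positive is farther than the negative
```

The second call shows the case the loss penalizes: the "positive" sits farther from the anchor than the "negative", so the loss is the distance gap plus the margin.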

2. Results

2.1. STS

Semantic Textual Similarity (STS) tasks
  • Directly using the output of BERT leads to rather poor performance. Averaging the BERT embeddings achieves an average correlation of only 54.81, and using the CLS-token output achieves an average correlation of only 29.19. Both are worse than computing average GloVe embeddings.

Using the described siamese network structure and fine-tuning mechanism substantially improves the correlation.

STS benchmark test set.

Two training strategies are compared: training only on STSb, and training first on NLI, then on STSb. The latter strategy leads to a slight improvement of 1–2 points.

2.2. AFS


Training SBERT in the 10-fold cross-validation setup gives a performance that is nearly on-par with BERT.

However, in the cross-topic evaluation, the performance of SBERT drops by about 7 points in Spearman correlation.

2.3. Wikipedia


SBERT clearly outperforms the BiLSTM approach by Dor et al.

2.4. SentEval

SentEval toolkit

SBERT is able to achieve the best performance in 5 out of 7 tasks. The average performance increases by about 2 percentage points compared to InferSent as well as the Universal Sentence Encoder.

2.5. Ablation Study

Ablation Study

The most important component is the element-wise difference |u-v|.

2.6. Computational Efficiency

Computational Speed

On GPU, SBERT with smart batching is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. Smart batching achieves a speed-up of 89% on CPU and 48% on GPU.
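In the paper, smart batching means grouping sentences of similar length so that each mini-batch is padded only to its longest sentence. The sketch below is one simple way to do this (sort by length, then slice into batches); the function names and the toy sentences are assumptions made for illustration, not the authors' implementation:

```python
from typing import List

Sentence = List[str]  # a tokenized sentence


def naive_batches(sentences: List[Sentence], batch_size: int) -> List[List[Sentence]]:
    """Batch sentences in their original order."""
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]


def smart_batches(sentences: List[Sentence], batch_size: int) -> List[List[Sentence]]:
    """Sort by length first, so each batch holds sentences of similar length."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padded_token_slots(batches: List[List[Sentence]]) -> int:
    """Token slots processed when every batch is padded to its longest sentence."""
    return sum(len(batch) * max(len(s) for s in batch) for batch in batches)


# Toy corpus with sentence lengths 2, 10, 3 and 9 tokens.
corpus = [["tok"] * n for n in (2, 10, 3, 9)]
print(padded_token_slots(naive_batches(corpus, 2)))  # 38
print(padded_token_slots(smart_batches(corpus, 2)))  # 26
```

Fewer padded slots means less wasted computation per batch, which is where the speed-up comes from.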


