Brief Review — Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

Augmented SBERT (AugSBERT)

Sik-Ho Tsang
Apr 30, 2024

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
Augmented SBERT (AugSBERT), by Technische Universität Darmstadt
2021 NAACL, Over 180 Citations (Sik-Ho Tsang @ Medium)

Dense Text Retrieval
2019 [Sentence-BERT (SBERT)] 2020 [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] 2021 [Fusion-in-Decoder]

  • Cross-encoders, which perform full attention over the input pair, often achieve higher performance, but they are too slow for many practical use cases.
  • Bi-encoders, which map each input independently to a dense vector space, require substantial training data and fine-tuning over the target task to achieve competitive performance.
  • In this paper, a simple yet efficient data augmentation strategy called Augmented SBERT is proposed, where the cross-encoder is used to label a larger set of input pairs to augment the training data for the bi-encoder.


  1. Augmented SBERT (AugSBERT)
  2. Results

1. Augmented SBERT


1.1. Overall Workflow

  • Given a pre-trained, well-performing cross-encoder, sentence pairs are sampled according to a chosen sampling strategy and labeled using the cross-encoder. These weakly labeled examples are called the silver dataset, and they are merged with the gold training dataset.
  • The bi-encoder is then trained on this extended training dataset.
  • The resulting model is referred to as Augmented SBERT (AugSBERT).
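
The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_cross_encoder` is a hypothetical stand-in (plain word overlap) for a fine-tuned cross-encoder, `build_training_set` is a made-up helper name, and pairs are drawn at random for simplicity.

```python
import random
from itertools import combinations


def toy_cross_encoder(a: str, b: str) -> float:
    """Hypothetical stand-in for a fine-tuned cross-encoder.

    Returns the Jaccard word overlap in [0, 1]; a real cross-encoder
    would be a BERT model scoring the concatenated pair.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_training_set(gold, sentences, score_fn, n_pairs, seed=0):
    """Sample unlabeled pairs, label them with score_fn (silver), merge with gold."""
    rng = random.Random(seed)
    # Do not re-label pairs that already have a gold annotation.
    gold_keys = {frozenset((a, b)) for a, b, _ in gold}
    candidates = [p for p in combinations(sentences, 2)
                  if frozenset(p) not in gold_keys]
    sampled = rng.sample(candidates, min(n_pairs, len(candidates)))
    silver = [(a, b, score_fn(a, b)) for a, b in sampled]
    # The bi-encoder would then be trained on gold + silver.
    return gold + silver, silver


gold = [("a cat sits", "a cat is sitting", 0.9)]
sentences = ["a cat sits", "a cat is sitting", "a dog runs", "the dog is running"]
train, silver = build_training_set(gold, sentences, toy_cross_encoder, n_pairs=3)
```

Here `train` holds the one gold pair followed by three silver pairs. Swapping the random draw for one of the sampling strategies in the next subsection only changes how `candidates` is selected.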

1.2. Pair Sampling Strategies

  • Different sampling strategies are tested.
  1. Random Sampling (RS): Sentence pairs are sampled at random. Two random sentences are usually dissimilar, so the silver dataset is heavily skewed towards negative pairs.
  2. Kernel Density Estimation (KDE): The aim is a silver label distribution similar to the gold one. KDE estimates the score densities of the gold and silver sets, and the KL divergence between them is reduced by a sampling function that retains a sample with score s with probability Q(s). This is computationally inefficient, because many randomly drawn pairs must be labeled and are then discarded.
  3. BM25 Sampling (BM25): BM25 is based on lexical overlap and is widely used as a scoring function in information retrieval. For every sentence, the top k most similar sentences are retrieved.
  4. Semantic Search Sampling (SS): A drawback of BM25 is that only sentences with lexical overlap can be found. Instead, sentences are embedded with a bi-encoder, and for every sentence the top k most similar sentences in the collection are retrieved by cosine similarity.
  5. BM25 + Semantic Search Sampling (BM25-S.S.): Both BM25 and Semantic Search (S.S.) sampling are applied together, which helps capture both lexically and semantically similar sentences but skews the label distribution towards negative pairs.
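
Two of the strategies above are easy to illustrate in plain Python. The sketch below is not the paper's code. `semantic_search_pairs` assumes the sentence embeddings are already available as lists of floats. `keep_probability` assumes that Q(s) equals 1 where the gold density is at least the silver density and equals their ratio F_gold(s) / F_silver(s) otherwise, which is my reading of the KDE strategy. The Gaussian kernel and the `bandwidth` value are also assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def semantic_search_pairs(embeddings, k):
    """Pair every sentence index with its k most cosine-similar other sentences."""
    pairs = set()
    for i, u in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(u, embeddings[j]), reverse=True)
        for j in ranked[:k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair once
    return sorted(pairs)


def kde(scores, bandwidth=0.1):
    """Return a Gaussian kernel density estimate built from a list of scores."""
    def density(s):
        total = sum(math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for x in scores)
        return total / (len(scores) * bandwidth * math.sqrt(2 * math.pi))
    return density


def keep_probability(s, f_gold, f_silver):
    """Assumed Q(s): keep under-represented scores, thin out over-represented ones."""
    g, sv = f_gold(s), f_silver(s)
    return 1.0 if g >= sv else g / sv


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
candidate_pairs = semantic_search_pairs(embeddings, k=1)

f_gold = kde([0.1, 0.5, 0.9])
f_silver = kde([0.1, 0.1, 0.1, 0.5])
q_low = keep_probability(0.1, f_gold, f_silver)   # silver has too many low scores
q_high = keep_probability(0.9, f_gold, f_silver)  # silver has too few high scores
```

With these toy inputs, low-scoring silver pairs are kept with probability below 1, while high-scoring ones are always kept, which pulls the silver score distribution towards the gold one.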

1.3. Domain Adaptation Using Augmented SBERT

  • Annotated data for new domains is rarely available.
  • The proposed data augmentation strategy can be used for domain adaptation:
  1. The cross-encoder (BERT) is first fine-tuned over the source domain containing pairwise annotations.
  2. After fine-tuning, this fine-tuned cross-encoder is used to label the target domain.
  3. Once labeling is complete, the bi-encoder (SBERT) is trained on the labeled target-domain sentence pairs.
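
The three steps can be outlined as a small pipeline. Everything in this sketch is a placeholder chosen for illustration: `fit_toy_cross_encoder` "fine-tunes" by picking a word-overlap threshold on the source domain instead of training BERT, and `fit_bi_encoder` is passed in as a callable because real SBERT training is out of scope here. All names and example sentences are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def fit_toy_cross_encoder(source_gold):
    """Step 1 stand-in: choose a decision threshold from labeled source pairs.

    Assumes source_gold contains at least one positive (1) and one negative (0) pair.
    """
    pos = [jaccard(a, b) for a, b, y in source_gold if y == 1]
    neg = [jaccard(a, b) for a, b, y in source_gold if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda a, b: int(jaccard(a, b) >= threshold)


def label_target(cross_encoder, target_pairs):
    """Step 2: weakly label unlabeled target-domain pairs."""
    return [(a, b, cross_encoder(a, b)) for a, b in target_pairs]


def domain_adapt(source_gold, target_pairs, fit_cross_encoder, fit_bi_encoder):
    """Run steps 1 to 3 and return the trained bi-encoder plus the silver data."""
    cross_encoder = fit_cross_encoder(source_gold)             # step 1
    silver_target = label_target(cross_encoder, target_pairs)  # step 2
    return fit_bi_encoder(silver_target), silver_target        # step 3


source_gold = [
    ("how do i reset my password", "how can i reset my password", 1),
    ("how do i reset my password", "what is the weather today", 0),
]
target_pairs = [
    ("my phone has no signal", "my phone has no service"),
    ("my phone has no signal", "cancel my plan"),
]
# Placeholder "training": just record how many silver pairs were used.
model, silver_target = domain_adapt(
    source_gold, target_pairs, fit_toy_cross_encoder,
    fit_bi_encoder=lambda silver: {"trained_on": len(silver)},
)
```

Note that no target-domain annotation is used anywhere: the only labels the bi-encoder sees on the target domain come from the source-trained cross-encoder.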

2. Results

2.1. Datasets

Summary of Datasets
  • Regression tasks assign a score to indicate the similarity between the inputs.
  • For classification tasks, distinct labels, for example, paraphrase vs. non-paraphrase, are used.
In-Domain Tasks

The proposed AugSBERT approach improves performance on all tasks by 1 to 6 points, significantly outperforming the existing bi-encoder SBERT and narrowing the performance gap to the cross-encoder BERT.

  • It outperforms the synonym replacement data augmentation technique (NLPAug) for all tasks.

Given that BM25 is the most computationally efficient sampling strategy and also creates smaller silver datasets (numbers are given in Appendix F, Table 11), it is likely the best choice for practical applications.

Out-Of-Domain Tasks

In almost all combinations, AugSBERT outperforms SBERT trained on out-of-domain data (cross-domain). On the Sprint dataset (target), the improvement can be as large as 37 points. In a few cases, AugSBERT even outperforms SBERT trained on gold in-domain target data.

  • It is observed that AugSBERT benefits a lot when the source domain is rather generic (e.g. Quora) and the target domain is rather specific (e.g. Sprint).


