Brief Review — Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Approximate nearest neighbor Negative Contrastive Learning (ANCE)

Sik-Ho Tsang

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
ANCE, by Microsoft Corporation
2021 ICLR, Over 1000 Citations (Sik-Ho Tsang @ Medium)

Sentence Embedding / Dense Text Retrieval
2017 [InferSent] 2018 [Universal Sentence Encoder (USE)] 2019 [Sentence-BERT (SBERT)] 2020 [Multilingual Sentence-BERT] [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] [IS-BERT] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)] [SimCSE] 2024 [Multilingual E5]

  • The bottleneck of dense retrieval (DR) is the domination of uninformative negatives sampled in mini-batch training.
  • Approximate nearest neighbor Negative Contrastive Learning (ANCE) is proposed, which selects hard training negatives globally from the entire corpus.

Outline

  1. Preliminaries
  2. ANCE
  3. Results

1. Preliminaries

1.1. Dense Retrieval (DR)

  • Given a query q and a corpus C, first-stage retrieval finds a set of documents relevant to the query, D+ = {d1, …, dn}. Instead of sparse term matching with an inverted index, Dense Retrieval (DR) calculates the retrieval score f() using similarities in a learned embedding space:
      f(q, d) = sim(g(q; θ), g(d; θ))
  • where g() is the representation model with parameters θ, which encodes the query or document into a dense embedding. The similarity function sim() is often simply cosine or dot product, so that efficient approximate nearest neighbor (ANN) retrieval can be used.
  • A standard instantiation is the BERT-Siamese (two-tower, dual-encoder) model:
      f(q, d) = BERT(q) · BERT(d)
  • It encodes the query and the document separately with BERT as the encoder g(), takes the last layer’s [CLS] token representation of each, and applies a dot product (·) to them.
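To make the dual-encoder scoring concrete, here is a minimal sketch in plain Python. It is not the model from the paper: `toy_encode` is a hypothetical stand-in for the BERT encoder g() (it just hashes tokens into a small vector), and `DIM` and `score` are names invented for this illustration. The only point is that query and document go through the same encoder independently and are then compared with a dot product.

```python
import hashlib
import math

DIM = 8  # toy embedding size, chosen only for this sketch


def toy_encode(text: str) -> list[float]:
    """Hypothetical stand-in for g(): hash each token into a DIM-d vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product equals cosine


def score(query: str, doc: str) -> float:
    """f(q, d) = g(q) . g(d): the same encoder is applied to both sides."""
    q_vec, d_vec = toy_encode(query), toy_encode(doc)
    return sum(a * b for a, b in zip(q_vec, d_vec))


docs = ["dense retrieval with bert", "recipe for apple pie"]
# Documents never see the query, so they can be encoded once, offline.
doc_vecs = [toy_encode(d) for d in docs]
```

Because the two sides are encoded separately, document embeddings can be precomputed and indexed, which is what makes ANN retrieval applicable at query time.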

1.2. Negative Sampling

  • The effectiveness of DR resides in learning a good representation space that maps query and relevant documents together, while separating irrelevant ones.
  • Given a query q, its set of relevant documents D+ and its irrelevant ones D-, find the best θ* that minimizes the loss over all (q, d+, d-) triples:
      θ* = argmin_θ Σ_q Σ_{d+ ∈ D+} Σ_{d- ∈ D-} l(f(q, d+), f(q, d-))
  • The loss l() can be binary cross entropy (BCE), hinge loss, or negative log likelihood (NLL).
  • A unique challenge in dense retrieval, which targets first-stage retrieval, is that the irrelevant documents to separate come from the entire corpus (D- = C \ D+). This often means millions of negative instances, so the negatives have to be sampled in training, with a sampled subset used in place of the full D-.
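As a rough illustration of why the choice of sampled negatives matters, the sketch below implements two of the losses named above for a single query. The function names and the example scores are made up for this illustration; the NLL form used here (softmax of the positive against the sampled negatives) is one common instantiation and is an assumption, not necessarily the exact form used in the paper.

```python
import math


def nll_loss(pos_score: float, neg_scores: list[float]) -> float:
    """Negative log of the softmax probability assigned to the positive."""
    scores = [pos_score] + list(neg_scores)
    max_s = max(scores)  # subtract the max for numerical stability
    log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
    return log_z - pos_score


def hinge_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Pairwise hinge: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - (pos_score - neg_score))


# Negatives the model already scores far below the positive.
easy = nll_loss(5.0, [-3.0, -4.0])
# Negatives the model scores almost as high as the positive.
hard = nll_loss(5.0, [4.8, 4.5])
```

In this toy setup `easy` is close to zero while `hard` is much larger, which matches the intuition that negatives the model already separates well carry little training signal.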

2. ANCE

2.1. Top Retrieved Documents by ANN Index

  • ANCE selects negatives from the entire corpus using an asynchronously updated ANN index. It samples negatives from the top documents that the current DR model retrieves from the ANN index:
      θ* = argmin_θ Σ_q Σ_{d+ ∈ D+} Σ_{d- ∈ D-ANCE} l(f(q, d+), f(q, d-))
  • where D-ANCE = ANN_f(q,d) \ D+, and ANN_f(q,d) denotes the top documents retrieved by f() from the ANN index.
  • By definition, the above D-ANCE are the hardest negatives for the current DR model.
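The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: exhaustive dot-product search stands in for the ANN index, the embeddings are hand-written 2-d vectors, and `ance_negatives` (including its over-fetch so that k negatives survive after removing positives) is a hypothetical helper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ance_negatives(query_vec, doc_vecs, positive_ids, k=2):
    """Top-ranked documents by f(q, d) over the whole corpus, minus the positives.

    Exhaustive search is used here; a real system would query an ANN index.
    """
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    # Over-fetch so that k negatives remain after the positives are removed.
    top = ranked[: k + len(positive_ids)]
    return [doc_id for doc_id in top if doc_id not in positive_ids][:k]


doc_vecs = {
    "d1": [0.9, 0.1],   # labeled relevant
    "d2": [0.8, 0.3],   # close to the query but not labeled relevant
    "d3": [0.7, 0.2],
    "d4": [-0.5, 0.9],  # clearly unrelated
}
query_vec = [1.0, 0.0]
negatives = ance_negatives(query_vec, doc_vecs, positive_ids={"d1"}, k=2)
```

Here the unrelated document `d4` is never selected: the negatives are exactly the non-relevant documents that the current model ranks highest.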

2.2. Model

  • BERT-Siamese is used, with shared encoder weights between q and d and negative log likelihood (NLL) loss.

2.3. Asynchronous Index Refresh


An asynchronous index refresh is used: the ANN index is refreshed only once every m batches, i.e., with checkpoint fk. While the index for fk is being rebuilt, the Trainer continues its stochastic learning in parallel, using the negatives D-fk-1 from the previous index ANN_fk-1.
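The staleness this introduces can be seen in a small schedule simulation. It is a simplified sketch: the function name is invented, and it assumes that checkpoint k is saved at batch k * m, that rebuilding the index from a checkpoint takes exactly m batches, and that an initial index (version 0) exists before training starts.

```python
def index_version_per_batch(num_batches: int, m: int) -> list[int]:
    """For each training batch, return which checkpoint's ANN index supplies negatives.

    Assumptions (for illustration only): checkpoint k is saved at batch k * m,
    rebuilding the index takes exactly m batches, and index version 0 is
    available before the first batch.
    """
    versions = []
    for batch in range(num_batches):
        latest_checkpoint = batch // m          # f_k: index still being rebuilt
        in_use = max(0, latest_checkpoint - 1)  # f_{k-1}: last finished index
        versions.append(in_use)
    return versions


schedule = index_version_per_batch(num_batches=9, m=3)
# schedule == [0, 0, 0, 0, 0, 0, 1, 1, 1]
```

Under these assumptions the Trainer always draws negatives from an index that is one refresh behind, and a smaller m shortens that lag at the cost of more frequent corpus re-encoding.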

3. Results

TREC 2019 Deep Learning Track

ANCE significantly outperforms all sparse retrieval baselines.

OpenQA Experiments: Natural Questions (NQ) and TriviaQA (TQA)

In OpenQA, ANCE outperforms DPR and its fusion with BM25 (DPR+BM25) in retrieval accuracy.

OpenQA Test Scores

ANCE also improves end-to-end QA accuracy when the same readers as in previous state-of-the-art systems are used but the retriever is replaced with ANCE.

Effectiveness

ANCE’s effectiveness is even more evident in a real production system.
