Brief Review — Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Approximate nearest neighbor Negative Contrastive Learning (ANCE)

Oct 22, 2024

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
ANCE, by Microsoft Corporation
2021 ICLR, Over 1000 Citations (Sik-Ho Tsang @ Medium)
Sentence Embedding / Dense Text Retrieval
2017 [InferSent] 2018 [Universal Sentence Encoder (USE)] 2019 [Sentence-BERT (SBERT)] 2020 [Multilingual Sentence-BERT] [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] [IS-BERT] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)] [SimCSE] 2024 [Multilingual E5]

The bottleneck of dense retrieval (DR) is the domination of uninformative negatives sampled in mini-batch training.
Approximate nearest neighbor Negative Contrastive Learning (ANCE) is proposed, which selects hard training negatives globally from the entire corpus.

Outline

1. Preliminaries
2. ANCE
3. Results

1. Preliminaries

1.1. Dense Retrieval (DR)

Given a query q and a corpus C, first-stage retrieval aims to find the set of documents relevant to the query, D+ = {d1, …, dn}. Instead of sparse term matching over an inverted index, Dense Retrieval (DR) calculates the retrieval score f() using similarities in a learned embedding space:

f(q, d) = sim(g(q; θ), g(d; θ))

where g() is the representation model with parameters θ. The similarity function sim() is often simply cosine or dot product, so that efficient approximate nearest neighbor (ANN) retrieval can be used.
A standard instantiation is the BERT-Siamese (two-tower, dual-encoder) model:

f_BERT(q, d) = BERT(q) · BERT(d)

It encodes the query and the document separately with BERT as the encoder g(), takes the last layer’s [CLS] token representation of each, and applies a dot product (·) to them.
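The dual-encoder scoring described here can be sketched in a few lines. This is a minimal illustration only: `toy_encode` is a hypothetical stand-in for g() (a bag-of-characters hash embedding, not BERT), and the names `score` and `rank` are assumptions made for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product, used here as sim()."""
    return sum(a * b for a, b in zip(u, v))


def toy_encode(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for g(): counts characters into `dim` buckets.

    A real system would return BERT's last-layer [CLS] vector instead.
    """
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def score(query: str, doc: str,
          encode: Callable[[str], Vector] = toy_encode) -> float:
    """f(q, d) = sim(g(q), g(d)); the same encoder is applied to both sides."""
    return dot(encode(query), encode(doc))


def rank(query: str, docs: Sequence[str],
         encode: Callable[[str], Vector] = toy_encode) -> List[int]:
    """Return document indices ordered from highest to lowest score."""
    q_vec = encode(query)
    # Document vectors do not depend on the query, so they could be precomputed.
    scores = [dot(q_vec, encode(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because the query and the document are encoded independently, all document vectors can be computed once and stored in an index, which is what makes ANN retrieval applicable.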

1.2. Negative Sampling

The effectiveness of DR resides in learning a good representation space that maps a query and its relevant documents close together while separating irrelevant ones.
Given a query q, a set of its relevant documents D+ and irrelevant ones D-, find the best θ* such that:

θ* = argmin_θ Σ_q Σ_{d+ ∈ D+} Σ_{d- ∈ D-} l(f(q, d+), f(q, d-))

The loss l() can be binary cross entropy (BCE), hinge loss, or negative log likelihood (NLL).
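As a rough illustration of the NLL option, the sketch below uses one common form: the negative log of the softmax probability assigned to the positive document among the positive and its negatives. The exact formulation used in the paper is not restated here, so this form and the function name `nll_loss` are assumptions.

```python
import math
from typing import Sequence


def nll_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log softmax probability of the positive document.

    Scores are retrieval scores f(q, d); a higher score means more relevant.
    """
    all_scores = [pos_score, *neg_scores]
    m = max(all_scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return log_z - pos_score
```

With this form, a negative that scores far below the positive contributes almost nothing to the loss, while a negative that scores close to the positive produces a large loss and therefore a useful training signal.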
A unique challenge in dense retrieval, which targets first-stage retrieval, is that the irrelevant documents to separate come from the entire corpus. This often leads to millions of negative instances, so only a sampled subset D̂- of the negatives can be used in training:

θ* = argmin_θ Σ_q Σ_{d+ ∈ D+} Σ_{d- ∈ D̂-} l(f(q, d+), f(q, d-))

2. ANCE

2.1. Top Retrieved Documents by ANN Index

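Approximate nearest neighbor Negative Contrastive Learning (ANCE) selects negatives from the entire corpus using an asynchronously updated ANN index. ANCE samples negatives from the top documents retrieved by the DR model from the ANN index:

θ* = argmin_θ Σ_q Σ_{d+ ∈ D+} Σ_{d- ∈ D-ANCE} l(f(q, d+), f(q, d-))

where:

D-ANCE = ANN_f(q,d) \ D+

and ANN_f(q,d) denotes the top documents retrieved by f() from the ANN index.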

By definition, the above D-ANCE are the hardest negatives for the current DR model.
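The selection of D-ANCE can be sketched as "retrieve the top documents, then drop the known positives". In the sketch below an exact brute-force search stands in for the ANN index, and the names `top_k` and `ance_negatives` as well as the cutoff `k` are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Exact top-k by dot product; a stand-in for an ANN index lookup."""
    ranked = sorted(doc_vecs,
                    key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]),
                    reverse=True)
    return ranked[:k]


def ance_negatives(query_vec: Sequence[float],
                   doc_vecs: Dict[str, Sequence[float]],
                   positives: Set[str],
                   k: int) -> List[str]:
    """Top retrieved documents minus the relevant ones (ANN results without D+)."""
    return [d for d in top_k(query_vec, doc_vecs, k) if d not in positives]
```

For example, with documents `d1` to `d4` where `d1` is the labeled positive and `d2`, `d4` are its closest neighbors in the embedding space, the function returns `d2` and `d4`: the documents the current model is most likely to confuse with the positive.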

2.2. Model

BERT-Siamese is used, with shared encoder weights between q and d and negative log likelihood (NLL) loss.

2.3. Asynchronous Index Refresh

An asynchronous index refresh approach is used: the ANN index is refreshed once every m batches, i.e., rebuilt with checkpoint fk, while in parallel the Trainer continues its stochastic learning using D-fk-1 from ANN_fk-1.
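The staleness this introduces can be made concrete with a small sequential simulation. It is a sketch under a simplifying assumption that is not taken from the paper: rebuilding the index from a checkpoint takes exactly m batches, so the index started at batch k·m becomes usable at batch (k+1)·m. The function name `refresh_schedule` is hypothetical.

```python
from typing import List, Tuple


def refresh_schedule(num_batches: int, m: int) -> List[Tuple[int, int]]:
    """For each batch, report which checkpoint's index supplies the negatives.

    Assumes checkpoint f_k is taken at batch k * m and that rebuilding the
    index from it takes exactly m batches (a simplification for illustration).
    """
    if m <= 0:
        raise ValueError("m must be positive")
    schedule = []
    for batch in range(num_batches):
        latest_checkpoint = batch // m  # index of the most recent checkpoint f_k
        # The index for f_k is still being rebuilt, so the trainer reads f_{k-1}'s.
        index_in_use = max(latest_checkpoint - 1, 0)
        schedule.append((batch, index_in_use))
    return schedule
```

With `num_batches=6` and `m=2`, batches 0 to 3 draw negatives from the initial index and batches 4 and 5 from the index built with checkpoint 1, so the negatives always lag the current model by roughly one refresh interval.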