Brief Review — DPR: Dense Passage Retrieval for Open-Domain Question Answering

Dense Passage Retriever (DPR)

Sik-Ho Tsang
5 min read · Dec 20, 2023
(Image from PyTorch Forum: Hari_Krishnan)

Dense Passage Retrieval for Open-Domain Question Answering
Dense Passage Retriever (DPR), by Facebook AI, University of Washington, and Princeton University
2020 EMNLP, Over 1900 Citations (Sik-Ho Tsang @ Medium)

Dense Text Retrieval
2019 [Sentence-BERT (SBERT)] 2020 [Retrieval-Augmented Generation (RAG)]

  • Retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
  • When evaluated on a wide range of open-domain QA datasets, the dense retriever outperforms a strong Lucene-BM25 system by 9%-19% absolute in top-20 passage retrieval accuracy.


  1. Dense Passage Retriever (DPR)
  2. Results

1. Dense Passage Retriever (DPR)

1.1. Goal

Given a collection of M text passages, the goal of the dense passage retriever (DPR) is to index all the passages in a low-dimensional, continuous space, so that it can efficiently retrieve the top k passages relevant to the input question for the reader at run-time.

  • M can be very large, e.g., 21M passages, and k is usually small, e.g., 20–100.

1.2. Overview

Dense passage retriever (DPR) uses a dense encoder EP(.) that maps any text passage to a d-dimensional real-valued vector and builds an index over all M passages.

At run-time, DPR applies a different encoder EQ(.) that maps the input question to a d-dimensional vector, and retrieves the k passages whose vectors are closest to the question vector.

  • The similarity between the question and the passage is defined as the dot product of their vectors: sim(q, p) = EQ(q)ᵀ EP(p).
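As a small illustration of the dot-product similarity, here is a sketch with toy vectors. The 4-dimensional vectors are made-up stand-ins for the outputs of EQ(.) and EP(.); no real encoder is involved.

```python
import numpy as np

def sim(q_vec: np.ndarray, p_vec: np.ndarray) -> float:
    """Dot-product similarity between a question vector and a passage vector."""
    return float(np.dot(q_vec, p_vec))

# Toy vectors standing in for EQ(q) and EP(p) (assumption: d = 4 for readability).
q = np.array([0.5, 1.0, 0.0, -1.0])
p_close = np.array([1.0, 2.0, 0.0, -1.0])
p_far = np.array([-1.0, 0.0, 3.0, 1.0])

score_close = sim(q, p_close)  # 0.5 + 2.0 + 0.0 + 1.0 = 3.5
score_far = sim(q, p_far)      # -0.5 + 0.0 + 0.0 - 1.0 = -1.5
print(score_close, score_far)
```

A higher score means the passage is ranked as more relevant to the question.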

1.3. Encoders

  • Two independent BERT networks are used, one for questions and one for passages. The representation at the [CLS] token is taken as the output, so d=768.
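To get a feel for what d=768 means at this scale, the arithmetic below estimates the raw size of the passage vectors. It assumes a flat, uncompressed index with 32-bit floats and uses the rounded figure of 21M passages from Section 1.1; both are assumptions for illustration, not details taken from the paper.

```python
# Rough size of the passage vectors, assuming uncompressed float32 storage.
M = 21_000_000        # approximate passage count (rounded, from Section 1.1)
d = 768               # embedding dimension of the [CLS] representation
bytes_per_float = 4   # assumption: 32-bit floats

index_bytes = M * d * bytes_per_float
index_gb = index_bytes / 1e9   # decimal gigabytes
print(f"{index_gb:.1f} GB")
```

Under these assumptions the vectors alone take tens of gigabytes, which is why an efficient similarity-search library matters at inference time.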

1.4. Inference

  • The passage encoder EP is applied to all the passages, which are then indexed using FAISS, an extremely efficient, open-source library for similarity search.

Given a question q at run-time, its embedding vq = EQ(q) is computed, and the top k passages with embeddings closest to vq are retrieved.
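The retrieval step can be sketched as a brute-force maximum inner product search. This is a NumPy stand-in, not FAISS itself, and the random unit-length vectors are toy data rather than real encoder outputs; the function name `top_k_passages` is hypothetical.

```python
import numpy as np

def top_k_passages(question_vec, passage_matrix, k):
    """Return (indices, scores) of the k passages with the largest dot product."""
    scores = passage_matrix @ question_vec            # shape (M,)
    k = min(k, scores.shape[0])
    candidates = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    order = np.argsort(-scores[candidates])           # sort those k by score
    top = candidates[order]
    return top, scores[top]

# Toy index: M = 1000 passages, d = 16 (assumption: random unit-length vectors).
rng = np.random.default_rng(0)
passages = rng.standard_normal((1000, 16))
passages /= np.linalg.norm(passages, axis=1, keepdims=True)

# A toy "question" vector identical to passage 42, so that passage should rank first.
question = passages[42].copy()

idx, scores = top_k_passages(question, passages, k=5)
print(idx, scores)
```

Exhaustive search like this costs one dot product per passage, so at 21M passages a dedicated index is the practical choice.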

1.5. Training

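  • The training data consists of m instances: D = {⟨qi, p+i, p-i,1, ..., p-i,n⟩}, i = 1, ..., m.

Each instance contains one question qi and one relevant (positive) passage p+i, along with n irrelevant (negative) passages p-i,j. The loss function to be minimized is the negative log-likelihood of the positive passage, where sim(·,·) is the dot-product similarity of Section 1.2:

L(qi, p+i, p-i,1, ..., p-i,n) = −log [ exp(sim(qi, p+i)) / ( exp(sim(qi, p+i)) + Σj=1..n exp(sim(qi, p-i,j)) ) ]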

  • In-Batch Negatives: The positive passages of the other questions in the same mini-batch are reused as negatives, which makes training efficient. (Please read the paper directly for more details.)
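The loss with in-batch negatives can be sketched in a few lines of NumPy. This is a minimal forward-pass illustration under the assumption that row i of P is the positive passage for row i of Q and that no extra hard negatives are added; the function name `in_batch_nll` is hypothetical and no gradients are computed.

```python
import numpy as np

def in_batch_nll(Q: np.ndarray, P: np.ndarray) -> float:
    """Mean negative log-likelihood with in-batch negatives.

    Q: (B, d) question vectors; P: (B, d) positive-passage vectors, where
    P[i] is the positive for Q[i] and every P[j], j != i, acts as a negative.
    """
    S = Q @ P.T                                   # (B, B) similarity matrix
    S = S - S.max(axis=1, keepdims=True)          # stabilise the softmax
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # positives sit on the diagonal

P = np.eye(3)                      # three toy passage vectors
loss_good = in_batch_nll(10.0 * np.eye(3), P)    # questions aligned with their positives
loss_uninformed = in_batch_nll(np.zeros((3, 3)), P)  # all similarities equal
print(loss_good, loss_uninformed)
```

A single B×B matrix product yields B−1 negatives per question, which is where the efficiency comes from. When all similarities are equal the loss is log(B), the value for a uniform guess over the batch.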

2. Results

2.1. Setup

  • The English Wikipedia dump from Dec. 20, 2018 is used.
  • After some preprocessing, each article is split into multiple, disjoint text blocks of 100 words as passages, serving as basic retrieval units, which results in 21,015,324 passages in the end.
  • Each passage is also prepended with the title of the Wikipedia article where the passage is from, along with an [SEP] token.
  • 5 QA datasets are used for evaluation: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1.
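The passage construction described in the setup (disjoint 100-word blocks, each prefixed with the article title and [SEP]) can be sketched as follows. Whitespace word splitting and the literal "[SEP]" string are simplifying assumptions; the actual preprocessing and tokenization in the paper's pipeline may differ, and `split_into_passages` is a hypothetical helper.

```python
def split_into_passages(title: str, text: str, block_size: int = 100) -> list[str]:
    """Split an article into disjoint blocks of `block_size` words, each
    prefixed with the article title and a [SEP] token."""
    words = text.split()
    return [
        f"{title} [SEP] {' '.join(words[i:i + block_size])}"
        for i in range(0, len(words), block_size)
    ]

# Toy 250-word article: yields two full blocks and one 50-word remainder.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages("Toy Title", article)
print(len(passages))
print(passages[2][:40])
```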

2.2. Passage Retrieval Results


For the top-k accuracy (k=20 and 100), with the exception of SQuAD, DPR performs consistently better than BM25 on all datasets.

  • The gap is especially large when k is small (e.g., 78.4% vs. 59.1% for top-20 accuracy on Natural Questions).

A dense passage retriever trained using only 1,000 examples already outperforms BM25.
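Top-k retrieval accuracy is the fraction of questions for which at least one of the top k retrieved passages contains the answer. A minimal sketch is given below; case-insensitive substring matching is a simplification, and the data and the `top_k_accuracy` helper are made up for illustration.

```python
def top_k_accuracy(retrieved, answers, k):
    """Fraction of questions whose top-k passages contain at least one answer string."""
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top = passages[:k]
        if any(g.lower() in p.lower() for p in top for g in golds):
            hits += 1
    return hits / len(retrieved)

# Two toy questions, each with a ranked list of retrieved passages.
retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["The Nile flows through Egypt.", "The Amazon is in South America."],
]
answers = [["Paris"], ["Amazon"]]

acc_at_1 = top_k_accuracy(retrieved, answers, k=1)
acc_at_2 = top_k_accuracy(retrieved, answers, k=2)
print(acc_at_1, acc_at_2)
```

Accuracy can only stay the same or rise as k grows, which is why gaps between retrievers tend to be most visible at small k.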

Effect of in-batch (IB) negative training
  • In-batch negative training is an easy and memory-efficient way to reuse the negative examples already in the batch rather than creating new ones. It produces more pairs and thus increases the number of training examples, which might contribute to the good model performance.

As a result, accuracy consistently improves as the batch size grows.

2.3. Question Answering Results

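  • Given the top k retrieved passages, a reader assigns a passage selection score to each passage and extracts an answer span from it; the best span from the passage with the highest selection score is the final answer.
  • Specifically, let Pi of size L×h be a BERT (base, uncased in the experiments) representation for the i-th passage, where L is the maximum length of the passage and h is the hidden dimension.
  • The probabilities of a token being the starting/ending position of an answer span, and of a passage being selected, are defined as:
      P_start,i(s) = softmax(Pi w_start)_s
      P_end,i(t) = softmax(Pi w_end)_t
      P_selected(i) = softmax(P̂ᵀ w_selected)_i
    where P̂ = [P1[CLS], ..., Pk[CLS]] is of size h×k, and w_start, w_end, w_selected are learnable vectors of size h.
  • The span score of the s-th to t-th words from the i-th passage is P_start,i(s) × P_end,i(t), and the passage selection score of the i-th passage is P_selected(i).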
  • During training, 1 positive and m̃−1 negative passages are sampled from the top 100 passages returned by the retrieval system (BM25 or DPR) for each question, with m̃=24.
  • The training objective is to maximize the marginal log-likelihood of all the correct answer spans in the positive passage (the answer string may appear multiple times in one passage), combined with the log-likelihood of the positive passage being selected.
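The reader's scoring at inference time can be sketched as below: a softmax over token positions for start and end, a span score given by their product, and a softmax over the passages' [CLS] vectors for selection. The weight vectors `w_start`, `w_end`, `w_selected` follow the paper's naming, but the toy values, the `max_len` cap on span length, and the helper names are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def best_span(P_i, w_start, w_end, max_len=10):
    """Return (s, t, score) maximising P_start(s) * P_end(t) with s <= t < s + max_len."""
    p_start = softmax(P_i @ w_start)   # (L,)
    p_end = softmax(P_i @ w_end)       # (L,)
    best = (0, 0, -1.0)
    for s in range(len(p_start)):
        for t in range(s, min(s + max_len, len(p_end))):
            score = p_start[s] * p_end[t]
            if score > best[2]:
                best = (s, t, float(score))
    return best

def select_passage(cls_vectors, w_selected):
    """Passage-selection probabilities; cls_vectors is (k, h), one [CLS] row per passage."""
    return softmax(cls_vectors @ w_selected)

# Toy passage representation: L = 5 tokens, h = 3 hidden units.
P_i = np.array([[0, 0, 0], [5, 0, 0], [0, 0, 0], [0, 5, 0], [0, 0, 0]], dtype=float)
s, t, score = best_span(P_i, np.array([1.0, 0, 0]), np.array([0, 1.0, 0]))

# Two toy passages; the second has the larger selection logit.
probs = select_passage(np.array([[0, 0, 1.0], [0, 0, 3.0]]), np.array([0, 0, 1.0]))
print((s, t, score), probs)
```

In this toy setup the start distribution peaks at token 1 and the end distribution at token 3, so the best span is (1, 3), and the second passage is selected.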

For large datasets like NQ and TriviaQA, models trained using multiple datasets (Multi) perform comparably to those trained using the individual training set (Single). Conversely, on smaller datasets like WQ and TREC, the multi-dataset setting has a clear advantage.


