Brief Review — QuAC : Question Answering in Context

QuAC, A Question Answering Dataset is Proposed

Sik-Ho Tsang
5 min read · Jun 6, 2024
An example dialog about a Wikipedia section

QuAC : Question Answering in Context, by Allen Institute for Artificial Intelligence, University of Washington, Stanford University, and UMass Amherst
2018 EMNLP, Over 850 Citations (Sik-Ho Tsang @ Medium)

Question Answering (QA)
2016 [SQuAD 1.0/1.1] 2017 [Dynamic Coattention Network (DCN)] 2018 [SQuAD 2.0]

  • QuAC, a dataset for Question Answering in Context, is proposed that contains 14K information-seeking QA dialogs (100K questions in total).
  • Each dialog involves two crowd workers:
  1. A student, who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text; and
  2. A teacher, who answers the questions by providing short excerpts from the text.


  1. QuAC Dataset
  2. Models
  3. Results

1. QuAC Dataset

1.1. Interactive Task

  • The task pairs up two workers, a teacher and a student, who discuss a section s (e.g., “Origin & History” in the example figure at the top) from a Wikipedia article about an entity e (Daffy Duck).
  • The student is permitted to see only the section’s title t and the first paragraph of the main article b, while the teacher is additionally provided with full access to the section text.

The task begins with the student formulating a free-text question q.

The teacher is not allowed to answer with free text; instead, they must select a contiguous span of text defined by indices (i, j) into the section text s.

The teacher must also provide the student with a list of dialog acts v, each of which indicates the presence of one of n discrete statements. There are 3 acts: (1) continuation (follow up, maybe follow up, or don’t follow up), (2) affirmation (yes, no, or neither) and (3) answerability (answerable or no answer).

  • After receiving an answer from the teacher, the student asks another question. At every turn, the student has more information about the topic than they did previously, which encourages them to ask follow-up questions about what they have just learned.

The dialog continues until (1) 12 questions are answered, (2) one of the partners decides to end the interaction, or (3) more than 2 unanswerable questions were asked.
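To make the turn structure and the stopping rule concrete, here is a minimal sketch in Python. The names `Turn` and `dialog_should_end` are hypothetical, not from the paper. I also assume that "answered" counts only turns marked answerable, which the description above does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_ANSWERED = 12      # rule (1): stop once 12 questions are answered
MAX_UNANSWERABLE = 2   # rule (3): stop once MORE than 2 questions were unanswerable


@dataclass
class Turn:
    question: str                    # free-text question q from the student
    span: Optional[Tuple[int, int]]  # (i, j) into section text s; None if no answer
    continuation: str                # "follow up", "maybe follow up", "don't follow up"
    affirmation: str                 # "yes", "no", "neither"
    answerable: bool                 # answerability dialog act


def dialog_should_end(turns: List[Turn], partner_quit: bool = False) -> bool:
    """Apply the three stopping conditions to the turns collected so far."""
    # Assumption: only answerable turns count toward the 12-question limit.
    answered = sum(t.answerable for t in turns)
    unanswerable = len(turns) - answered
    return (
        partner_quit                         # rule (2): a partner ended the dialog
        or answered >= MAX_ANSWERED          # rule (1)
        or unanswerable > MAX_UNANSWERABLE   # rule (3)
    )
```

For example, a dialog with one answered question and three unanswerable ones ends under rule (3), while the same dialog with only two unanswerable questions continues.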

1.2. Data Collection

  • Amazon Mechanical Turk (AMT) is used for data collection.
  • Workers are paid according to the number of completed turns in the dialog, which encourages them to have long dialogs with their partners. Dialogs with fewer than three QA pairs are discarded.
  • Some rule-based filtering is also performed.
  • (There are many more details in this part; please read the paper if interested.)
Dataset statistics.
  • The above table shows the dataset statistics.
QuAC expands QA for Dialog Acts
  • The most frequent question types based on “Wh” words are shown.
A successful and a less successful dialog
  • Two more dialog examples are shown above.
Heatmap Analysis
  • The answer to the next question is most frequently either in the same chunk as the previous answer or in an adjacent chunk.
  • The frequency of yes/no questions increases significantly as the dialogs progress.

2. Models

2.1. Baseline

  • Random sentence: This baseline selects a random sentence in the section text s as the answer.
  • Majority: This baseline always outputs the majority answer, which is no answer.
  • Transition Matrix (TM): Divide the supporting text into 12 chunks and use the transition matrix (computed from the training set; Figure 5b in the paper) to select an answer given the position of the previous answer.
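The TM baseline can be sketched as follows. This is only an illustration under my own assumptions: chunks are equal-sized by character offset, and I add two extra states (`NO_ANSWER` and `START`) so that unanswerable questions and the first turn have a row in the matrix. All function names are hypothetical.

```python
from typing import List, Optional

N_CHUNKS = 12
NO_ANSWER = N_CHUNKS       # assumed extra state for "no answer"
START = N_CHUNKS + 1       # assumed extra state before the first question
N_STATES = N_CHUNKS + 2


def chunk_of(answer_start: Optional[int], section_len: int) -> int:
    """Map an answer's start offset to one of 12 equal-sized chunks."""
    if answer_start is None:
        return NO_ANSWER
    return min(N_CHUNKS - 1, answer_start * N_CHUNKS // section_len)


def fit_transitions(dialogs: List[List[int]]) -> List[List[int]]:
    """Count chunk-to-chunk transitions; each dialog is a list of chunk ids."""
    counts = [[0] * N_STATES for _ in range(N_STATES)]
    for chunks in dialogs:
        prev = START
        for c in chunks:
            counts[prev][c] += 1
            prev = c
    return counts


def predict_next(counts: List[List[int]], prev: int) -> int:
    """Pick the most frequent next chunk given the previous answer's chunk."""
    row = counts[prev]
    return max(range(N_STATES), key=lambda c: row[c])


train = [[0, 0, 1], [0, 1, 1, NO_ANSWER]]
counts = fit_transitions(train)
first_guess = predict_next(counts, START)   # chunk for the first question
next_guess = predict_next(counts, 0)        # chunk after an answer in chunk 0
```

Note that the prediction never looks at the question text, which is why this works as a sanity check rather than as a real model.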

2.2. Upper Bounds

  • Gold NA + TM: The same as TM, except that for questions whose gold annotation is no answer, it always outputs no answer.
  • Gold sentence + NA: Output the sentence from s with the maximal F1 with respect to the references, or no answer for unanswerable questions.
  • Human performance: pick one reference as a system output and compute the F1 with respect to the remaining references.
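The human-performance computation can be sketched with a word-level F1. This is a simplified illustration, not the official evaluation script: tokenization is plain lowercased whitespace splitting, and taking the maximum over the remaining references and averaging over the held-out choices are my assumptions. The names `f1` and `human_f1` are hypothetical.

```python
from collections import Counter
from typing import List


def f1(prediction: str, reference: str) -> float:
    """Word-level F1 between two answer strings (simplified tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def human_f1(references: List[str]) -> float:
    """Treat each reference in turn as the 'system output' and score it
    against the remaining references (needs at least two references)."""
    scores = []
    for i, held_out in enumerate(references):
        others = references[:i] + references[i + 1:]
        # Assumption: score against the best-matching remaining reference.
        scores.append(max(f1(held_out, other) for other in others))
    return sum(scores) / len(scores)
```

The same `f1` function can score the Gold sentence + NA upper bound: compute it for every sentence in s and keep the sentence with the highest score.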

2.3. Models

  • Pretrained InferSent: Output the sentence in s whose pretrained InferSent representation (Conneau et al., 2017) is most similar to that of the question.
  • Feature-rich logistic regression: train a logistic regression using Vowpal Wabbit.
  • BiDAF++: use a re-implementation of a top-performing SQuAD model.
  • BiDAF++ w/ k-ctx: modify the passage and question embedding processes to consider the dialog history. In this case, context from the previous k QA pairs is considered.
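One way to expose the previous k answers to a model is to tag each passage token with a marker saying whether it fell inside a recent answer. The sketch below shows only this bookkeeping step, under my own assumptions (token-level spans with exclusive ends, integer marker ids). It is not the authors' implementation, and `history_markers` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def history_markers(
    n_tokens: int,
    prev_spans: List[Optional[Tuple[int, int]]],
    k: int,
) -> List[int]:
    """Return one marker id per passage token.

    0 means the token is in none of the previous k answers;
    m > 0 means it is in the answer given m turns ago.
    prev_spans is ordered oldest first; None stands for "no answer".
    """
    markers = [0] * n_tokens
    recent = prev_spans[-k:] if k > 0 else []
    # Walk from oldest to newest so the most recent answer wins on overlap.
    for age, span in zip(range(len(recent), 0, -1), recent):
        if span is None:
            continue
        start, end = span
        for t in range(start, min(end, n_tokens)):
            markers[t] = age
    return markers


spans = [(0, 2), (1, 3), (4, 6)]
markers_k2 = history_markers(6, spans, k=2)  # only the last two answers are marked
```

In a neural model, these ids would typically be mapped to learned embeddings and concatenated with the word embeddings of the passage; with k = 0 the markers are all zero and the model sees no history.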

3. Results

Experimental results of sanity checks (top), baselines (middle) and upper bounds (bottom)
  • Baselines have poor results.
  • The human upper bound (80.8 F1) demonstrates high agreement.
  • The Gold NA + TM result shows that QuAC cannot be solved by ignoring the question and answer text.
  • Text similarity methods such as bag-of-ngrams overlap and InferSent are largely ineffective.

BiDAF++ models make significant progress, demonstrating that existing models can already capture a significant portion of phenomena.

However, even the proposed best model underperforms humans: the system achieves human equivalence on only 60% of questions and 5% of full dialogs.
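The two human-equivalence numbers can be computed as in the sketch below. I assume, as my reading of the metric, that a question counts as human-equivalent when the system F1 matches or exceeds the human F1, and that a dialog counts only when every question in it does. The function name and input layout are hypothetical.

```python
from typing import List, Tuple


def human_equivalence(
    dialogs: List[List[Tuple[float, float]]],
) -> Tuple[float, float]:
    """dialogs[d][q] = (system_f1, human_f1) for question q of dialog d.

    Returns (fraction of questions, fraction of dialogs) on which the
    system is at least as good as the human reference agreement.
    """
    q_total = q_hit = d_hit = 0
    for dialog in dialogs:
        hits = [system >= human for system, human in dialog]
        q_total += len(hits)
        q_hit += sum(hits)
        d_hit += all(hits)   # a dialog counts only if every question does
    return q_hit / q_total, d_hit / len(dialogs)


scores = [
    [(0.9, 0.8), (0.5, 0.7)],   # second question falls short of human F1
    [(1.0, 1.0), (0.8, 0.6)],   # every question matches or beats human F1
]
heq_q, heq_d = human_equivalence(scores)
```

The dialog-level number is much stricter than the question-level one, since a single weak answer disqualifies the whole dialog. That helps explain the gap between 60% and 5%.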

3.1. Analysis

F1 Score Analysis

In the first plot, human agreement is unchanged throughout the dialog.

The second plot shows that human disagreement increases as the distance between the current answer’s location within the section text and that of the previous answer increases.

In the last plot, human agreement is higher when the answer span is short.


