Brief Review — SQuAD 2.0: Know What You Don’t Know: Unanswerable Questions for SQuAD

SQuAD 2.0 Dataset is Proposed, Including Unanswerable Questions

Sik-Ho Tsang
Oct 8, 2023
Two unanswerable questions written by crowdworkers, along with plausible (but incorrect) answers. Relevant keywords are shown in blue.

Know What You Don’t Know: Unanswerable Questions for SQuAD
SQuAD 2.0, by Stanford University
2018 ACL, Over 2200 Citations (Sik-Ho Tsang @ Medium)

Question Answering (QA)
2016 [SQuAD 1.0/1.1]

  • Existing reading comprehension systems tend to make unreliable guesses on questions for which the correct answer is not stated in the context.
  • SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
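The answer-or-abstain decision described above can be sketched as a simple threshold rule. This is only an illustration under stated assumptions, not the method of any specific model: the `SpanPrediction` class, its score fields, and the `threshold` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanPrediction:
    """Hypothetical model output for one question (illustrative names)."""
    text: str               # best answer span found in the paragraph
    span_score: float       # model score for that span
    no_answer_score: float  # model score for "no answer is supported"


def answer_or_abstain(pred: SpanPrediction, threshold: float = 0.0) -> Optional[str]:
    """Return the span text, or None to abstain.

    Abstains when the no-answer score exceeds the span score by more
    than `threshold` (an assumed hyperparameter, typically tuned on a
    development set).
    """
    if pred.no_answer_score - pred.span_score > threshold:
        return None
    return pred.text


confident = SpanPrediction(text="in 1990", span_score=5.0, no_answer_score=1.0)
doubtful = SpanPrediction(text="in 1990", span_score=1.0, no_answer_score=5.0)
print(answer_or_abstain(confident))  # in 1990
print(answer_or_abstain(doubtful))   # None
```

Raising the threshold makes the system answer more often; lowering it makes the system abstain more often.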

Outline

  1. SQuAD 2.0
  2. Results

1. SQuAD 2.0

  • For each paragraph in the article, workers were asked to pose up to five questions that were impossible to answer based on the paragraph alone, while referencing entities in the paragraph and ensuring that a plausible answer is present.
  • As inspiration, questions from SQuAD 1.1 were also shown for each paragraph; this further encouraged unanswerable questions to look similar to answerable ones. Workers were asked to spend 7 minutes per paragraph and were paid $10.50 per hour.
Statistics
  • The development and test splits have a roughly one-to-one ratio of answerable to unanswerable questions, whereas the training data has roughly twice as many answerable questions as unanswerable ones.
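The answerable-to-unanswerable ratio of a split can be checked with a short counting script. The sketch below assumes the nested layout of the publicly released SQuAD 2.0 JSON files (`data`, `paragraphs`, `qas`, and a boolean `is_impossible` flag per question), which this review does not describe. The toy record is invented for illustration and is not taken from the dataset.

```python
from collections import Counter


def count_answerability(squad_json: dict) -> Counter:
    """Count answerable vs. unanswerable questions in a SQuAD-style dict."""
    counts = Counter()
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD 1.1-style records lack the flag; treat them as answerable.
                if qa.get("is_impossible", False):
                    counts["unanswerable"] += 1
                else:
                    counts["answerable"] += 1
    return counts


# Invented toy data in the assumed format (not real SQuAD content).
toy = {
    "data": [{
        "title": "Toy article",
        "paragraphs": [{
            "context": "Alice founded the club in 1990. Bob joined in 1995.",
            "qas": [
                {"id": "q1", "question": "Who founded the club?",
                 "answers": [{"text": "Alice", "answer_start": 0}],
                 "is_impossible": False},
                {"id": "q2", "question": "When did Bob join?",
                 "answers": [{"text": "1995", "answer_start": 46}],
                 "is_impossible": False},
                {"id": "q3", "question": "Who founded the club in 1995?",
                 "answers": [],
                 "plausible_answers": [{"text": "Alice", "answer_start": 0}],
                 "is_impossible": True},
            ],
        }],
    }]
}

counts = count_answerability(toy)
print(counts["answerable"], counts["unanswerable"])  # 2 1
```

The toy record mirrors the training-split proportion described above, with twice as many answerable questions as unanswerable ones.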
Types of negative examples
  • The authors manually inspected 100 randomly chosen negative examples from the development set to understand the challenges these examples present.

2. Results

EM and F1 Scores

The best model, DocQA + ELMo, achieves only 66.3 F1 on the test set, 23.2 points lower than the human performance of 89.5 F1.

  • Note that a baseline that always abstains gets 48.9 test F1; existing models are closer to this baseline than they are to human performance.
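To see why an always-abstain baseline lands near 50 F1 when a split is roughly half unanswerable, consider a simplified token-level F1 that treats the empty string as "no answer". This is a sketch, not the official evaluation script: it skips answer normalization (punctuation, articles) and multiple reference answers, and the gold answers below are invented.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1; an empty string stands for 'no answer'."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is "no answer", score 1 only when both are.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def average_f1(predictions: list, golds: list) -> float:
    """Mean F1 over all questions, on a 0-100 scale."""
    total = sum(token_f1(p, g) for p, g in zip(predictions, golds))
    return 100.0 * total / len(golds)


# Invented gold answers: two answerable, two unanswerable ("").
golds = ["in 1990", "", "Bob", ""]
always_abstain = [""] * len(golds)
print(average_f1(always_abstain, golds))  # 50.0
```

Abstaining earns full credit on every unanswerable question and zero on every answerable one, so the baseline's score equals the percentage of unanswerable questions in the split.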
EM and F1 Scores

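The same models are also evaluated on two variants of SQuAD 1.1 augmented with automatically generated negative examples (TFIDF and RuleBased). The highest score on SQuAD 2.0 is 15.4 F1 points lower than the highest score on either of these two datasets, suggesting that automatically generated negative examples are much easier for existing models to detect.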

