Brief Review — SQuAD: 100,000+ Questions for Machine Comprehension of Text

Question Answering (QA) Datasets: SQuAD 1.0 and SQuAD 1.1

Sik-Ho Tsang
4 min read · Oct 1


Stanford NLP

SQuAD: 100,000+ Questions for Machine Comprehension of Text, by Stanford University
2016 EMNLP, Over 6700 Citations (Sik-Ho Tsang @ Medium)

Question Answering
==== My Other Paper Readings Are Also Over Here ====

  • Stanford Question Answering Dataset (SQuAD) is proposed, which is a dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
  • Exact Match (EM) score and F1 score are used as metrics to evaluate Question Answering (QA) performance.
  • A strong logistic regression model achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher.
  • Later, SQuAD 2.0 was proposed and published at ACL 2018.
  • (SQuAD is one of the popular LLM downstream tasks. Yet, I had never gone deep into how models are evaluated on it. Reading this paper helped me better understand what metrics are used for QA dataset evaluation.)


  1. Stanford Question Answering Dataset (SQuAD)
  2. Logistic Regression Model
  3. Metrics & Results

1. Stanford Question Answering Dataset (SQuAD)

1.1. Datasets

Left: Example, Right: Crowd-facing Interface

Reading Comprehension (RC), or the ability to read text and then answer questions about it, is a challenging task for machines, requiring both understanding of natural language and knowledge about the world.

  • Questions are posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.
  • From each of these articles, individual paragraphs are extracted. Images, figures, and tables are stripped away, and paragraphs shorter than 500 characters are discarded.

Stanford Question Answering Dataset v1.0 (SQuAD) contains 107,785 question-answer pairs on 23,215 paragraphs from 536 articles, covering a wide range of topics, from musical celebrities to abstract concepts. It does not provide a list of answer choices for each question.

  • The articles are randomly split into a training set (80%), a development set (10%), and a test set (10%).
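The article-level 80/10/10 split described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual splitting code; splitting by article (rather than by question) keeps all paragraphs of one article in the same split.

```python
import random

def split_articles(article_ids, seed=42):
    """Randomly split article IDs into train (80%), dev (10%),
    and test (10%) sets, at the article level."""
    ids = list(article_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(0.8 * n)
    n_dev = int(0.1 * n)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test
```

With SQuAD's 536 articles, this yields roughly 428 training, 53 development, and 55 test articles.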

1.2. Dataset Comparisons & Statistics

Dataset Comparisons & Statistics
  • Table 1: Existing datasets for RC have one of two shortcomings:
  • (i) those that are high in quality (Richardson et al., 2013; Berant et al., 2014) are too small for training modern data-intensive models, while
  • (ii) those that are large (Hermann et al., 2015; Hill et al., 2015) are semi-synthetic and do not share the same characteristics as explicit reading comprehension questions; later analysis found that they require less reasoning than previously thought and that performance on them is almost saturated.
  • Table 2: We can see that dates and other numbers make up 19.8% of the data; 32.6% of the answers are proper nouns of three different types; 31.8% are common noun phrase answers; and the remaining 15.8% are made up of adjective phrases, verb phrases, clauses, and other types.

2. Logistic Regression Model

Features Used in Logistic Regression Model
  • A number of handcrafted features, as listed above, are extracted and fed into the logistic regression model for training, producing baseline results that show the dataset is challenging.
  • (Please read the paper directly for the details of each feature.)
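To make the scoring scheme concrete, here is a minimal sketch of scoring a candidate answer span with a logistic regression model. The two features and all weights below are made up for illustration; the paper's actual feature set (lexicalized features, dependency tree path features, etc.) is much richer.

```python
import math

def span_features(question, sentence, span):
    """Two illustrative features (hypothetical, not the paper's set):
    lexical overlap between question and the span's sentence,
    and the span length in tokens."""
    q_words = set(question.lower().split())
    s_words = set(sentence.lower().split())
    overlap = len(q_words & s_words) / max(len(q_words), 1)
    return [overlap, len(span.split())]

def score_span(features, weights, bias=0.0):
    """Logistic regression score: sigmoid of a weighted feature sum."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

At prediction time, the model scores every candidate span in the passage and returns the argmax; training fits the weights on the labeled question-answer pairs.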

3. Metrics & Results

3.1. Metrics

Exact match: This metric measures the percentage of predictions that match any one of the ground truth answers exactly.

(Macro-averaged) F1 score: This metric measures the average overlap between the prediction and ground truth answer. The prediction and ground truth are treated as bags of tokens, and their F1 is computed. The maximum F1 over all of the ground truth answers is taken for a given question, and then averaged over all of the questions.

  • (Reading the evaluation code is the clearest way to see exactly how these metrics are computed.)
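The two metrics can be sketched in a few lines. This follows the spirit of the official SQuAD evaluation script (answers are normalized by lowercasing, stripping punctuation and articles, and collapsing whitespace before comparison), but it is a simplified re-implementation, not the official code:

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truths):
    """1.0 if the normalized prediction equals ANY ground-truth answer."""
    return float(any(normalize_answer(prediction) == normalize_answer(gt)
                     for gt in ground_truths))

def f1_score(prediction, ground_truths):
    """Max token-overlap F1 of the prediction over all ground truths."""
    def single_f1(pred, gt):
        pred_toks = normalize_answer(pred).split()
        gt_toks = normalize_answer(gt).split()
        common = Counter(pred_toks) & Counter(gt_toks)  # multiset overlap
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_toks)
        recall = num_same / len(gt_toks)
        return 2 * precision * recall / (precision + recall)
    return max(single_f1(prediction, gt) for gt in ground_truths)
```

Both per-question scores are then averaged over the whole dataset (and multiplied by 100 to report percentages such as 51.0% F1).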

3.2. Results


  • Table 5: The logistic regression model significantly outperforms the baselines, but underperforms humans.

  • Table 6: By removing one group of features from the proposed model at a time, it is found that lexicalized and dependency tree path features are most important.
  • Table 7: Model performs best on dates and other numbers. The model is challenged more on other named entities.
  • Figure 5: The more syntactic divergence between the question and answer sentence there is, the lower the performance of the logistic regression model. Interestingly, humans do not seem to be sensitive to syntactic divergence.


