Brief Review — What Are People Asking About COVID-19? A Question Classification Dataset

COVID-Q, COVID-19 Question Dataset

Sik-Ho Tsang
Apr 9, 2024

What Are People Asking About COVID-19? A Question Classification Dataset, by ProtagoLabs, International Monetary Fund, Dartmouth College
2020 ACL Workshop NLP-COVID, Over 40 Citations (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017 … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology]

  • COVID-Q is proposed: a dataset of 1,690 questions about COVID-19 from 13 sources, annotated into 15 question categories and 207 question clusters.


  1. COVID-Q
  2. Benchmarking Results


1. COVID-Q

1.1. Data Collection

In May 2020, the authors scraped questions about COVID-19 from 13 sources. The number of questions collected from each source is given in Table 1 of the paper.

1.2. Dataset Cleansing

  • First, the authors deleted questions unrelated to COVID and vague questions with too many interpretations (e.g., “Why COVID?”).
  • Second, they removed location-specific and time-specific versions of questions (e.g., “COVID deaths in New York”). Questions that only targeted one location or time, however, were not removed.
  • Finally, they removed all punctuation and replaced synonymous ways of saying COVID, such as “coronavirus” and “COVID-19”, with “covid”.
  • The number of questions removed from each source is also reported in the same table.
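
The final normalization step can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the synonym list beyond “coronavirus” and “COVID-19”, the lowercasing, and the name normalize_question are all assumptions.

```python
import re
import string

# Assumed synonym list: the post names only "coronavirus" and "COVID-19".
# Longer variants come first so "covid-19" is matched as a whole.
COVID_SYNONYMS = ["covid-19", "covid19", "coronavirus", "covid"]
_SYNONYM_RE = re.compile(
    "|".join(re.escape(s) for s in COVID_SYNONYMS), re.IGNORECASE
)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_question(text: str) -> str:
    """Replace COVID synonyms with 'covid', then strip ASCII punctuation."""
    # Lowercasing is an assumption; the post does not mention casing.
    text = _SYNONYM_RE.sub("covid", text.lower())
    text = text.translate(_PUNCT_TABLE)
    # Collapse any whitespace left behind by removed punctuation.
    return " ".join(text.split())
```

Synonyms are replaced before punctuation is stripped so that the hyphen in “COVID-19” is still there to match on.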

1.3. Data Annotation

Distribution of question clusters

The authors first annotated the dataset by grouping questions that ask the same thing into question clusters.

Question Categories

Each question cluster was assigned to one of 15 question categories.

  • (There are multiple authors and also Mechanical Turk workers to validate the annotation quality. Please feel free to read the paper for more details.)

2. Benchmarking Results

2.1. Classification

Data Split for Classification
  • For the train-test split, 20 questions are randomly chosen per category for training (as the smallest category has 26 questions), with the remaining questions going into the test set (see Table 3).
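
A minimal sketch of such a fixed-size per-category split is shown here. The function name, the (question, category) input format, and the seed are assumptions for illustration only.

```python
import random
from collections import defaultdict


def split_per_category(examples, n_train=20, seed=0):
    """Split (question, category) pairs into train and test lists.

    n_train questions are sampled per category for training and the
    rest of that category goes to the test set.
    """
    by_category = defaultdict(list)
    for question, category in examples:
        by_category[category].append(question)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for category, questions in sorted(by_category.items()):
        shuffled = questions[:]
        rng.shuffle(shuffled)
        train += [(q, category) for q in shuffled[:n_train]]
        test += [(q, category) for q in shuffled[n_train:]]
    return train, test
```

With 20 training questions per category, a category of 26 questions contributes only 6 questions to the test set.
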
Classification Performance
  • Questions are represented with BERT features, and 2 classifiers are evaluated on them: (1) an SVM and (2) cosine-similarity-based k-nearest neighbor classification (k-NN) with k = 1.

As shown in Table 4, the SVM marginally outperforms k-NN on both the real and generated evaluation sets.
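
To make the k-NN baseline concrete, here is a small sketch of cosine-similarity 1-nearest-neighbor classification. It assumes the questions have already been turned into embedding vectors (e.g., by BERT); the function names are hypothetical, and this is not the authors' code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def predict_1nn(query_vec, train_vecs, train_labels):
    """Return the label of the most cosine-similar training vector (k = 1)."""
    best_index = max(
        range(len(train_vecs)),
        key=lambda i: cosine_similarity(query_vec, train_vecs[i]),
    )
    return train_labels[best_index]
```

With k = 1, a test question simply takes the category of its closest training question.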

2.2. Clustering

  • A 70%–30% train–test split by class is used.
  • In addition to the k-NN baseline, a simple model is evaluated that uses a triplet loss function to train a 2-layer neural net on BERT features.
  • The baseline models use thresholding to determine whether a question is already in the database or is novel.
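
The thresholding idea can be sketched as follows. This is an illustration under stated assumptions: the similarity measure (cosine), the threshold value of 0.8, and the name assign_cluster are not taken from the post.

```python
import math


def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_cluster(query_vec, db_vecs, db_clusters, threshold=0.8):
    """Return the nearest question's cluster, or None if the query looks novel.

    threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    best_sim, best_cluster = -1.0, None
    for vec, cluster in zip(db_vecs, db_clusters):
        sim = _cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_cluster = sim, cluster
    return best_cluster if best_sim >= threshold else None
```

A query whose best match falls below the threshold is treated as a novel question rather than forced into an existing cluster.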

The triplet-loss model obtains better performance than the k-NN baseline.
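
For readers unfamiliar with triplet loss, here is the standard formulation as a short sketch. The Euclidean distance and the margin of 1.0 are assumptions; the post does not state which distance or margin the authors used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive would be embeddings of questions from the same
    cluster; negative would come from a different cluster.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so training pulls same-cluster questions together and pushes different-cluster questions apart.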


