Brief Review — PubMedQA: A Dataset for Biomedical Research Question Answering

PubMedQA: Yes/No/Maybe QA Dataset

Sik-Ho Tsang
5 min readOct 29, 2023
An instance (Sakamoto et al., 2011) of PubMedQA dataset: Question is the original question title; Context includes the structured abstract except its conclusive part, which serves as the Long Answer; Human experts annotated the Answer yes.

PubMedQA: A Dataset for Biomedical Research Question Answering
PubMedQA, by University of Pittsburgh, Carnegie Mellon University, Google AI,
2019 EMNLP-IJCNLP, Over 230 Citations (Sik-Ho Tsang @ Medium)

Medical NLP
==== My Other Paper Readings Are Also Over Here ====

  • PubMedQA, a novel biomedical question answering (QA) dataset, is introduced, which is collected from PubMed abstracts.
  • The task of PubMedQA is to answer research questions with yes/no/maybe using the corresponding abstracts.
  • PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances.
  • (This is a dataset evaluated by Med-PaLM.)


  1. PubMedQA
  2. BioBERT Fine-Tuning
  3. Results

1. PubMedQA

3 PubMedQA Subsets
  • PubMedQA is split into 3 subsets: labeled, unlabeled and artificially generated. They are denoted as PQA-L(abeled), PQA-U(nlabeled) and PQA-A(rtificial), respectively, as above.
  • Collection of PQA-L and PQA-U: PubMed articles which have i) a question mark in the titles and ii) a structured abstract with conclusive part are collected and denoted as pre-PQA-U.

1.1. PQA-L

  • Two annotators labeled 1k instances from pre-PQA-U with yes/no/maybe to build PQA-L.
  • Reasoning-free setting: The annotator 1 doesn’t need to do much reasoning to annotate since the long answer is available.
  • Reasoning-required setting: Annotator 2 cannot use the long answer, so reasoning over the context is required for annotation.

1.2. PQA-U

  • The unlabeled instances in pre-PQA-U with yes/no/maybe answerable questions are included to build PQA-U.

1.3. PQA-A

PQA-A Examples
  • A simple heuristic is used to collect many noisily-labeled instances to build PQA-A for pretraining.
  • 2 examples are shown above.

1.4. PubMedQA Statistics

PubMedQA Statistics
  • Figure 3: PubMed abstracts are manually annotated by medical librarians with Medical Subject Headings (MeSH). MeSH terms are used to represent abstract topics, and their distribution is visualized.
  • Most (57.5%) involve comparing multiple groups (e.g.: experiment and control), and others require interpreting statistics of a single group or its subgroups. Reasoning over quantitative contents is required in nearly all (96.5%) of them.
  • Figure 4: The Sankey diagram is used to show the proportional relationships.
PubMedQA Summary

2. BioBERT Fine-Tuning

2.1. Loss Function

  • With regularizing neural machine translation models with binary bag-of-word (BoW) statistics, BioBERT is fine-tuned with an auxiliary task of predicting the binary BoW statistics of the long answers:
  • where bi and ^bi are ground-truth and predicted probability of whether token i is in the long answers. N is the BoW vocabulary size.
  • The total loss is:

2.2. Multi-Phase Fine-Tuning Architecture

Conceptually, two BioBERTs are trained. One is trained to generate pseudo-labels for PQA-U, which can enlarge the training set. Another one is trained using both labeled and pseudo-labelled dataset for later evaluation.

  • (Please skip the details below for quick reading.)
Multi-Phase Fine-Tuning Architecture
  • q, c, a, l are used to denote question, context, long answer and yes/no/maybe label of instances, respectively.

2.2.1. Phase I Fine-tuning on PQA-A

  • Pretrained BioBERT is initialized with θ0, and fine-tuned on PQA-A using question and context as input:

2.2.2. Phase II Fine-tuning on Bootstrapped PQA-U

  • To fully utilize the unlabeled instances in PQA-U, pseudo-labeling with a bootstrapping is strategy used. BioBERT is initialized with θ0, and fine-tuned on PQA-A using question and long answer (reasoning-free):
  • Then further fine-tune BioBERTθB1 on PQA-L:
  • Pseudo-label PQA-U instances using the most confident predictions of BioBERTθB2 for each class, are used:
  • Next, BioBERTθI is fine-tuned on the bootstrapped PQA-U using question and context (under reasoning-required setting):

2.2.3. Final Phase Fine-tuning on PQA-L

  • In the final phase, BioBERTθII is fine-tuned on PQA-L:
  • Final predictions on instances of PQA-L validation and test sets are made using BioBERTθF:

3. Results

3.1. Human Baseline

Human Baseline

3.2. PQA-L Test Set

PQA-L Test Set

In general, multi-phase fine-tuning of BioBERT with additional supervision outperforms other baselines by large margins, but the results are still much worse than just single-human performance.

  • (Please read the paper directly for more experimental results.)



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.