Brief Review — MedQA: What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

MedQA Dataset, Evaluated by Med-PaLM and Med-PaLM 2

Sik-Ho Tsang
4 min read · Nov 4, 2023
Large Improvement from Med-PaLM to Med-PaLM 2

What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, by Massachusetts Institute of Technology and Huazhong University of Science and Technology
2021 MDPI Appl. Sci., Over 100 Citations (Sik-Ho Tsang @ Medium)

Medical NLP/LLM
2017 [LiveQA] 2018 [Clinical NLP Overview] 2019 [MedicationQA] [G-BERT] 2020 [BioBERT] [BEHRT] 2021 [MedGPT] 2023 [Med-PaLM]

  • The first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, is proposed.
  • It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively.
  • (Med-PaLM and Med-PaLM 2 have also been evaluated on this dataset recently.)


  1. MedQA Dataset
  2. Benchmarking Results

1. MedQA Dataset

  • The dataset is sourced from medical exams, which are designed to examine doctors’ professional capability, and thus contains a significant number of questions that require multi-hop logical reasoning.
  • It is the first publicly available large-scale multiple-choice OpenQA dataset for medical problems.
  • It is cross-lingual, covering English and simplified/traditional Chinese.

1.1. Task

Two Examples
  • The task is defined by its three components:
  • Question: question in text, either in one sentence asking for a certain piece of knowledge, or in a long paragraph starting with a description of the patient condition.
  • Answer candidates: multiple answer options are given for each question, of which only one should be chosen as the most appropriate.
  • Document collection: a collection of text material extracted from a variety of sources and organized into paragraphs, which contains the knowledge and information to help find the answers.
  • This task is to determine the best answer to the question among the candidates, relying on the documents.
  • Each question has 4 options.
  • Two examples are shown above.
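
The three components above can be sketched as a small data structure. This is only an illustration: the class name, field names, and placeholder strings are hypothetical and are not taken from the released MedQA files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MedQAExample:
    """One multiple-choice question (hypothetical field names)."""
    question: str            # one sentence, or a long patient description
    options: Dict[str, str]  # option label -> option text
    answer: str              # label of the single most appropriate option

    def __post_init__(self) -> None:
        # The review states that each question has 4 options.
        if len(self.options) != 4:
            raise ValueError("expected exactly 4 answer options")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the option labels")


# The document collection is modelled here as a plain list of paragraphs.
DocumentCollection = List[str]

# Placeholder content only, not real MedQA data.
example = MedQAExample(
    question="(placeholder question text)",
    options={"A": "option 1", "B": "option 2", "C": "option 3", "D": "option 4"},
    answer="B",
)
documents: DocumentCollection = ["(placeholder paragraph 1)", "(placeholder paragraph 2)"]
```

A model for this task would take `example.question`, `example.options`, and `documents` as input and predict one option label.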

1.2. Data

Dataset Comparison
  • None of the prior related datasets has been formulated as an OpenQA problem.
MedQA Statistics
Document Collection Statistics
  • The questions and their associated answer candidates are collected from the National Medical Board Examinations in the USA (accessed on 10 March 2021), Mainland China (accessed on 5 April 2021), and Taiwan (accessed on 23 March 2021). For convenience, they are named USMLE, MCMLE, and TWMLE.
  • Each of them is split into 80% training, 10% development, and 10% test.
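
As a rough illustration of an 80/10/10 split, here is a minimal sketch. It assumes a seeded random shuffle, which may differ from how the authors actually partitioned the data; the function name `split_80_10_10` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train/dev/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random split
    n_train = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # remainder goes to the test set
    return train, dev, test


# Stand-in for a list of questions: 1000 integer IDs.
train, dev, test = split_80_10_10(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```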
Human Experts Evaluation
  • In the above table, 2 medical experts with MD degrees annotate how many of the questions can be answered using evidence from the collected material.
  • The collected text materials provide enough information to answer most of the questions in the dataset.

1.3. Analysis

  • Professional Knowledge: Answering every question in the dataset requires abundant domain-specific professional knowledge, particularly medical knowledge, which forces the model to have a deep understanding of the extracted context.
  • Diversity of Questions: There are two categories of questions. Type 1 questions ask for a single piece of knowledge and need only one-step reasoning. Type 2 questions require multi-hop reasoning and are thus much more complicated than Type 1 questions.
  • Complex Reasoning over Multiple Evidence: Many questions in the dataset involve complex multi-hop reasoning over several evidence snippets.
  • Noisy Evidence Retrieval: Retrieving relevant information from large-scale text is much more challenging than reading a short piece of text.

1.4. Approaches

  • There are rule-based and deep-learning-based approaches for benchmarking. (Please read the paper directly for more information.)
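
To give a feel for what a simple non-neural approach to this task can look like, here is a toy retrieve-and-score sketch based on word overlap. It is not the paper's implementation and not one of its reported baselines; the function names and the two example sentences are made up for illustration and are not from MedQA.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, paragraph: str) -> float:
    """Fraction of query tokens that also appear in the paragraph."""
    q = tokenize(query)
    p = tokenize(paragraph)
    return len(q & p) / len(q) if q else 0.0


def pick_answer(question: str, options: Dict[str, str], documents: List[str]) -> str:
    """Return the option whose question+option query best matches any paragraph."""
    best_label, best_score = "", -1.0
    for label, option_text in options.items():
        query = f"{question} {option_text}"
        score = max((overlap_score(query, d) for d in documents), default=0.0)
        if score > best_score:  # ties keep the earlier option
            best_label, best_score = label, score
    return best_label


# Toy data, invented for this sketch.
toy_docs = [
    "Scurvy results from vitamin C deficiency.",
    "Rickets results from vitamin D deficiency.",
]
toy_options = {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"}
predicted = pick_answer("Which vitamin deficiency causes scurvy?", toy_options, toy_docs)
print(predicted)  # B
```

Pure lexical matching like this breaks down quickly on questions that need multi-hop reasoning, which is consistent with the challenges listed in Section 1.3.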

2. Benchmarking Results


Overall, even the strongest pretrained models (BioBERT-Large, RoBERTa-Large) cannot achieve good scores on any of the three datasets, which confirms how challenging the proposed data is.

  • (Please read the paper directly for other results.)


