Brief Review — MedQA: What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

MedQA Dataset, Evaluated by Med-PaLM and Med-PaLM 2

Nov 4, 2023

**Large Improvement from Med-PaLM to Med-PaLM 2**

What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
MedQA, by Massachusetts Institute of Technology, and Huazhong University of Science and Technology
2021 MDPI Appl. Sci., Over 100 Citations (Sik-Ho Tsang @ Medium)
Medical NLP/LLM
2017 [LiveQA] 2018 [Clinical NLP Overview] 2019 [MedicationQA] [G-BERT] 2020 [BioBERT] [BEHRT] 2021 [MedGPT] 2023 [Med-PaLM]

MedQA, the first free-form multiple-choice OpenQA dataset for solving medical problems, is proposed.
It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively.
(Med-PaLM and Med-PaLM 2 have also recently been evaluated on this dataset.)

Outline

1. MedQA Dataset
2. Benchmarking Results

1. MedQA Dataset

The source exams are designed to test doctors’ professional capability, so the dataset contains a significant number of questions that require multi-hop logical reasoning.
It is the first publicly available large-scale multiple-choice OpenQA dataset for medical problems.
It is cross-lingual, covering English and simplified/traditional Chinese.

1.1. Task

The task is defined by its three components:
Question: question in text, either in one sentence asking for a certain piece of knowledge, or in a long paragraph starting with a description of the patient condition.
Answer candidates: multiple answer options are given for each question, of which only one should be chosen as the most appropriate.
Document collection: a collection of text material extracted from a variety of sources and organized into paragraphs, which contains the knowledge and information to help find the answers.
The task is to determine the best answer to the question among the candidates, relying on the documents.
Each question has 4 options.
Two examples are shown above.
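
The three components can be sketched as a small data structure plus a toy answer selector. This is only an illustration: the class name `MedQAExample`, the helper functions, the made-up question and paragraphs, and the word-overlap scoring are all assumptions, not the paper's method or the dataset's actual file format.

```python
import re
from dataclasses import dataclass


@dataclass
class MedQAExample:
    """One task item: a question, its answer candidates, and the gold index."""
    question: str
    options: list[str]  # four candidate answers per question
    answer_idx: int     # index of the correct option


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_answer(example: MedQAExample, documents: list[str]) -> int:
    """Return the index of the option best supported by a single paragraph.

    Support is the number of distinct tokens that "question + option"
    shares with a paragraph (a deliberately naive, hypothetical heuristic).
    """
    question_tokens = tokenize(example.question)
    scores = []
    for option in example.options:
        query = question_tokens | tokenize(option)
        best = max((len(query & tokenize(doc)) for doc in documents), default=0)
        scores.append(best)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy item and toy document collection, invented for illustration only.
example = MedQAExample(
    question="Which vitamin deficiency causes scurvy?",
    options=["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
    answer_idx=1,
)
documents = [
    "Scurvy results from deficiency of vitamin C.",
    "Rickets results from deficiency of vitamin D.",
]

predicted = pick_answer(example, documents)
print(example.options[predicted])  # Vitamin C
```

Real exam questions are far harder than this toy item: long patient descriptions and multi-hop reasoning mean that simple word overlap with one paragraph is rarely enough.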

1.2. Data

None of the prior related datasets has been formulated as an OpenQA problem.

The questions and their associated answer candidates are collected from the National Medical Board Examination in the USA (https://www.usmle.org/, Accessed on 10 March 2021), Mainland China (http://www.nmec.org.cn, Accessed on 5 April 2021), and Taiwan (https://wwwq.moex.gov.tw/exam/wFrmExam-QandASearch.aspx, Accessed on 23 March 2021). For convenience, they are named USMLE, MCMLE, and TWMLE.
Each dataset is split into 80% training, 10% development, and 10% test.
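
As a rough sketch of what an 80/10/10 split implies for the question counts quoted earlier, the snippet below shuffles indices and cuts them into three parts. The function name `split_indices`, the fixed seed, and the rounding rule (floor for training and development, remainder to test) are assumptions for illustration; the review does not describe the exact splitting procedure.

```python
import random

# Question counts per language, as quoted in this review.
DATASET_SIZES = {"USMLE": 12_723, "MCMLE": 34_251, "TWMLE": 14_123}


def split_indices(n: int, seed: int = 0) -> tuple[list[int], list[int], list[int]]:
    """Shuffle indices 0..n-1 and cut them into 80% / 10% / 10% parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n * 8 // 10                 # integer arithmetic avoids float rounding
    n_dev = n // 10
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]      # whatever remains goes to test
    return train, dev, test


for name, size in DATASET_SIZES.items():
    train, dev, test = split_indices(size)
    print(name, len(train), len(dev), len(test))
# USMLE 10178 1272 1273
# MCMLE 27400 3425 3426
# TWMLE 11298 1412 1413
```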

For the table above, two medical experts with MD degrees annotated how many of the questions can be answered using evidence from the collected material.
The collected text materials provide enough information to answer most of the questions in the data.

1.3. Analysis

Professional Knowledge: Answering every question in the dataset requires abundant professional domain-specific knowledge, particularly medical knowledge, which forces the model to have a deep understanding of the extracted context.
Diversity of Questions: There are two categories of questions. Type 1 questions ask for a single piece of knowledge and need only one-step reasoning. Type 2 questions require multi-hop reasoning and are thus much more complicated than Type 1 questions.
Complex Reasoning over Multiple Evidence: Many questions in the data involve complex multi-hop reasoning over several evidence snippets.
Noisy Evidence Retrieval: Retrieving relevant information from large-scale text is much more challenging than reading a short piece of text.

1.4. Approaches

Both rule-based and deep-learning-based approaches are used for benchmarking. (Please read the paper directly for more information.)

2. Benchmarking Results

Overall, even the strongest pretrained models (BioBERT-Large, RoBERTa-Large) fail to achieve good scores on any of the three datasets, which confirms how challenging the proposed data is.
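
For context when reading such scores, here is a minimal sketch of how multiple-choice accuracy and the random-guess level can be computed. It assumes the reported metric is plain accuracy over questions; the function name `accuracy` and the toy predictions are hypothetical.

```python
def accuracy(predictions: list[int], gold: list[int]) -> float:
    """Fraction of questions whose predicted option index matches the gold one."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


NUM_OPTIONS = 4                  # each question has 4 options
chance_level = 1 / NUM_OPTIONS   # expected accuracy of uniform random guessing

# Toy predictions and gold labels, invented for illustration only.
toy_gold = [1, 0, 3, 2]
toy_predictions = [1, 2, 3, 0]
print(accuracy(toy_predictions, toy_gold), chance_level)  # 0.5 0.25
```

With four options per question, any score should be read against the 25% random-guess level.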

(Please read the paper directly for other results.)

Written by Sik-Ho Tsang
