The first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, is proposed.
It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively.
(This dataset has also been used recently to evaluate Med-PaLM and Med-PaLM 2.)
Outline
MedQA Dataset
Benchmarking Results
1. MedQA Dataset
The dataset is sourced from examinations designed to assess doctors’ professional capability, and thus contains a significant number of questions that require multi-hop logical reasoning.
It is the first publicly available large-scale multiple-choice OpenQA dataset for medical problems.
It is cross-lingual, covering English and simplified/traditional Chinese.
1.1. Task
Two Examples
The task is defined by its three components:
Question: a question in text form, either a single sentence asking for a certain piece of knowledge or a long paragraph beginning with a description of the patient’s condition.
Answer candidates: multiple answer options are given for each question, of which only one should be chosen as the most appropriate.
Document collection: a collection of text material extracted from a variety of sources and organized into paragraphs, which contains the knowledge and information needed to find the answers.
The task is to determine the best answer to the question among the candidates, relying on the document collection.
Each question has 4 options.
Two examples are shown above.
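To make the task format concrete, here is a minimal sketch of what a single example could look like as a Python dictionary. The field names and the question/option texts are illustrative assumptions, not the exact schema of the released files.

```python
# Hedged sketch of a single MedQA-style example.
# Keys and texts are illustrative assumptions, not the official schema.
example = {
    "question": (
        "A 54-year-old man presents with ... (patient description) ... "
        "Which of the following is the most likely diagnosis?"
    ),
    "options": {
        "A": "Candidate diagnosis A",
        "B": "Candidate diagnosis B",
        "C": "Candidate diagnosis C",
        "D": "Candidate diagnosis D",
    },
    "answer": "C",  # exactly one of the four options is correct
}
```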
1.2. Data
Dataset Comparison
As the comparison above shows, none of the prior related datasets has been formulated as an OpenQA problem.
MedQA Statistics
Document Collection Statistics
The questions and their associated answer candidates are collected from the National Medical Board Examination in the USA (https://www.usmle.org/, Accessed on 10 March 2021), Mainland China (http://www.nmec.org.cn, Accessed on 5 April 2021), and Taiwan (https://wwwq.moex.gov.tw/exam/wFrmExam-QandASearch.aspx, Accessed on 23 March 2021). For convenience, they are named USMLE, MCMLE, and TWMLE.
The data of each language are split into 80% training, 10% development, and 10% test.
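As a rough illustration only, the sketch below assumes the splits ship as JSON-lines files named train.jsonl, dev.jsonl, and test.jsonl (the actual file names and keys depend on the release) and prints the size of each split.

```python
import json

def load_split(path):
    """Read one JSON-lines file into a list of question dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Assumed file names; adjust to the actual release layout.
splits = {name: load_split(f"{name}.jsonl") for name in ("train", "dev", "test")}

total = sum(len(items) for items in splits.values())
for name, items in splits.items():
    print(f"{name}: {len(items)} questions ({len(items) / total:.0%})")
```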
Human Experts Evaluation
In the table above, two medical experts holding MD degrees annotated sampled questions to check how many of them can be answered with evidence from the collected material.
The collected text materials provide enough information to answer most of the questions in the dataset.
1.3. Analysis
Professional Knowledge: Answering every question in the dataset requires abundant professional domain-specific knowledge, particularly medical knowledge, which forces the model to have a deep understanding of the extracted context.
Diversity of Questions: There are two categories of questions. Type 1 questions ask for a single piece of knowledge and need only one-step reasoning. Type 2 questions require multi-hop reasoning and are thus much more complicated than Type 1 ones.
Complex Reasoning over Multiple Evidence: Many questions in the data involve complex multi-hop reasoning over several evidence snippets.
Noisy Evidence Retrieval: Retrieving relevant information from large-scale text is much more challenging than reading a short piece of text.
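To illustrate this retrieval step, here is a minimal lexical-retrieval sketch using TF-IDF and cosine similarity over a toy paragraph collection. It is only a stand-in: the paper's baselines retrieve from the much larger document collection with a full-text search engine, and the paragraphs and query below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the large paragraph collection (illustrative texts).
paragraphs = [
    "Nitrofurantoin is commonly used for uncomplicated urinary tract infections.",
    "Doxycycline is contraindicated during pregnancy.",
    "Beta blockers reduce myocardial oxygen demand in stable angina.",
]
query = "treatment of urinary tract infection in a pregnant patient"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(paragraphs)          # index the collection
query_vec = vectorizer.transform([query])                  # vectorize question (+ option)
scores = cosine_similarity(query_vec, doc_matrix).ravel()  # lexical relevance scores

for idx in scores.argsort()[::-1][:2]:                     # top-2 evidence paragraphs
    print(f"{scores[idx]:.3f}  {paragraphs[idx]}")
```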
1.4. Approaches
There are rule-based and deep learning-based approaches for benchmarking. (Please read the paper directly for more information.)
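As one hedged sketch of the deep-learning side (not the paper's exact readers, which include BioBERT- and RoBERTa-based models), the snippet below scores the four options of a question with a generic Hugging Face multiple-choice head. The checkpoint name and input texts are placeholders, and the classification head here is untrained, so the snippet only illustrates the input/output shapes.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

# Placeholder checkpoint; the paper's readers use biomedical/clinical BERT variants.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)  # head is untrained here

question = "A 61-year-old woman presents with ... Which is the most likely diagnosis?"
evidence = "Retrieved evidence paragraph text goes here."
options = ["Option A text", "Option B text", "Option C text", "Option D text"]

# Pair the shared context (evidence + question) with each candidate option.
contexts = [f"{evidence} {question}"] * len(options)
enc = tokenizer(contexts, options, truncation=True, padding=True, return_tensors="pt")
enc = {k: v.unsqueeze(0) for k, v in enc.items()}  # (batch=1, num_choices=4, seq_len)

with torch.no_grad():
    logits = model(**enc).logits  # shape (1, 4): one score per option
print("Predicted option:", "ABCD"[logits.argmax(dim=-1).item()])
```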
2. Benchmarking Results
MCMLE
USMLE and TWMLE
Overall, even the strongest pretrained models (BioBERT-Large, RoBERTa-Large) cannot achieve good scores on any of the three datasets, validating the great challenge posed by the proposed data.
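Since the benchmark metric is plain multiple-choice accuracy, a minimal evaluation sketch over hypothetical prediction and gold-label lists would be:

```python
# Hypothetical predictions and gold labels for a handful of test questions.
predictions = ["A", "C", "D", "B", "C"]
gold_labels = ["A", "B", "D", "B", "A"]

correct = sum(p == g for p, g in zip(predictions, gold_labels))
accuracy = correct / len(gold_labels)
print(f"Accuracy: {accuracy:.1%}")  # 3/5 correct -> 60.0%
```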
(Please read the paper directly for other results.)