Brief Review — MedQA: What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
MedQA Dataset, Used to Evaluate Med-PaLM and Med-PaLM 2
4 min read · Nov 4, 2023
What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
MedQA, by Massachusetts Institute of Technology and Huazhong University of Science and Technology
2021 MDPI Appl. Sci., Over 100 Citations (Sik-Ho Tsang @ Medium)
Medical NLP/LLM
2017 [LiveQA] 2018 [Clinical NLP Overview] 2019 [MedicationQA] [G-BERT] 2020 [BioBERT] [BEHRT] 2021 [MedGPT] 2023 [Med-PaLM]
==== My Other Paper Readings Are Also Over Here ====
- The first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, is proposed.
- It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively.
- (This dataset was also recently used to evaluate Med-PaLM and Med-PaLM 2.)
Outline
- MedQA Dataset
- Benchmarking Results
1. MedQA Dataset
- The source exams are designed to examine doctors’ professional capability, so the dataset contains a significant number of questions that require multi-hop logical reasoning.
- It is the first publicly available large-scale multiple-choice OpenQA dataset for medical problems.
- It is cross-lingual, covering English and simplified/traditional Chinese.
1.1. Task
- The task is defined by its three components:
- Question: the question text, either a single sentence asking for a certain piece of knowledge, or a long paragraph starting with a description of the patient’s condition.
- Answer candidates: multiple answer options are given for each question, of which only one should be chosen as the most appropriate.
- Document collection: a collection of text material extracted from a variety of sources and organized into paragraphs, which contains the knowledge and information to help find the answers.
- The task is to determine the best answer to the question among the candidates, relying on the document collection.
- Each question has 4 options.
- Two examples are shown above.
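To make the format concrete, below is a minimal sketch of how one such example could be represented in Python. The vignette text and the field names (question, options, answer) are illustrative assumptions, not the exact content or schema of the released files.

```python
# A made-up illustration of one MedQA-style example; field names and content
# are assumptions, not necessarily the exact schema of the released files.
example = {
    "question": (
        "A 35-year-old man presents with fever, productive cough, and "
        "right-sided chest pain for 3 days. What is the most likely diagnosis?"
    ),
    "options": {
        "A": "Pulmonary embolism",
        "B": "Community-acquired pneumonia",
        "C": "Tuberculosis",
        "D": "Lung abscess",
    },
    "answer": "B",  # exactly one of the 4 options is the correct answer
}

def format_prompt(ex):
    """Render the question and its 4 answer candidates as a single prompt string."""
    lines = [ex["question"]] + [f"{key}. {text}" for key, text in ex["options"].items()]
    return "\n".join(lines)

print(format_prompt(example))
```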
1.2. Data
- None of the prior related datasets has been formulated as an OpenQA problem.
- The questions and their associated answer candidates are collected from the National Medical Board Examinations in the USA (https://www.usmle.org/, Accessed on 10 March 2021), Mainland China (http://www.nmec.org.cn, Accessed on 5 April 2021), and Taiwan (https://wwwq.moex.gov.tw/exam/wFrmExam-QandASearch.aspx, Accessed on 23 March 2021). For convenience, they are referred to as USMLE, MCMLE, and TWMLE, respectively.
- The data are split into 80% training, 10% development, and 10% test sets (a minimal split sketch is given after this list).
- The above table reports an annotation by two medical experts holding MD degrees of how many questions can be answered with evidence found in the collected material.
- The collected text materials can provide enough information for answering most of the questions in the data.
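Below is a minimal sketch of the 80/10/10 split mentioned above. The random shuffling and fixed seed are assumptions for illustration, not necessarily the authors’ exact procedure.

```python
import random

# A minimal sketch of an 80/10/10 train/dev/test split; shuffling and the
# fixed seed are illustrative assumptions, not the authors' procedure.
def split_dataset(examples, seed=0):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (
        examples[:n_train],                 # training set
        examples[n_train:n_train + n_dev],  # development set
        examples[n_train + n_dev:],         # test set
    )

# e.g., splitting the 12,723 USMLE questions (indices used as stand-ins)
train, dev, test = split_dataset(range(12723))
print(len(train), len(dev), len(test))
```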
1.3. Analysis
- Professional Knowledge: Answering every question in the dataset requires abundant professional domain-specific knowledge, particularly medical knowledge, which forces the model to have a deep understanding of the extracted context.
- Diversity of Questions: There are two categories of questions. Type 1: the question asks for a single piece of knowledge and needs only one-step reasoning. Type 2: the question requires multi-hop reasoning and is thus much more complicated than Type 1.
- Complex Reasoning over Multiple Evidence: Many questions in the data involve complex multi-hop reasoning over several evidence snippets.
- Noisy Evidence Retrieval: Retrieving relevant information from large-scale text is much more challenging than reading a short piece of text.
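To illustrate the retrieval step, the sketch below ranks paragraphs of the document collection against a question with TF-IDF similarity. This is an assumed baseline retriever for illustration, not the exact retrieval pipeline used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# An assumed TF-IDF baseline retriever: rank paragraphs of the document
# collection by cosine similarity to the question and keep the top-k.
def retrieve(question, paragraphs, top_k=5):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(paragraphs + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(paragraphs[i], float(scores[i])) for i in ranked]

paragraphs = [
    "Community-acquired pneumonia typically presents with fever and productive cough.",
    "Pulmonary embolism often causes sudden pleuritic chest pain and dyspnea.",
]
print(retrieve("fever and productive cough for 3 days", paragraphs, top_k=1))
```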
1.4. Approaches
- There are rule-based and deep learning-based approaches for benchmarking. (Please read the paper directly for more information.)
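As an illustration of a deep learning-based reader, the sketch below scores each (evidence, question, option) triple with a generic multiple-choice transformer and picks the highest-scoring option. The model choice ("bert-base-uncased", which here would need fine-tuning before its scores are meaningful) and the input layout are assumptions for illustration; the paper’s readers differ in detail (e.g., aggregating over several retrieved snippets).

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

# A generic multiple-choice reader sketch, not the paper's exact model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

def predict(evidence, question, options):
    # One (evidence, question + option) pair per candidate answer.
    first = [evidence] * len(options)
    second = [f"{question} {opt}" for opt in options]
    enc = tokenizer(first, second, padding=True, truncation=True, return_tensors="pt")
    # Multiple-choice models expect inputs of shape (batch, num_choices, seq_len).
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, num_choices)
    return options[int(logits.argmax(dim=-1))]
```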