Brief Review — MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

Baseline Models Answer Only 47% of Questions Correctly, Far Behind Human Performance of 90%

Sik-Ho Tsang
4 min read · Dec 4, 2023
Samples from the MedMCQA dataset, along with the answer’s explanation.

MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, by Saama AI Research, Chennai
2022 CHIL, Over 60 Citations (Sik-Ho Tsang @ Medium)

Medical NLP/LLM
2017 … 2020 [BioBERT] [BEHRT] 2021 [MedGPT] [Med-BERT] [MedQA] 2023 [Med-PaLM]

  • A new large-scale Multiple-Choice Question Answering (MCQA) dataset, MedMCQA, is introduced to address real-world medical entrance exam questions.
  • More than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity.
  • Each sample contains a question, the correct answer(s), and other options, and choosing among them requires deeper language understanding.


  1. MedMCQA Dataset
  2. Baseline Models
  3. Results

1. MedMCQA Dataset

Dataset Comparisons

1.1. Task

The MedMCQA task can be formulated as X = {Q, O}, where Q is the question text and O = {O1, O2, …, On} is the set of candidate options given for that question.

The ground-truth label of a data point is y ∈ R^n with each yi ∈ {0, 1}, where n is the number of options. The objective is to learn a prediction function f: X → y.

  • The goal is to select the single or multiple answers from the option set.
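To make the formulation concrete, here is a minimal sketch of one data point and a prediction function f. The `MCQExample` container, the `predict` helper, and the `word_overlap` scorer are hypothetical names for illustration only; they are not from the paper. The sketch covers the single-answer case by picking the highest-scoring option.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQExample:
    """One data point X = {Q, O} with its label vector y (hypothetical container)."""
    question: str       # Q
    options: List[str]  # O = {O1, ..., On}
    label: List[int]    # y, with each y_i in {0, 1}


def predict(example: MCQExample,
            score: Callable[[str, str], float]) -> List[int]:
    """Return a 0/1 vector that marks the highest-scoring option.

    `score` stands in for a trained model: any function mapping
    (question, option) to a real number.
    """
    scores = [score(example.question, opt) for opt in example.options]
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(scores))]


def word_overlap(question: str, option: str) -> float:
    """Toy scorer: number of lowercase words shared by question and option."""
    shared = set(question.lower().split()) & set(option.lower().split())
    return float(len(shared))


# Toy strings only; not taken from the dataset.
example = MCQExample(
    question="toy question about alpha",
    options=["beta", "alpha item", "gamma", "delta"],
    label=[0, 1, 0, 0],
)
prediction = predict(example, word_overlap)
```

A multiple-answer variant would threshold the per-option scores instead of taking a single maximum.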

1.2. Dataset Collection

The All India Institute of Medical Sciences PG exam (AIIMS PG) and the National Eligibility cum Entrance Test PG (NEET PG) are the two medical entrance exams conducted by the All India Institute of Medical Sciences (AIIMS) and the National Board of Examinations (NBE), respectively, for admission to postgraduate medical courses.

  • Applicants must have obtained a Bachelor of Medicine and Bachelor of Surgery (MBBS) degree from a recognized institute to sit the exams.

The raw data is collected from open websites and books that compile mock tests and online test series created by medical professionals. In addition, AIIMS and NEET PG examination questions (1991 to present) from the official websites are used to create MedMCQA.

  • The dataset contains MCQs with fine-grained human-labeled classes on various graduation level medical subjects.

Each sample contains ID, question, correct answer, and options. Besides, an explanation of the solution is also provided.

1.3. Preprocessing & Quality Checks

  1. Questions with an inconsistent format were excluded.
  2. Questions whose validity relied on external information were filtered.
  3. Questions containing keywords such as “equation”, “India”, “graph”, or “map” were removed, among other filters.
  4. ‘Grammarly’ was used to fix grammar, punctuation, and spelling mistakes.
  5. Duplicated questions were removed.
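Two of the checks above, keyword filtering and duplicate removal, can be sketched as follows. This is only an illustration: the whole-word matching rule, the normalization step, and the function names are assumptions, and the keyword tuple holds only the four keywords named in the text, not the authors' full list.

```python
import re
from typing import Iterable, List

# Only the keywords named in the text; the authors' full list is not given here.
BLOCKED_KEYWORDS = ("equation", "india", "graph", "map")


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(question.lower().split())


def filter_questions(questions: Iterable[str]) -> List[str]:
    """Drop questions containing a blocked keyword, then drop exact duplicates."""
    seen = set()
    kept = []
    for question in questions:
        norm = normalize(question)
        words = set(re.findall(r"[a-z]+", norm))  # whole-word match (assumption)
        if any(keyword in words for keyword in BLOCKED_KEYWORDS):
            continue
        if norm in seen:
            continue
        seen.add(norm)
        kept.append(question)
    return kept


# Toy inputs, not taken from the dataset.
cleaned = filter_questions([
    "Which graph shows X?",
    "What is A?",
    "what  is a?",
    "What is B?",
])
```

Matching whole words rather than substrings keeps a question that mentions "mapping" from being dropped by the "map" keyword; whether the authors did the same is not stated.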

The final dataset contains 193,155 questions.

1.4. Split Criteria

  • The split is made by exam rather than by individual question, which also supports the reusability and generalization ability of the models.
  • To avoid information leakage, the Levenshtein distance between each pair of questions in the entire dataset was computed. If the similarity between two questions was larger than 0.9, the question was excluded from the development and test sets.
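The leakage check can be sketched with a plain edit-distance routine. The text gives the 0.9 threshold but not how the distance is turned into a similarity, so the normalization below (1 minus distance divided by the longer length) is an assumption, as are the function names and the choice to compare held-out questions against the training questions only.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - distance / longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def remove_leaks(train: List[str], heldout: List[str],
                 threshold: float = 0.9) -> List[str]:
    """Keep held-out questions that are not near-duplicates of any train question."""
    return [q for q in heldout
            if all(similarity(q, t) <= threshold for t in train)]


# Toy strings, not taken from the dataset.
train_questions = ["what is the normal value?"]
heldout_questions = ["what is the normal value", "totally different"]
clean_heldout = remove_leaks(train_questions, heldout_questions)
```

The pairwise comparison is quadratic in the number of questions, so at the scale of the full dataset some form of blocking or batching would likely be needed in practice.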

The final dataset contains 183K train examples, 6K in the development set, and 4K in the test set.

1.5. Statistics

Dataset Statistics

The train, development, and test sets have average token lengths of 12.35, 13.91, and 9.68, respectively.

2. Baseline Models

2.1. Transformer-Based Models


  • The Transformer-based models are fine-tuned on the proposed training set in a multi-class classification fashion.
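As a rough illustration of the multi-class setup, the sketch below assumes the model emits one logit per option and is trained with softmax cross-entropy against the index of the correct option. This is a common way to set up such a classifier, not a confirmed description of the paper's classification head, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one score per option."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], correct_index: int) -> float:
    """Negative log-probability assigned to the correct option."""
    return -math.log(softmax(logits)[correct_index])


def predicted_option(logits: List[float]) -> int:
    """Index of the option with the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical logits for a question with four options.
logits = [0.1, 2.0, -1.0, 0.3]
probabilities = softmax(logits)
loss_if_second_is_correct = cross_entropy(logits, 1)
loss_if_third_is_correct = cross_entropy(logits, 2)
```

Fine-tuning would minimize this loss over the training set; the loss is small when the correct option already has the largest logit and large otherwise.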

2.2. Retriever Model

Retriever Model

Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) and PubMedBERT are utilized.

  • DPR follows a siamese/bi-encoder architecture: one encoder encodes the documents and another encodes the query. It was originally trained so that relevant passages can be found by maximum inner product search (MIPS).
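The bi-encoder idea can be sketched without any neural network: two encoders map text to vectors, and documents are ranked by their inner product with the query vector. The `toy_encoder` below is a made-up bag-of-words stand-in used for both sides, and `VOCAB` and `retrieve` are hypothetical names; a real DPR setup would use two trained Transformer encoders and an approximate search index.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def inner_product(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query: str,
             documents: List[str],
             encode_query: Callable[[str], Vector],
             encode_document: Callable[[str], Vector],
             top_k: int = 1) -> List[Tuple[str, float]]:
    """Rank documents by inner product with the query (exhaustive search)."""
    query_vector = encode_query(query)
    scored = [(doc, inner_product(query_vector, encode_document(doc)))
              for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vocabulary and encoder, for illustration only.
VOCAB = ("heart", "lung", "bone")


def toy_encoder(text: str) -> List[float]:
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


corpus = ["bone density", "heart heart valve", "lung volume"]
top_hit = retrieve("heart rate", corpus, toy_encoder, toy_encoder, top_k=1)
```

Because the document vectors do not depend on the query, they can be computed once and reused across queries, which is the main practical appeal of the bi-encoder design.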

2.3. Training Settings

  • Out-Domain: Pre-trained models are trained on out-domain corpora such as Wikipedia and BookCorpus.
  • Mix-Domain (continual): Pre-trained models are first trained on out-domain corpora and later adapted to in-domain corpora, or are trained from scratch on both out-domain and in-domain corpora.
  • In-Domain: Pre-trained models are trained from scratch on in-domain corpora such as PubMed abstracts and full texts.

3. Results

PubMedBERT performs better than other models in all the categories.

The subject-wise accuracies of the top PubMedBERT model are presented in Table 3 above.

Some Correct Prediction Examples
  • (Please read the paper directly for more examples.)


