Brief Review — Med-PaLM 2: Towards Expert-Level Medical Question Answering with Large Language Models

Based on PaLM 2, Outperforms Med-PaLM & Flan-PaLM

Sik-Ho Tsang
Dec 8, 2023
Med-PaLM 2 Obtains SOTA Results on MedQA

Towards Expert-Level Medical Question Answering with Large Language Models
Med-PaLM 2, by Google Research, DeepMind
2023 arXiv v1, Over 90 Citations (Sik-Ho Tsang @ Medium)

Medical/Clinical NLP/LLM
2017 … 2021 [MedGPT] [Med-BERT] [MedQA] [PubMedBERT] [MMLU] 2023 [Med-PaLM]

  • Med-PaLM 2 is proposed. It aims to close the gaps that remained between Med-PaLM answers and clinician answers by combining base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies, including a novel ensemble refinement approach.


  1. Med-PaLM 2 Datasets
  2. Med-PaLM 2 Modelling
  3. Results

1. Med-PaLM 2 Datasets

  • Med-PaLM 2 is evaluated on multiple-choice and long-form medical question-answering datasets from MultiMedQA and two new adversarial long-form datasets.

1.1. Multiple-Choice Questions


For evaluation on multiple-choice questions, MedQA, MedMCQA, PubMedQA and MMLU clinical topics datasets are used.

1.2. Long-Form Questions


The first set (MultiMedQA 140) consists of 140 questions curated from the HealthSearchQA, LiveQA, and MedicationQA datasets.

The second set (MultiMedQA 1066) is an expanded sample of 1066 questions drawn from the same sources.

Moreover, two new datasets of adversarial questions are curated to elicit model answers with potential for harm and bias: a general adversarial set and a health-equity-focused adversarial set.

  • The first set (Adversarial — General) broadly covers issues related to health equity, drug use, alcohol, mental health, COVID-19, obesity, suicide, and medical misinformation.
  • The second set (Adversarial — Health equity) prioritizes use cases, health topics, and sensitive characteristics based on relevance to health equity considerations in the domains of healthcare access (e.g., health insurance, access to hospitals or primary care provider), quality (e.g., patient experiences, hospital care and coordination), and social and environmental factors (e.g., working and living conditions, food access, and transportation).

2. Med-PaLM 2 Modelling

Data Mixture

Med-PaLM 2 is a new medical LLM trained using a new base model (PaLM 2) and targeted medical domain-specific finetuning.

2.1. Instruction Finetuning

  • Instruction finetuning is applied to the base LLM.

The datasets used are the training splits of MultiMedQA, namely MedQA, MedMCQA, HealthSearchQA, LiveQA and MedicationQA. A “unified” model is trained, which is optimized for performance across all datasets in MultiMedQA using dataset mixture ratios.

  • A variant of Med-PaLM 2 is obtained by finetuning exclusively on multiple-choice questions, which led to improved results on these benchmarks.
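
To make the idea of dataset mixture ratios concrete, here is a minimal sketch of drawing training examples from several datasets in fixed proportions. The ratios in `MIXTURE` and the helper `sample_dataset_names` are hypothetical and chosen only for illustration; they are not the values or the code used in the paper.

```python
import random

# Hypothetical mixture ratios for illustration only (NOT the paper's values).
MIXTURE = {
    "MedQA": 0.4,
    "MedMCQA": 0.4,
    "HealthSearchQA": 0.1,
    "LiveQA": 0.05,
    "MedicationQA": 0.05,
}


def sample_dataset_names(mixture, n, seed=0):
    """Draw n dataset names, each picked with probability equal to its ratio."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Decide which dataset each of 1000 training examples should come from.
batch_sources = sample_dataset_names(MIXTURE, n=1000)
```

With ratios like these, the larger multiple-choice datasets dominate each batch while the smaller long-form datasets still appear regularly.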

2.2. Few-shot prompting

Few-shot prompting (GPT-3) involves prompting an LLM by prepending example inputs and outputs before the final input.
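
As a rough illustration, a few-shot prompt can be assembled by joining solved examples and leaving the last answer open. The function name `build_few_shot_prompt`, the `Question:`/`Answer:` template, and the toy questions are assumptions made for this sketch, not the prompt format used in the paper.

```python
def build_few_shot_prompt(examples, question):
    """Prepend (question, answer) examples before the final, unanswered question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Toy examples, made up for illustration.
few_shot_examples = [
    ("Which vitamin deficiency causes scurvy? (A) Vitamin A (B) Vitamin C", "(B)"),
    ("Which organ produces insulin? (A) Pancreas (B) Liver", "(A)"),
]
few_shot_prompt = build_few_shot_prompt(
    few_shot_examples,
    "Which cells carry oxygen in blood? (A) Red blood cells (B) Platelets",
)
print(few_shot_prompt)
```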

2.3. Chain-of-Thought (CoT)

Chain-of-Thought (CoT) involves augmenting each few-shot example in a prompt with a step-by-step explanation towards the final answer.
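
A minimal sketch of a CoT prompt follows: each example carries an explanation before its answer, and the final question ends with an open explanation slot so the model reasons first. The template and the name `build_cot_prompt` are hypothetical, not the paper's exact prompt.

```python
def build_cot_prompt(examples, question):
    """Each example is (question, explanation, answer); the explanation precedes the answer."""
    blocks = [
        f"Question: {q}\nExplanation: {e}\nAnswer: {a}"
        for q, e, a in examples
    ]
    # Leave the explanation open so the model writes its reasoning before answering.
    blocks.append(f"Question: {question}\nExplanation:")
    return "\n\n".join(blocks)


# Toy example, made up for illustration.
cot_examples = [
    (
        "Which vitamin deficiency causes scurvy? (A) Vitamin A (B) Vitamin C",
        "Scurvy results from a lack of vitamin C, which is needed for collagen synthesis.",
        "(B)",
    ),
]
cot_prompt = build_cot_prompt(
    cot_examples,
    "Which organ produces insulin? (A) Pancreas (B) Liver",
)
print(cot_prompt)
```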

2.4. Self-Consistency (SC)

  • Performance on multiple-choice benchmarks is improved by sampling multiple explanations and answers from the model. The final answer is the one with the majority (or plurality) vote.

With sampling, different answers come from different reasoning paths. Marginalizing over the reasoning paths can lead to the most accurate answer.
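
The voting step can be sketched in a few lines. Here `sample_fn` stands in for one stochastic call to the model that returns an `(explanation, answer)` pair; it and the canned sampler `fake_sample` are placeholders for this example, not a real API.

```python
import itertools
from collections import Counter


def self_consistency(sample_fn, prompt, num_samples):
    """Sample several (explanation, answer) pairs and return the plurality answer."""
    answers = [sample_fn(prompt)[1] for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]


# Stand-in for a stochastic model: cycles through canned generations.
_canned = itertools.cycle([
    ("reasoning path 1", "B"),
    ("reasoning path 2", "A"),
    ("reasoning path 3", "B"),
])


def fake_sample(prompt):
    return next(_canned)


sc_answer = self_consistency(fake_sample, "some CoT prompt", num_samples=5)
print(sc_answer)  # the five samples are B, A, B, B, A, so "B" wins 3 to 2
```

Only the answers are counted; the explanations differ between samples and are discarded by the vote.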

2.5. Proposed Ensemble Refinement (ER)

  • Building on chain-of-thought and self-consistency, a simple prompting strategy is developed, which is referred to as ensemble refinement (ER).
  • ER involves a two-stage process:

First Stage: Given a (few-shot) chain-of-thought prompt and a question, the model produces multiple possible generations stochastically via temperature sampling. In this case, each generation involves an explanation and an answer for a multiple-choice question.

Second Stage: The model is conditioned on the original prompt, the question, and the concatenated generations from the first stage, and is prompted to produce a refined explanation and answer. This can be interpreted as a generalization of self-consistency.

The second stage is performed multiple times, and a plurality vote over these generated answers then determines the final answer.

  • For example, ER can be used to produce improved long-form generations by having an LLM condition on multiple possible responses to generate a refined final answer.
  • Given the resource cost of approaches requiring repeated samplings from a model, ER is applied only for multiple-choice evaluation in this work, with 11 samplings for the first stage and 33 samplings for the second stage.
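
The two-stage ER procedure can be sketched as follows. This is only an illustration under stated assumptions: `sample_fn` is a placeholder for one temperature-sampled model call returning `(explanation, answer)`, the prompt wording is invented, and `fake_model` is a toy stand-in so that the sketch runs without an LLM. Only the sample counts (11 and 33) come from the text above.

```python
import itertools
from collections import Counter


def ensemble_refinement(sample_fn, prompt, question, n_first=11, n_second=33):
    """Two-stage ER sketch: sample drafts, refine conditioned on them, then vote."""
    base = f"{prompt}\n\nQuestion: {question}"
    # Stage 1: sample several (explanation, answer) generations.
    first = [sample_fn(base) for _ in range(n_first)]
    # Stage 2: condition on the prompt, the question and the concatenated drafts.
    drafts = "\n\n".join(f"Explanation: {e}\nAnswer: {a}" for e, a in first)
    refine_prompt = (
        f"{base}\n\nCandidate generations:\n{drafts}"
        "\n\nRefined explanation and answer:"
    )
    refined = [sample_fn(refine_prompt)[1] for _ in range(n_second)]
    # Plurality vote over the refined answers.
    return Counter(refined).most_common(1)[0][0]


# Toy stand-in for the model (an assumption for this sketch, not a real LLM).
_stage1_answers = itertools.cycle(["A", "B", "B"])


def fake_model(text):
    marker = "Candidate generations:"
    if marker in text:
        # "Refine" by returning the most frequent draft answer.
        lines = text.split(marker)[1].splitlines()
        votes = Counter(l[len("Answer: "):] for l in lines if l.startswith("Answer: "))
        return ("refined reasoning", votes.most_common(1)[0][0])
    return ("draft reasoning", next(_stage1_answers))


er_answer = ensemble_refinement(fake_model, "few-shot CoT prompt", "toy question")
print(er_answer)
```

Unlike plain self-consistency, the aggregation in stage 2 is done by the model itself reading the drafts; the final vote only runs over the refined answers.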

3. Results

  • (For evaluation settings, please read the paper directly.)

3.1. MC Questions

Comparisons With GPT-4
Different Prompting Strategies

MedQA: The unified Med-PaLM 2 model reaches an accuracy of 85.4% using ER.
MedMCQA: Med-PaLM 2 obtains a score of 72.3%, exceeding Flan-PaLM performance by over 14 percentage points.
PubMedQA: Med-PaLM 2 obtains a score of 75.0%.
MMLU clinical topics: Med-PaLM 2 significantly improves over previously reported results in Med-PaLM and is the state-of-the-art on 3 out of 6 topics.

Pairwise Ranking

On MultiMedQA, for 8 of the 9 axes, Med-PaLM 2 answers were more often rated as higher quality than physician answers.

3.2. Long-Form Questions

Physicians rated Med-PaLM 2 answers as significantly higher quality than Med-PaLM answers across all axes.

Lay-people rated Med-PaLM 2 answers to questions in the MultiMedQA 140 dataset as more helpful and relevant than Med-PaLM answers.

Long-Form Questions Comparisons

Med-PaLM 2 answers were rated as higher quality than Med-PaLM answers on the same eight axes. Med-PaLM 2 answers were also marked as containing more inaccurate or irrelevant information less often than Med-PaLM answers were.


