Review — Med-PaLM: Large Language Models Encode Clinical Knowledge

Med-PaLM, Instruction Prompt Tuning Flan-PaLM

Sik-Ho Tsang
Oct 7, 2023
Left: Proposed New Medical LLM Benchmark; Middle: Instruction-Tuned LLM, Med-PaLM; Right: Human Evaluation

Large Language Models Encode Clinical Knowledge
MultiMedQA, HealthSearchQA, Med-PaLM, by Google Research, DeepMind
2023 Nature, Over 220 Citations (Sik-Ho Tsang @ Medium)

Medical Large Language Model (LLM)

  • A benchmark for medical LLMs is presented: MultiMedQA, which combines 6 existing open question answering (QA) datasets spanning professional medical exams, research, and consumer queries, plus HealthSearchQA, a new free-response dataset of medical questions searched online.
  • Instruction prompt tuning is also introduced: a parameter-efficient approach for aligning LLMs to new domains using a few exemplars.
  • A framework is also proposed for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
  • The resulting model, Med-PaLM, performs encouragingly but remains inferior to clinicians. Human evaluations reveal important limitations of today’s models.


  1. MultiMedQA & Proposed HealthSearchQA
  2. Human Evaluation Framework
  3. Instruction Prompt Tuning for Med-PaLM
  4. Results

1. MultiMedQA & Proposed HealthSearchQA

MultiMedQA & HealthSearchQA

1.1. MultiMedQA — A benchmark for medical question answering

MultiMedQA includes multiple-choice question answering datasets, datasets requiring longer-form answers to questions from medical professionals, and datasets requiring longer-form answers to questions that might be asked by non-professionals.

  • These include the MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA and MMLU clinical topics datasets.

Authors further augmented MultiMedQA with a new dataset of curated commonly searched health queries: HealthSearchQA.

  • All the datasets are in English.
  • While MedMCQA, PubMedQA, LiveQA, and MedicationQA provide reference long-form answers or explanations, they are NOT used in this work.
  • Given the safety-critical requirements of the medical domain, evaluation goes beyond automated metrics such as BLEU: human evaluation is also involved.

1.2. Some Details About Each Dataset

  • (Please skip to Section 2 to skip the details of each dataset.)
  1. MedQA: consists of US Medical License Exam (USMLE) style questions, which were obtained with a choice of 4 or 5 possible answers from the National Medical Board Examination in the USA. The development set consists of 11450 questions and the test set has 1273 questions.
  2. MedMCQA: consists of more than 194k 4-option multiple-choice questions from Indian medical entrance examinations (AIIMS/NEET). This dataset covers 2.4k healthcare topics and 21 medical subjects. The development set is substantial, with over 187k questions.
  3. PubMedQA: consists of 1k expert-labeled question-answer pairs where the task is to produce a yes/no/maybe multiple-choice answer. The task is closed-domain, in that the answer must be inferred from the supporting PubMed abstract given as context.
  4. “Measuring Massive Multitask Language Understanding” (MMLU) [29]: includes exam questions from 57 domains. The subtasks that are most relevant to medical knowledge are selected: “anatomy”, “clinical knowledge”, “college medicine”, “medical genetics”, “professional medicine”, and “college biology”. Each MMLU subtask contains multiple-choice questions with four options, along with the answers.
  5. LiveQA: was curated as part of the Text Retrieval Conference (TREC) 2017. It consists of medical questions submitted by people to the National Library of Medicine (NLM). It also includes manually collected reference answers from trusted sources such as the National Institutes of Health (NIH) website.
  6. MedicationQA: consists of commonly asked consumer questions about medications. In addition to the question, the dataset contains annotations corresponding to drug focus and interactions. As with LiveQA, the models’ ability to produce long-form answers to the questions in the test set is evaluated.
  7. HealthSearchQA: Authors curated their own additional dataset consisting of 3375 commonly searched consumer questions. The dataset was curated using seed medical conditions and their associated symptoms. Authors used the seed data to retrieve publicly-available commonly searched questions generated by a search engine, which were displayed to all users entering the seed terms.
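
To keep the seven datasets straight, the descriptions above can be summarized as a small registry. This is only an illustrative sketch: the `Dataset` class, its field names, and the `answer_format` labels are hypothetical and not from the paper. Only details stated above are filled in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dataset:
    """Hypothetical summary record; field names are illustrative only."""
    name: str
    answer_format: str        # "multiple_choice" or "long_form"
    options: Optional[str]    # answer options as described above, None for long-form


# The six existing datasets plus HealthSearchQA, as described in this section.
MULTIMEDQA = [
    Dataset("MedQA", "multiple_choice", "4 or 5"),
    Dataset("MedMCQA", "multiple_choice", "4"),
    Dataset("PubMedQA", "multiple_choice", "yes/no/maybe"),
    Dataset("MMLU clinical topics", "multiple_choice", "4"),
    Dataset("LiveQA", "long_form", None),
    Dataset("MedicationQA", "long_form", None),
    Dataset("HealthSearchQA", "long_form", None),
]


def names_by_format(answer_format: str) -> List[str]:
    """Return the dataset names that use the given answer format."""
    return [d.name for d in MULTIMEDQA if d.answer_format == answer_format]


long_form_sets = names_by_format("long_form")
multiple_choice_sets = names_by_format("multiple_choice")
```

The split matters later: the multiple-choice sets can be scored by accuracy, while the long-form sets are the ones that need human evaluation.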

2. Human Evaluation Framework

Human Evaluation Questions for Clinicians

A pool of clinicians is employed to evaluate the quality of long-form model and human-generated answers along the axes in the table above.

  • Alignment: Raters were asked whether the output of the model was aligned with, opposed to, or unclear with respect to the prevailing scientific consensus.
  • Harm: Raters were asked to focus solely on physical/mental health-related harms.
  • Bias: Raters were asked whether the answer contained information that would be inapplicable or inaccurate for a specific patient demographic.

Human Evaluation Questions for Lay Users

Besides clinicians, five raters without a medical background also evaluate the answers.
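
The percentages reported later in the Results section come from tallying rater labels per axis. A minimal sketch of that tally is shown here, assuming one label per rated answer; the function name, label strings, and ratings are made up for illustration and are not the paper's data or code.

```python
from collections import Counter
from typing import Dict, List


def label_shares(ratings: List[str]) -> Dict[str, float]:
    """Return the percentage of ratings per label, rounded to one decimal."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    total = len(ratings)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Made-up ratings for the "scientific consensus" axis (10 rated answers).
consensus_ratings = ["aligned"] * 8 + ["opposed"] + ["unclear"]
shares = label_shares(consensus_ratings)
```

With these toy ratings, 8 of 10 answers count as aligned with the consensus.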

3. Instruction Prompt Tuning for Med-PaLM

Instruction Prompt Tuning for Med-PaLM
  • Given the safety-critical nature of the medical domain, it is necessary to adapt and align the model with domain-specific data.
  • For this additional training, prompt tuning is used instead of full-model fine-tuning given compute and clinician data generation costs.

The soft prompt is a learned prefix shared across the medical datasets. It is followed by the relevant task-specific, human-engineered prompt (instructions and/or few-shot exemplars, which may be chain-of-thought examples), along with the actual question and/or context.

40 examples across HealthSearchQA, MedicationQA, and LiveQA are used for instruction prompt tuning. Three of them are shown above.
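
The input layout described above (learned soft prompt first, then the embedded hard prompt) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the vocabulary, `EMBED_DIM`, `SOFT_PROMPT_LEN`, and all function names are hypothetical, and no training step is shown. In prompt tuning, only the soft prompt vectors would be updated while the LLM weights stay frozen.

```python
import random
from typing import List

EMBED_DIM = 4          # toy embedding width (hypothetical)
SOFT_PROMPT_LEN = 3    # toy number of learnable prompt vectors (hypothetical)

rng = random.Random(0)


def random_vector(dim: int) -> List[float]:
    """Return a random vector standing in for an embedding."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Frozen part: the LLM's token embedding table (a toy vocabulary here).
vocab = ["instruction", "exemplar", "question", "context"]
frozen_embeddings = {tok: random_vector(EMBED_DIM) for tok in vocab}

# Trainable part: the soft prompt, shared across the medical datasets.
soft_prompt = [random_vector(EMBED_DIM) for _ in range(SOFT_PROMPT_LEN)]


def build_model_input(hard_prompt_tokens: List[str]) -> List[List[float]]:
    """Prefix the soft prompt to the embedded task-specific hard prompt."""
    embedded = [frozen_embeddings[tok] for tok in hard_prompt_tokens]
    return soft_prompt + embedded


hard_prompt = ["instruction", "exemplar", "question"]
model_input = build_model_input(hard_prompt)

# Only the soft prompt would receive gradient updates during tuning.
trainable_params = SOFT_PROMPT_LEN * EMBED_DIM
```

The number of trainable values depends only on the soft prompt size, not on the size of the LLM, which is why the approach is parameter-efficient.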

4. Results

4.1. Multiple-Choice Question (MCQ) Accuracy

Flan-PaLM, the model that Med-PaLM is built on, outperforms previous SOTA approaches on MedMCQA, MedQA and PubMedQA.
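
For reference, multiple-choice accuracy is simply the fraction of questions where the predicted option matches the gold option. A minimal sketch follows, with made-up predictions and a hypothetical function name; it says nothing about how the model's answers are actually generated.

```python
from typing import List


def mcq_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of questions whose predicted option equals the gold option."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up example: 3 of 4 predicted options match the gold answers.
gold_answers = ["A", "C", "B", "D"]
predicted = ["A", "C", "D", "D"]
accuracy = mcq_accuracy(predicted, gold_answers)
```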

4.2. Human Evaluation

Human Evaluation

Fig. 4a: Clinicians’ answers were judged to be aligned with the scientific consensus in 92.9% of questions, whereas Flan-PaLM was found to be in agreement with the scientific consensus in only 61.9% of answers.

92.6% of Med-PaLM answers were judged to be in accordance with the scientific consensus, showcasing the strength of instruction prompt tuning as an alignment technique to produce scientifically grounded answers.

Answers generated by experts were again superior to those of Flan-PaLM, although performance was improved by instruction prompt tuning for Med-PaLM.

Fig. 6b: Flan-PaLM answers were judged to be helpful in only 60.6% of the cases. This increased to 80.3% for Med-PaLM answers. However, this remained inferior to the answers given by clinicians, which were judged to be helpful 91.1% of the time.

Fig. 6a: Similarly, Flan-PaLM answers were judged as directly addressing the intent of the user’s question in 90.8% of cases. This increased to 94.4% for Med-PaLM, whereas the clinician-generated answers were judged as directly addressing intent in 95.9% of cases.
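
To see how much instruction prompt tuning helps and how much of a gap to clinicians remains, the reported percentages can be compared in percentage points. The numbers below are the ones quoted in this section; the dictionary layout and variable names are just for illustration.

```python
# Reported percentages per axis: (Flan-PaLM, Med-PaLM, clinicians).
reported = {
    "scientific consensus": (61.9, 92.6, 92.9),
    "helpfulness": (60.6, 80.3, 91.1),
    "addresses intent": (90.8, 94.4, 95.9),
}

# Gain from tuning and remaining gap to clinicians, in percentage points.
summary = {}
for axis, (flan_palm, med_palm, clinician) in reported.items():
    summary[axis] = {
        "gain": round(med_palm - flan_palm, 1),
        "gap": round(clinician - med_palm, 1),
    }

for axis, values in summary.items():
    print(f"{axis}: +{values['gain']} points from tuning, "
          f"{values['gap']} points behind clinicians")
```

On consensus the gap to clinicians nearly closes (0.3 points), while on helpfulness Med-PaLM is still 10.8 points behind.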

  • (This is only a brief description of Med-PaLM. Please feel free to read the paper directly for more details.)


