Review — Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization

LLM Better Than Medical Experts?

Sik-Ho Tsang
5 min read · Apr 25, 2024
Framework overview

Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization
LLM on Clinical Text Summarization
, by Stanford University
2024 Nature Medicine (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017 … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT-4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology]
==== My Other Paper Readings Are Also Over Here ====

  • 8 LLMs are evaluated on 4 distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Quantitative assessments with syntactic, semantic, and conceptual NLP metrics reveal trade-offs between models and adaptation methods.
  • A clinical reader study with 10 physicians compares summaries from the best adapted model against those written by medical experts, rating completeness, correctness, and conciseness.


  1. LLM on Clinical Text Summarization
  2. Results

1. LLM on Clinical Text Summarization

LLM on Clinical Text Summarization Framework Overview

1.1. LLM

  • The models chosen for evaluation vary widely in parameter count (2.7 billion to 175 billion) and context length (512 to 32,768 tokens), as tabulated above.

1.2. Adaptation Methods

  • In-context learning (ICL) is a lightweight adaptation method that requires no altering of model weights; instead, a handful of in-context examples is included directly within the model prompt. m = 2^x examples are used, where x ∈ {0, 1, 2, 3, …, M} for M such that no more than 1% of the s = 250 samples are excluded.
  • Quantized low-rank adaptation (QLoRA) employs 4-bit quantization to enable the fine-tuning of larger LLMs under the same hardware constraints.
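To make the ICL setup concrete, here is a minimal sketch of the example schedule (m = 2^x) and of prepending in-context examples to a prompt. The function names, prompt template, and toy radiology snippets are my own illustration, not the paper's actual prompts.

```python
def icl_example_counts(max_exponent: int):
    """Return the ICL schedule m = 2^x for x in {0, 1, ..., M}."""
    return [2 ** x for x in range(max_exponent + 1)]

def build_icl_prompt(instruction, examples, query):
    """Prepend m in-context (input, summary) pairs to the query."""
    parts = [instruction]
    for source, summary in examples:
        parts.append(f"Input: {source}\nSummary: {summary}")
    parts.append(f"Input: {query}\nSummary:")
    return "\n\n".join(parts)

# Toy examples only (hypothetical, not from the paper's datasets).
examples = [
    ("CT chest: no acute findings.", "Normal chest CT."),
    ("XR wrist: distal radius fracture.", "Distal radius fracture."),
]
prompt = build_icl_prompt(
    "Summarize the radiology findings into an impression.",
    examples,
    "MRI brain: small chronic lacunar infarct, no mass.",
)
print(icl_example_counts(3))  # [1, 2, 4, 8]
```

Because no weights change, the same base model can be adapted to each of the 4 tasks just by swapping the in-context examples.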

1.3. Data

  • 4 distinct summarization tasks, comprising 6 open-source datasets, are used to evaluate LLM performance on clinical text summarization, as tabulated above.

1.4. Model Prompts and Temperature

  • Prompt phrasing and model temperature can have a considerable effect on LLM output.
  • The lowest temperature value tested, 0.1, performed best.
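As background on why a low temperature helps here: temperature divides the logits before the softmax, so T = 0.1 concentrates probability mass on the top token, favoring deterministic, faithful summaries over varied ones. A minimal sketch of the standard arithmetic (not code from the paper):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T before softmax; T -> 0 approaches argmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_low = softmax_with_temperature(logits, 0.1)
p_high = softmax_with_temperature(logits, 2.0)
# At T = 0.1 nearly all probability sits on the top token; at T = 2.0
# the distribution flattens, encouraging more varied (riskier) output.
```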

2. Results

2.1. Alpaca vs. MedAlpaca

Alpaca vs. MedAlpaca

Despite MedAlpaca’s adaptation for the medical domain, it performs worse than Alpaca for the tasks of clinical text summarization.

2.2. ICL vs. QLoRA

One-Example ICL vs. QLoRA

FLAN-T5 emerged as the best-performing model with QLoRA. QLoRA typically outperformed ICL (one example) with the better models FLAN-T5 and Llama-2.

  • Given a sufficient number of in-context examples, however, most models surpass even the best QLoRA fine-tuned model, FLAN-T5.

2.3. Effect of Context Length for ICL

Compared to zero-shot prompting (m = 0 examples), adapting with even m = 1 example considerably improves performance in almost all cases, underscoring the importance of adaptation methods.

  • While ICL and QLoRA are competitive for open-source models, proprietary models GPT-3.5 and GPT-4 far outperform other models and methods given sufficient in-context examples.

2.4. Head-to-head Model Comparison

Head-to-head Model Comparison
  • Seq2seq models (FLAN-T5, FLAN-UL2 [54, 55]) perform well on syntactic metrics such as BLEU [87] but worse on others. Autoregressive models may perform better as data heterogeneity and complexity increase.

The best model and method is GPT-4.

2.5. Clinical Reader Study

Clinical Reader Study
  • Readers evaluate each pair of summaries on three attributes using a 5-point Likert scale:
  1. Completeness: “Which summary more completely captures important information?” This compares the summaries’ recall, i.e. the amount of clinically significant detail retained from the input text.
  2. Correctness: “Which summary includes less false information?” This compares the summaries’ precision, i.e. instances of fabricated information.
  3. Conciseness: “Which summary contains less non-important information?” This compares which summary is more condensed, as the value of a summary decreases with superfluous information.
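As a minimal sketch of how such head-to-head Likert ratings could be tallied into the preferred / non-inferior / expert-preferred categories reported in Figure 7c. The score-to-category mapping and the scores themselves are hypothetical, not the study's protocol or data:

```python
from collections import Counter

def tally_preferences(scores):
    """Map 5-point Likert scores to preference categories.

    Assumed mapping (illustrative, not from the paper):
    1-2 favor the expert, 3 is a tie (non-inferior), 4-5 favor the model.
    """
    cats = Counter()
    for s in scores:
        if s <= 2:
            cats["expert preferred"] += 1
        elif s == 3:
            cats["non-inferior"] += 1
        else:
            cats["model preferred"] += 1
    n = len(scores)
    return {k: round(100 * v / n) for k, v in cats.items()}

# Toy scores only -- not the study's data.
shares = tally_preferences([5, 4, 3, 3, 2, 4, 3, 5, 1, 4])
```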

As shown above, summaries from the best adapted model (GPT-4 using ICL) are more complete and contain fewer errors.

The best model summaries are more complete on average than medical expert summaries.

As in Figure 7c, medical expert summaries are preferred in only a minority of cases (19%), while in a majority, the best model is either non-inferior (45%) or preferred (36%).

Annotation: radiology reports.

For the example above, conciseness could be improved with better prompt engineering.

2.6. Connecting Quantitative and Clinical Evaluations

Connecting quantitative and clinical evaluations
  • Figure 9 captures the correlation between NLP metrics and physicians’ preference.
  • Compared to other metrics, BLEU correlates most with completeness and least with conciseness.
  • The metrics BERTScore (measuring semantics) and MEDCON (measuring medical concepts) correlate most strongly with reader preference for correctness.

Figure 9 suggests that syntactic metrics are better at measuring completeness, while semantic and conceptual metrics are better at measuring correctness.
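For reference, the statistic behind a figure like Figure 9 can be illustrated with a small rank-correlation sketch. The metric values and reader scores below are toy numbers, not the paper's data, and the implementation is the textbook Spearman formula without tie handling:

```python
def spearman(x, y):
    """Spearman rank correlation (ties not handled; for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy per-summary values: an NLP metric vs. a reader's rating.
bleu = [0.31, 0.42, 0.18, 0.55, 0.27]
completeness = [3, 4, 1, 5, 2]
rho = spearman(bleu, completeness)
# Identical rankings give rho = 1.0 (perfect agreement).
```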


