Brief Review — Exploring the Boundaries of GPT-4 in Radiology

Evaluation of GPT-4 on Text-Based Radiology Reports

Sik-Ho Tsang
Dec 15, 2023
(Image from Anete Lusina)

Exploring the Boundaries of GPT-4 in Radiology
GPT-4 in Radiology, by Microsoft Health Futures, and Harvard University
2023 EMNLP (Sik-Ho Tsang @ Medium)

Medical/Clinical NLP/LLM
2017 … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2]

  • GPT-4 is evaluated on text-based applications for radiology reports.


  1. Experimental Setup
  2. Results

1. Experimental Setup

  • Alongside GPT-4 (gpt-4-32k), two earlier GPT-3.5 models are evaluated: text-davinci-003 and ChatGPT (gpt-35-turbo).
  • Each task starts with zero-shot prompting. Prompt complexity is then progressively increased, first to random few-shot prompting (a fixed set of random examples) and then to similarity-based example selection (Liu et al., 2022). For example selection, OpenAI’s general-domain text-embedding-ada-002 model encodes the training examples, which form the candidate pool from which the n nearest neighbours are selected for each test instance.
  • For NLI, Chain-of-Thought (CoT) is also explored.
  • For findings summarisation, ImpressionGPT (Ma et al., 2023), which adopts dynamic example selection and iterative refinement, is replicated.
  • To test the stability of GPT-4 output, self-consistency is applied for sentence similarity, NLI, and disease classification.
  • Mean and standard deviation across five runs are reported.
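
To make the setup above more concrete, here is a minimal sketch of similarity-based example selection and of self-consistency. It rests on several assumptions that are not from the paper: embeddings are already computed and passed in as plain lists of floats (the paper uses text-embedding-ada-002 for this step), cosine similarity is the distance measure, and self-consistency is implemented as a simple majority vote over sampled answers. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_nearest_examples(test_embedding, pool, n):
    """Return the n pool examples most similar to the test instance.

    `pool` is a list of (embedding, example) pairs built from the training set.
    """
    ranked = sorted(
        pool,
        key=lambda item: cosine_similarity(test_embedding, item[0]),
        reverse=True,
    )
    return [example for _, example in ranked[:n]]


def majority_vote(answers):
    """Self-consistency (assumed form): keep the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


# Toy 2-D vectors standing in for real embeddings (hypothetical data).
pool = [
    ([1.0, 0.0], "No pleural effusion."),
    ([0.9, 0.1], "Small left pleural effusion."),
    ([0.0, 1.0], "Heart size is normal."),
]
nearest = select_nearest_examples([1.0, 0.05], pool, n=2)
label = majority_vote(["entailment", "neutral", "entailment"])
```

The selected examples would then be placed in the prompt as in-context demonstrations ahead of the test instance, and the voted label would be the final prediction for that instance.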

2. Results

2.1. GPT-4 vs SOTA Radiology Models

Results Overview

The key finding is that GPT-4 outperforms or is on par with SOTA radiology models across the broad range of tasks considered.

  • It is further noticed that different tasks require different prompting efforts and strategies.

2.2. Different Tasks in Detail

Sentence Similarity Tasks

As shown in Table 2, all the GPT models outperform BioViL-T, achieving new SOTA. In particular, GPT-4 significantly outperforms both text-davinci-003 and ChatGPT on MS-CXR-T, indicating an advanced understanding of disease progression.


For natural language inference (NLI), GPT-4 with CoT achieves a new SOTA on RadNLI, outperforming DoT5 by 10% in macro F1. It is observed that CoT greatly helps in this task, especially for GPT-3.5.
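
For readers less familiar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how often it occurs. The sketch below is only an illustration of that definition: the three-way label set is the usual NLI one, and the toy predictions are made up rather than taken from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, not results from the paper.
labels = ["entailment", "neutral", "contradiction"]
y_true = ["entailment", "neutral", "contradiction", "neutral"]
y_pred = ["entailment", "neutral", "neutral", "neutral"]
score = macro_f1(y_true, y_pred, labels)
```

In this toy case the per-class F1 scores are 1.0, 0.8 and 0.0, so a single missed contradiction pulls the macro average down to 0.6 even though three of the four predictions are correct.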

Chest ImaGenome Disease Classification
  • As shown in Table 4, there is progressive improvement from text-davinci-003 to ChatGPT and then to GPT-4.
  • GPT-4 zero-shot performance is improved with 10-shot random in-context examples. A further slight improvement is achieved with similarity-based example selection.
  • (For details of the other tasks, please feel free to read the paper directly.)


