Brief Review — Exploring the Boundaries of GPT-4 in Radiology

Evaluation of GPT-4 on Text-Based Radiology Reports

3 min readDec 15, 2023

Exploring the Boundaries of GPT-4 in Radiology
GPT-4 in Radiology, by Microsoft Health Futures, and Harvard University
2023 EMNLP (Sik-Ho Tsang @ Medium)
Medical/Clinical NLP/LLM
2017 … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2]
==== My Other Paper Readings Are Also Over Here ====

GPT-4 is evaluated on the text-based applications for radiology reports.

Outline

Experimental Setup
Results

1. Experimental Setup

Alongside GPT-4 (gpt-4–32k), two earlier GPT-3.5 models are evaluated: text-davinci-003 and ChatGPT (gpt-35-turbo).
For each task, zero-shot prompting is started with and prompt complexity is progressively increased to include random few-shot (a fixed set of random examples), and then similarity-based example selection (Liu et al., 2022). For example selection, OpenAI’s general-domain text-embedding-ada-002 model is used to encode the training examples as the candidate pool to select n nearest neighbours for each test instance.
For NLI, Chain-of-Thought (CoT) is also explored.
For findings summarisation, ImpressionGPT is replicated (Ma et al., 2023), which adopts dynamic example selection and iterative refinement.
To test the stability of GPT-4 output, self-consistency is applied for sentence similarity, NLI, and disease classification.
Mean and standard deviation across five runs are reported.

2. Results

2.1. GPT-4 vs SOTA Radiology Models

The key finding is that GPT-4 outperforms or is on par with SOTA radiology models in the broad range of tasks considered.

It is further noticed that different tasks require different prompting efforts and strategies.

2.2. Different Tasks in Details

As shown in Table 2, all the GPT models outperform BioViL-T, achieving new SOTA. In particular, GPT-4 significantly outperforms both text-davinci-003 and ChatGPT on MS-CXR-T, indicating an advanced understanding of disease progression.

GPT-4 with CoT achieves a new SOTA on RadNLI, outperforming DoT5 by 10% in macro F1. It is observed that CoT greatly helps in this task especially for GPT-3.5.

**Chest ImaGenome Disease Classification**

As shown in Table 4, there is progressive improvement from text-davinci-003 to ChatGPT and then to GPT-4.
GPT-4 zero-shot performance is improved with 10-shot random in-context examples. A further slight improvement is achieved with similarity-based example selection.
(For other tasks in details, please feel free to read the paper directly.)

Brief Review — Exploring the Boundaries of GPT-4 in Radiology

Evaluation of GPT-4 on Text-Based Radiology Reports

Outline

1. Experimental Setup

2. Results

2.1. GPT-4 vs SOTA Radiology Models

2.2. Different Tasks in Details

Written by Sik-Ho Tsang

No responses yet