Review — Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions

ChatGPT-3.5 on Radiation Oncology Patient Care Questions

Sik-Ho Tsang
5 min read · Apr 4, 2024

Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions
ChatGPT-3.5 on Radiation Oncology, by Northwestern University Feinberg School of Medicine
2024 JAMA Netw. Open (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam]

  • ChatGPT-3.5 is evaluated on 115 radiation oncology patient care questions.
  • 3 radiation oncologists and 3 radiation physicists rated the LLM-generated responses for relative factual correctness, relative completeness, and relative conciseness compared with online expert answers.


  1. ChatGPT-3.5 on Radiation Oncology
  2. Results

1. ChatGPT-3.5 on Radiation Oncology

  • LLMs have shown promise in answering medical test questions, simplifying radiology reports, and searching for cancer information, potentially reducing workload and optimizing performance.
  • Yet LLMs also suffer from hallucination, i.e., they can generate factually inaccurate responses.

In this paper, relative factual correctness, relative completeness, and relative conciseness of ChatGPT-3.5 on Radiation Oncology Patient Care Questions are studied.

An ideal answer would contain all clinically significant information (completeness) without any errors (correctness) or superfluous information (conciseness). (These 3 metrics are not fully described/defined in the paper.)

1.1. Question-Answer (QA) Dataset

Question-answer resources from the websites of 4 large oncology and radiation oncology groups were assessed (from February 1 to March 20, 2023). These included:

  1. a site sponsored by the Radiological Society of North America (RSNA) and the American College of Radiology (ACR);
  2. a site from the American Society for Radiation Oncology (ASTRO);
  3. a site from the National Cancer Institute (NCI) at the National Institutes of Health (NIH); and
  4. a site from the American Society of Clinical Oncology (ASCO).
  • The common patient questions retrieved were divided into 3 thematic categories: general radiation oncology, treatment modality–specific, and cancer subsite–specific questions.

A database was compiled comprising 29 general radiation oncology questions, 45 treatment modality–specific questions, and 41 cancer subsite–specific questions (115 in total).

1.2. Method

Questions were then entered into the LLM chatbot ChatGPT-3.5 (OpenAI), accessed February 20 to April 20, 2023.

  • The exact wording from the source websites was input into the LLM, except in cases where information subheadings on the websites were not provided in a question format.

1.3. Evaluation Metrics

1.3.1. Domain-Specific Metrics

  • A Turing test–like approach is used.
  • The LLM-generated responses were assessed for relative factual correctness, relative completeness, and relative conciseness and organization by 3 radiation oncologists and 3 radiation physicists.
  • A 5-point Likert scale (1: “much worse,” 2: “somewhat worse,” 3: “the same,” 4: “somewhat better,” and 5: “much better”) was used to rate the LLM answer against the expert answer on each of the 3 evaluation metrics.
  • A fourth metric, potential harm, was also evaluated using a 5-point Likert scale (0: “not at all,” 1: “slightly,” 2: “moderately,” 3: “very,” and 4: “extremely”).
  • Expert answers and LLM-generated answers were then compared using cosine similarity. Augmented Sentence BERT was used to encode each answer into a vector, and the cosine similarity was calculated between the 2 vectors.
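
To make the cosine similarity step concrete, here is a minimal sketch in plain Python. It only shows the similarity computation itself. The two short vectors are hypothetical stand-ins for the sentence embeddings; the actual encoder setup used in the paper is not reproduced here, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional embeddings (illustration only, not from the paper).
expert_vec = [0.2, 0.7, 0.1, 0.5]
llm_vec = [0.25, 0.6, 0.0, 0.55]

print(round(cosine_similarity(expert_vec, llm_vec), 3))
```

A value close to 1 means the two answers point in nearly the same direction in embedding space, while a value close to 0 means they are largely unrelated.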

1.3.2. Domain-Agnostic Metrics

  • To assess the readability of the content, a readability analysis was performed using 10 major readability assessment scales commonly used to evaluate the readability of medical literature. These 10 numeric scales included the Flesch Reading Ease, New Fog Count, Flesch-Kincaid Grade Level, Simple Measure of Gobbledygook, Coleman-Liau Index, Gunning Fog Index, FORCAST Formula, New Dale-Chall, Fry Readability, and Raygor Readability Estimate. A combined readability consensus score, which correlates with the grade level, was determined from these 10 scales.
  • 3 additional analyses of word count, lexicon, and syllable count were performed for each expert and LLM-derived answer.
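
As an illustration of what two of these scales compute, below is a small sketch of the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. It takes word, sentence, and syllable counts as inputs rather than raw text, since syllable counting is tool-specific. The function names and the example counts are my own assumptions, not from the paper, and the way the paper combines the 10 scales into a consensus score is not shown.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for one answer (illustration only).
words, sentences, syllables = 100, 5, 150

print(round(flesch_reading_ease(words, sentences, syllables), 2))
print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why the word and syllable counts are natural companion analyses.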

2. Results

A: Out of 115 radiation oncology questions retrieved from 4 professional society websites, 113 (99%) of ChatGPT (the LLM) responses posed no potential harm.

B: Of 115 total questions retrieved from professional society websites, the LLM performed the same or better on 108 responses (94%) in relative correctness, 89 responses (77%) in completeness, and 105 responses (91%) in conciseness compared with expert responses.
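
The percentages in B follow directly from the counts over 115 questions. A quick arithmetic check, using only the counts quoted above (the variable and function names are mine):

```python
TOTAL_QUESTIONS = 115

# Counts of "same or better" responses quoted in the text above.
same_or_better = {
    "correctness": 108,
    "completeness": 89,
    "conciseness": 105,
}


def percent(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)


for metric, count in same_or_better.items():
    print(f"{metric}: {count}/{TOTAL_QUESTIONS} = {percent(count, TOTAL_QUESTIONS)}%")
```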

  • C to E: Results for 3 categories.
The treatment modality–specific answers encompassed 8 subcategories: external beam radiotherapy, linear accelerator, magnetic resonance imaging-guided linear accelerator (MR-LINAC), Gamma Knife, stereotactic radiosurgery (SRS) and stereotactic body radiotherapy (SBRT), intensity-modulated radiotherapy (IMRT), proton beam radiation therapy (PBT), and image-guided radiotherapy (IGRT).
  • Within each category, the LLM was ranked as demonstrating the same, somewhat better, or much better conciseness for a range of 71% to 100% of questions; same, somewhat better, or much better completeness for 33% to 100% of questions; and same, somewhat better, or much better factual correctness for 75% to 100% of questions.

Notably, the LLM responses related to “Gamma Knife” and “SRS and SBRT” had at least 50% of the LLM answers ranked as somewhat worse or much worse completeness than expert answers.
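
To show how a "same or better" share can be derived from 5-point Likert ratings, here is a rough sketch. The paper's exact rule for combining the six raters is not described in this review, so taking the median per question is purely my assumption, and the ratings below are made up for illustration.

```python
from statistics import median


def same_or_better_share(ratings_per_question, threshold=3):
    """Fraction of questions whose median rating is at least `threshold`.

    On the scale above, 3 means "the same", so threshold=3 counts
    "the same", "somewhat better", and "much better".
    Using the median across raters is an assumption, not the paper's stated rule.
    """
    if not ratings_per_question:
        raise ValueError("need at least one question")
    hits = sum(1 for ratings in ratings_per_question if median(ratings) >= threshold)
    return hits / len(ratings_per_question)


# Made-up ratings from 6 raters for 4 questions (illustration only).
example_ratings = [
    [3, 4, 3, 5, 3, 4],
    [2, 2, 3, 1, 2, 3],
    [3, 3, 3, 3, 3, 3],
    [4, 5, 4, 4, 3, 5],
]

print(same_or_better_share(example_ratings))
```

With a different aggregation rule (for example, the mean or a majority vote), the share for a small subcategory could shift noticeably, which is worth keeping in mind when reading ranges such as 33% to 100%.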

Subsite-specific answers encompassed 11 subcategories, including colorectal, lung, breast, brain, head and neck, prostate, esophageal, pancreas, anal, gynecologic, and thyroid cancers.
  • Within the subsites, the percentage of answers ranked as same, somewhat better, or much better ranged from 75% to 100% for relative factual correctness, 50% to 100% for relative completeness, and 75% to 100% for relative conciseness.
Distribution plots of domain-agnostic metrics, including syllable count, word count, lexicon scores, readability consensus, and cosine similarity

The mean cosine similarity between expert and LLM responses for all questions was 0.75, meaning that the LLM answers had a high degree of similarity to the expert answers.


