Brief Review — Extracting symptoms from free-text responses using ChatGPT among COVID-19 cases in Hong Kong

Extract COVID-19 Symptoms Using ChatGPT & GPT-4

Sik-Ho Tsang
3 min read · Jun 13, 2024
CU Medicine finds from free-text narratives that COVID-19 symptoms change with virus mutations and vaccination status, and demonstrates AI large language models contribute to infectious disease research (Image from CUHK News)

Extracting symptoms from free-text responses using ChatGPT among COVID-19 cases in Hong Kong
Extract COVID-19 Symptoms Using ChatGPT & GPT-4, by The Chinese University of Hong Kong, RMIT University, and Imperial College London
2024 Elsevier J. Clinical Microbiology and Infection (CMI)
(Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017–2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology] [LLM on Clinical Text Summarization]
==== My Other Paper Readings Are Also Over Here ====

  • Symptoms are extracted from 300 deidentified symptom narratives of COVID-19 patients by a computer-based matching algorithm (treated as the reference standard) and by prompt engineering in ChatGPT (GPT-3.5) and GPT-4.
  • GPT-4 achieved high specificity for all symptoms, high sensitivity for common symptoms, and moderate sensitivity for less common symptoms. Few-shot prompting increased the sensitivity and specificity.
  • Its performance in converting symptom narratives to structured symptom labels was encouraging, saving time and effort in compiling the task-specific training data.

Outline

  1. Symptom Extraction
  2. Results

1. Symptom Extraction

  • Symptoms are extracted from 300 deidentified symptom narratives of COVID-19 patients in Hong Kong. These narratives were selected from the COVID-19 case series in a previous analysis [5] based on stratified sampling with respect to languages (English, Chinese, and mixed) and number of free-text characters (5-20, 21-30, 31-50, and ≥51).
  • Common symptoms were those with a prevalence >10% according to the reference standard; less common symptoms were those with a prevalence of 2-10%.
  • There were two methods of symptom extraction: (i) a computer-based matching algorithm as described [5], and (ii) prompt engineering in ChatGPT:

1.1. Computer-Based Matching Algorithm

The narratives were iteratively matched to lists of lay-person symptom expressions (and were removed if successfully matched) until no meaningful text remained.

  • Such lists were curated by the authors based on manual extraction of meaningful phrases.
  • This reference standard was originally devised to extract symptoms for about 76,000 COVID-19 cases in Hong Kong and was validated by a manual review of 200 randomly selected cases.
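The iterative match-and-remove loop above can be sketched as follows. This is a minimal illustration: the expression lists and symptom labels here are invented stand-ins for the authors' curated lists of lay-person expressions.

```python
# Hypothetical lay-person expression lists mapping to canonical symptom labels;
# the authors' actual curated lists are much larger and cover three languages.
SYMPTOM_EXPRESSIONS = {
    "fever": ["fever", "feverish", "high temperature"],
    "cough": ["cough", "coughing"],
    "sore throat": ["sore throat", "throat pain"],
}

def extract_symptoms(narrative: str) -> set[str]:
    """Iteratively match expressions and remove matched text until nothing matches."""
    text = narrative.lower()
    found = set()
    matched = True
    while matched:
        matched = False
        for label, expressions in SYMPTOM_EXPRESSIONS.items():
            for expr in expressions:
                if expr in text:
                    found.add(label)
                    text = text.replace(expr, " ")  # remove the matched span
                    matched = True
    return found

print(extract_symptoms("Feverish with coughing and throat pain"))
# → {'fever', 'cough', 'sore throat'}
```

Removing matched text before re-scanning prevents one phrase (e.g. "feverish") from being counted again once its core expression ("fever") has been consumed.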

1.2. Prompt Engineering in GPT-3.5 & GPT-4

  • Considering the complexity of the task, the limited working memory, the character limit of the context window, and the length of the symptom narratives (5-400 characters), 10-40 narratives are handled in each conversation in ChatGPT.
  • 18 conversations (six for each language) are initiated in each analysis (GPT-4, zero-shot; GPT-4, few-shot; GPT-3.5, few-shot).
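A minimal sketch of this batching scheme, assuming the narratives are plain strings. The batch size, prompt wording, and few-shot examples below are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical few-shot examples; the paper's real examples were drawn from
# the same case series and covered English, Chinese, and mixed-language text.
FEW_SHOT_EXAMPLES = (
    "Narrative: 'fever and dry cough for 2 days'\n"
    "Symptoms: fever; cough\n"
    "Narrative: 'feeling fine, no discomfort'\n"
    "Symptoms: none\n"
)

def build_prompt(narratives: list[str]) -> str:
    """Assemble one conversation's prompt from a batch of narratives."""
    lines = [
        "Extract COVID-19 symptoms from each narrative as structured labels.",
        FEW_SHOT_EXAMPLES,
    ]
    for i, text in enumerate(narratives, 1):
        lines.append(f"Narrative {i}: '{text}'")
    return "\n".join(lines)

def batch(narratives: list[str], size: int = 20) -> list[list[str]]:
    """Split narratives into batches small enough for one context window."""
    return [narratives[i:i + size] for i in range(0, len(narratives), size)]

batches = batch([f"narrative {k}" for k in range(300)], size=20)
# 300 narratives at 20 per conversation → 15 conversations
```

For zero-shot prompting, the `FEW_SHOT_EXAMPLES` block is simply omitted from the prompt.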

2. Results

  • The performance of ChatGPT was evaluated using sensitivity and specificity, the 95% binomial CIs (95% binCIs) of which were estimated using the Clopper–Pearson exact method.
  • Overall, ChatGPT recognized 54 symptoms among the 300 narratives.
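Sensitivity, specificity, and the Clopper–Pearson exact CI can be computed as below. The counts are made-up illustrative numbers (not from the paper), and the beta-quantile formulation assumes SciPy is available.

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) binomial CI for k successes in n trials,
    via quantiles of the beta distribution."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, clopper_pearson(tp, tp + fn), spec, clopper_pearson(tn, tn + fp)

# Illustrative counts: a symptom correctly flagged in 34 of 40 narratives
# where it was present, and correctly absent in 255 of 260 others.
sens, sens_ci, spec, spec_ci = sensitivity_specificity(34, 6, 255, 5)
```

Note that when the point estimate is exactly 1.000 (all cases detected), the exact CI still has a lower bound below 1, which is why the paper reports intervals like 0.951–1.000 for a sensitivity of 1.000.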

2.1. Zero-Shot Prompting in GPT-4

  • Using zero-shot prompting in GPT-4, among common symptoms, the sensitivity ranged from 0.853 (95% binCI: 0.689-0.950) (fatigue) to 1.000 (95% binCI: 0.951-1.000) (headache), and the specificity from 0.947 (95% binCI: 0.894-0.978) (sore throat) to 1.000 (95% binCIs from 0.965-1.000 to 0.986-1.000) (cough, fever, runny nose, and chills).
  • Among less common symptoms, the sensitivity ranged from 0.200 (95% binCI: 0.043-0.481) (ostalgia) to 1.000 (95% binCIs from 0.590-1.000 to 0.815-1.000) (blocked nose, nausea or vomiting, disturbance of taste or smell, and voice disorder), and the specificity from 0.993 (95% binCIs from 0.975-0.999 to 0.976-0.999) to 1.000 (95% binCIs from 0.987-1.000 to 0.988-1.000).

2.2. Few-Shot Prompting in GPT-4

  • Using few-shot prompting in GPT-4, the sensitivity was from 0.944 (95% binCI: 0.846-0.988) to 1.000 (95% binCIs from 0.897-1.000 to 0.981-1.000), and the specificity was from 0.985 (95% binCI: 0.946-0.998) to 1.000 (95% binCIs from 0.965-1.000 to 0.986-1.000) for common symptoms.
  • The sensitivity was from 0.625 (95% binCI: 0.245-0.915) to 1.000 (95% binCIs from 0.541-1.000 to 0.824-1.000), and the specificity was from 0.976 (95% binCI: 0.952-0.990) to 1.000 (95% binCIs from 0.987-1.000 to 0.988-1.000) for less common symptoms.

Compared to GPT-3.5 using few-shot prompting, GPT-4 achieved higher sensitivity across almost all common and less common symptoms.
