Brief Review — Extracting symptoms from free-text responses using ChatGPT among COVID-19 cases in Hong Kong

Extract COVID-19 Symptoms Using ChatGPT & GPT-4

Jun 13, 2024

CU Medicine finds from free-text narratives that COVID-19 symptoms change with virus mutations and vaccination status, and demonstrates AI large language models contribute to infectious disease research (Image from CUHK News)

Extracting symptoms from free-text responses using ChatGPT among COVID-19 cases in Hong Kong
Extract COVID-19 Symptoms Using ChatGPT & GPT-4, by The Chinese University of Hong Kong, RMIT University, and Imperial College London
2024 Elsevier J. Clinical Microbiology and Infection (CMI) (Sik-Ho Tsang @ Medium)
Medical/Clinical/Healthcare NLP/LLM
2017 … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology] [LLM on Clinical Text Summarization]
==== My Other Paper Readings Are Also Over Here ====

Symptoms are extracted from 300 deidentified symptom narratives of COVID-19 patients by two methods: a computer-based matching algorithm (the reference standard) and prompt engineering in ChatGPT (GPT-3.5 and GPT-4).
GPT-4 achieved high specificity for all symptoms, high sensitivity for common symptoms, and moderate sensitivity for less common symptoms. Few-shot prompting increased the sensitivity and specificity.
Its performance in converting symptom narratives to structured symptom labels was encouraging, saving time and effort in compiling the task-specific training data.

Outline

Symptom Extraction
Results

1. Symptom Extraction

Symptoms are extracted from 300 deidentified symptom narratives of COVID-19 patients in Hong Kong. These narratives were selected from the COVID-19 case series in a previous analysis [5] based on stratified sampling with respect to languages (English, Chinese, and mixed) and number of free-text characters (5-20, 21-30, 31-50, and ≥51).
Common symptoms were those with a prevalence >10% according to the standard, and similarly less common symptoms were those with a prevalence of 2-10%.
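The sampling strata and prevalence cutoffs described above can be sketched as follows. This is only an illustration: the equal allocation per stratum (`per_stratum`), the function names, and the "below 2%" label are assumptions, not details given in the paper.

```python
import random
from collections import defaultdict

# Character-count strata from the text: 5-20, 21-30, 31-50, and >=51.
LENGTH_BINS = [(5, 20), (21, 30), (31, 50), (51, None)]

def length_bin(text):
    """Return the label of the length stratum, or None if shorter than 5 characters."""
    n = len(text)
    for lo, hi in LENGTH_BINS:
        if n >= lo and (hi is None or n <= hi):
            return f"{lo}-{hi}" if hi is not None else f">={lo}"
    return None

def stratified_sample(records, per_stratum, seed=0):
    """records: list of (language, narrative) pairs.
    Draws up to per_stratum narratives from each (language, length bin) cell.
    Equal allocation per cell is an assumption made for this sketch."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for language, narrative in records:
        label = length_bin(narrative)
        if label is not None:
            strata[(language, label)].append(narrative)
    return {
        key: rng.sample(items, min(per_stratum, len(items)))
        for key, items in sorted(strata.items())
    }

def prevalence_class(n_with_symptom, n_narratives):
    """Classify a symptom by its prevalence under the reference standard."""
    if 100 * n_with_symptom > 10 * n_narratives:   # prevalence > 10%
        return "common"
    if 100 * n_with_symptom >= 2 * n_narratives:   # prevalence 2-10%
        return "less common"
    return "below 2%"  # not named in the text; label chosen for this sketch
```

With 300 narratives, a symptom found in 31 narratives would be classified as common and one found in 30 as less common.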
There were two methods of symptom extraction: (i) a computer-based matching algorithm as described [5], and (ii) prompt engineering in ChatGPT:

1.1. Computer-Based Matching Algorithm

The narratives were iteratively matched against lists of lay-person symptom expressions, with each matched expression removed from the text, until no meaningful text remained.

Such lists were curated by the authors based on manual extraction of meaningful phrases.
This reference standard was originally devised to extract symptoms for about 76,000 COVID-19 cases in Hong Kong and was validated by a manual review of 200 randomly selected cases.
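A minimal sketch of this match-and-remove loop is given below. The expression lists here are tiny, hypothetical stand-ins; the authors' curated lists are not reproduced in this review, and their actual implementation may differ.

```python
# Hypothetical expression lists for illustration only.
SYMPTOM_EXPRESSIONS = {
    "fever": ["high temperature", "fever"],
    "cough": ["coughing", "cough"],
    "sore throat": ["sore throat", "throat pain"],
}

def match_symptoms(narrative, expressions=SYMPTOM_EXPRESSIONS):
    """Repeatedly match known expressions, remove them from the text,
    and return the symptom labels found plus whatever text is left over."""
    remaining = narrative.lower()
    found = set()
    changed = True
    while changed:
        changed = False
        for symptom, phrases in expressions.items():
            # Try longer phrases first so "coughing" is removed before "cough".
            for phrase in sorted(phrases, key=len, reverse=True):
                if phrase in remaining:
                    remaining = remaining.replace(phrase, " ")
                    found.add(symptom)
                    changed = True
    leftover = " ".join(remaining.split())
    return found, leftover
```

Any leftover text that still looks meaningful (for example "headache" in this toy setup) signals an expression that should be added to the lists before the next pass.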

1.2. Prompt Engineering in GPT-3.5 & GPT-4

Given the complexity of the task, ChatGPT's limited working memory, the character limit of the context window, and the length of the symptom narratives (5-400 characters), each ChatGPT conversation handled 10-40 narratives.
Eighteen conversations (six per language) were initiated for each analysis (GPT-4, zero-shot; GPT-4, few-shot; GPT-3.5, few-shot).
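The batching and the zero-shot versus few-shot distinction can be sketched as below. The prompt wording, the function names, and the default batch size are all assumptions for illustration; the paper's actual prompts are not reproduced in this review, and no API call is made here.

```python
def batch_narratives(narratives, batch_size=20):
    """Split narratives into per-conversation batches of 10-40 items."""
    if not 10 <= batch_size <= 40:
        raise ValueError("batch_size should stay within the 10-40 range from the text")
    return [narratives[i:i + batch_size] for i in range(0, len(narratives), batch_size)]

def build_prompt(batch, examples=()):
    """Build one prompt for a batch.
    examples: optional (narrative, [symptoms]) pairs; leaving it empty gives a
    zero-shot prompt, supplying a few pairs gives a few-shot prompt."""
    lines = [
        "Extract the symptoms mentioned in each numbered narrative.",
        "Return one line per narrative: <number>: <comma-separated symptoms>.",
    ]
    if examples:
        lines.append("Examples:")
        for text, symptoms in examples:
            lines.append(f"Narrative: {text}")
            lines.append(f"Symptoms: {', '.join(symptoms)}")
    lines.append("Narratives:")
    for i, text in enumerate(batch, 1):
        lines.append(f"{i}. {text}")
    return "\n".join(lines)
```

For instance, a hypothetical set of 100 narratives in one language with `batch_size=17` yields six batches, in line with six conversations per language.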

2. Results

The performance of ChatGPT was evaluated with sensitivity and specificity, the 95% binomial CIs (95% binCIs) of which were estimated using the Clopper-Pearson exact method.
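As a rough illustration of these metrics, the sketch below computes sensitivity and specificity against the reference standard and a Clopper-Pearson interval. To stay dependency-free it solves the defining binomial tail equations by bisection, which is a substitute for the usual beta-quantile formula; the function names are hypothetical.

```python
from math import comb

def sensitivity_specificity(reference, predicted):
    """reference, predicted: equal-length lists of booleans (symptom present or not).
    Assumes at least one positive and one negative in the reference."""
    pairs = list(zip(reference, predicted))
    tp = sum(r and p for r, p in pairs)
    fn = sum(r and not p for r, p in pairs)
    tn = sum(not r and not p for r, p in pairs)
    fp = sum(not r and p for r, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided CI for k successes out of n trials."""
    def solve(func, target):
        # func decreases in p on [0, 1]; bisect for func(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if func(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper
```

For example, `clopper_pearson(3, 15)` returns approximately (0.043, 0.481) for an estimate of 0.200.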
Overall, ChatGPT recognized 54 symptoms among the 300 narratives.

2.1. Zero-Shot Prompting in GPT-4

Using zero-shot prompting in GPT-4, among common symptoms, the sensitivity ranged from 0.853 (95% binCI: 0.689-0.950) (fatigue) to 1.000 (95% binCI: 0.951-1.000) (headache), and the specificity from 0.947 (95% binCI: 0.894-0.978) (sore throat) to 1.000 (95% binCI lower bounds: 0.965-0.986; upper bound: 1.000) (cough, fever, runny nose, and chills).
Among less common symptoms, the sensitivity ranged from 0.200 (95% binCI: 0.043-0.481) (ostalgia) to 1.000 (95% binCI lower bounds: 0.590-0.815; upper bound: 1.000) (blocked nose, nausea or vomiting, disturbance of taste or smell, and voice disorder), and the specificity from 0.993 (95% binCI lower bounds: 0.975-0.976; upper bound: 0.999) to 1.000 (95% binCI lower bounds: 0.987-0.988; upper bound: 1.000).
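The intervals attached to estimates of 1.000 can be read through a closed form of the Clopper-Pearson method. When all $n$ cases are classified correctly, the upper bound is 1 and the lower bound $p_L$ solves $p_L^{\,n} = \alpha/2$. The value $n = 73$ below is only an illustrative assumption, since the per-symptom counts are not given in this review:

```latex
p_L = \left(\frac{\alpha}{2}\right)^{1/n},
\qquad
\alpha = 0.05,\; n = 73
\;\Rightarrow\;
p_L = 0.025^{1/73} \approx 0.951
```

This is why the lower bound varies between symptoms that all score 1.000: the fewer cases a symptom has, the wider its interval.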

2.2. Few-Shot Prompting in GPT-4

Using few-shot prompting in GPT-4, the sensitivity ranged from 0.944 (95% binCI: 0.846-0.988) to 1.000 (95% binCI lower bounds: 0.897-0.981; upper bound: 1.000), and the specificity from 0.985 (95% binCI: 0.946-0.998) to 1.000 (95% binCI lower bounds: 0.965-0.986; upper bound: 1.000) for common symptoms.
For less common symptoms, the sensitivity ranged from 0.625 (95% binCI: 0.245-0.915) to 1.000 (95% binCI lower bounds: 0.541-0.824; upper bound: 1.000), and the specificity from 0.976 (95% binCI: 0.952-0.990) to 1.000 (95% binCI lower bounds: 0.987-0.988; upper bound: 1.000).

Compared to GPT-3.5 using few-shot prompting, GPT-4 achieved higher sensitivity across almost all common and less common symptoms.


Written by Sik-Ho Tsang