Brief Review — Extracting symptoms from free-text responses using ChatGPT among COVID-19 cases in Hong Kong
Extract COVID-19 Symptoms Using ChatGPT & GPT-4, by The Chinese University of Hong Kong, RMIT University, and Imperial College London
2024 Elsevier J. Clinical Microbiology and Infection (CMI) (Sik-Ho Tsang @ Medium)
Medical/Clinical/Healthcare NLP/LLM
2017 … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT-4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology] [LLM on Clinical Text Summarization]
==== My Other Paper Readings Are Also Over Here ====
- Symptoms are extracted from 300 deidentified symptom narratives of COVID-19 patients by (i) a computer-based matching algorithm (the reference standard) and (ii) prompt engineering in ChatGPT and GPT-4.
- GPT-4 achieved high specificity for all symptoms, high sensitivity for common symptoms, and moderate sensitivity for less common symptoms. Few-shot prompting increased the sensitivity and specificity.
- The models' performance in converting symptom narratives into structured symptom labels was encouraging, saving the time and effort of compiling task-specific training data.
Outline
- Symptom Extraction
- Results
1. Symptom Extraction
- Symptoms are extracted from 300 deidentified symptom narratives of COVID-19 patients in Hong Kong. These narratives were selected from the COVID-19 case series in a previous analysis [5] by stratified sampling with respect to language (English, Chinese, and mixed) and number of free-text characters (5-20, 21-30, 31-50, and ≥51); a sampling sketch follows this list.
- Common symptoms were those with a prevalence >10% according to the reference standard; less common symptoms were those with a prevalence of 2-10%.
- There were two methods of symptom extraction, detailed below: (i) a computer-based matching algorithm as described in [5], and (ii) prompt engineering in ChatGPT.
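As a rough illustration of the sampling design, here is a minimal sketch in Python, assuming a pandas DataFrame with hypothetical column names (`narrative`, `language`); the per-stratum sample size of 25 is an assumption (300 narratives over 12 strata), not a figure from the paper.

```python
import pandas as pd

# Illustrative data; column names and records are assumptions, not from the paper.
cases = pd.DataFrame({
    "narrative": ["fever and cough", "頭痛發燒兩日", "sore throat 喉嚨痛", "runny nose"],
    "language": ["English", "Chinese", "Mixed", "English"],
})

# Bin narratives by character count into the strata described in the paper.
cases["length_stratum"] = pd.cut(
    cases["narrative"].str.len(),
    bins=[5, 20, 30, 50, float("inf")],
    labels=["5-20", "21-30", "31-50", ">=51"],
    include_lowest=True,
)

# Draw up to 25 narratives per (language, length) stratum -- 25 is illustrative.
sample = (
    cases.groupby(["language", "length_stratum"], observed=True, group_keys=False)
         .apply(lambda g: g.sample(n=min(len(g), 25), random_state=0))
)
print(sample)
```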
1.1. Computer-Based Matching Algorithm
The narratives were iteratively matched against lists of lay-person symptom expressions (matched text being removed on each pass) until no meaningful text remained; a sketch of this loop follows below.
- Such lists were curated by the authors based on manual extraction of meaningful phrases.
- This reference standard was originally devised to extract symptoms for about 76,000 COVID-19 cases in Hong Kong and was validated by a manual review of 200 randomly selected cases.
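A minimal sketch of this matching loop, assuming a toy lexicon of lay expressions (the paper's curated lists, built by manual extraction, are far larger and cover both English and Chinese phrasing):

```python
# Toy lay-expression lexicon; the paper's curated lists are far larger.
SYMPTOM_LEXICON = {
    "fever": ["fever", "feverish", "發燒", "發熱"],
    "cough": ["cough", "咳嗽", "咳"],
    "sore throat": ["sore throat", "throat pain", "喉嚨痛"],
}

def extract_symptoms(narrative: str) -> set[str]:
    """Iteratively match lay expressions, removing matched text on each pass,
    until nothing more matches (a stand-in for 'no meaningful text remains')."""
    text = narrative.lower()
    found = set()
    matched = True
    while matched:
        matched = False
        for symptom, expressions in SYMPTOM_LEXICON.items():
            for expr in expressions:
                if expr in text:
                    found.add(symptom)
                    text = text.replace(expr, " ")  # remove the matched span
                    matched = True
    return found

print(extract_symptoms("Fever and 咳嗽 since yesterday"))  # fever, cough
```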
1.2. Prompt Engineering in GPT-3.5 & GPT-4
- Considering the complexity of the task, the limited working memory, the character limit of the context window, and the length of the symptom narratives (5-400 characters), 10-40 narratives were handled in each conversation in ChatGPT.
- 18 conversations (six per language) were initiated in each analysis (GPT-4, zero-shot; GPT-4, few-shot; GPT-3.5, few-shot); a programmatic sketch of this batching follows below.
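The authors worked in the ChatGPT interface itself; a programmatic analogue of the batching, using the OpenAI Python SDK, might look like the sketch below. The prompt wording, the output format, and `temperature=0` are assumptions, not the paper's exact settings.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative instruction; the paper's exact prompt wording is not reproduced here.
SYSTEM_PROMPT = (
    "You will receive numbered COVID-19 symptom narratives in English, Chinese, "
    "or mixed language. For each, return a line '<number>: <comma-separated "
    "standardized symptom labels>'. If no symptom is mentioned, return 'none'."
)

def label_batch(narratives: list[str], model: str = "gpt-4") -> str:
    """Send one batch (10-40 narratives, per the paper) in a single conversation."""
    numbered = "\n".join(f"{i + 1}. {n}" for i, n in enumerate(narratives))
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # assumption: deterministic decoding for extraction
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": numbered},
        ],
    )
    return response.choices[0].message.content
```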
2. Results
- The performance of ChatGPT was evaluated with sensitivity and specificity, whose 95% binomial CIs (95% binCIs) were estimated using the Clopper-Pearson exact method (see the sketch after this list).
- Overall, ChatGPT recognized 54 symptoms among the 300 narratives.
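For reference, the Clopper-Pearson exact interval can be computed from Beta quantiles; a minimal sketch using SciPy. The 74/74 example is an assumption chosen to be consistent with the reported headache binCI of 0.951-1.000, not a count from the paper.

```python
from scipy.stats import beta

def clopper_pearson(successes: int, trials: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) binomial CI, as used here for sensitivity/specificity."""
    lower = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return lower, upper

# E.g. a sensitivity of 74/74 = 1.000 yields (0.951, 1.000), matching the headache
# binCI quoted below; 74 is an illustrative denominator, not from the paper.
print(clopper_pearson(74, 74))
```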
2.1. Zero-Shot Prompting in GPT-4
- Using zero-shot prompting in GPT-4, among common symptoms, the sensitivity ranged from 0.853 (95% binCI: 0.689-0.950) (fatigue) to 1.000 (95% binCI: 0.951-1.000) (headache), and the specificity from 0.947 (95% binCI: 0.894-0.978) (sore throat) to 1.000 (95% binCI: 0.965-0.986, 1.000) (cough, fever, runny nose, and chills). Where several symptoms share a point estimate, the binCI reports the range of lower bounds followed by the common upper bound.
- Among less common symptoms, the sensitivity ranged from 0.200 (95% binCI: 0.043-0.481) (ostalgia) to 1.000 (95% binCI: 0.590-0.815, 1.000) (blocked nose, nausea or vomiting, disturbance of taste or smell, and voice disorder), and the specificity from 0.993 (95% binCI: 0.975-0.976, 0.999) to 1.000 (95% binCI: 0.987-0.988, 1.000).
2.2. Few-Shot Prompting in GPT-4
- Using few-shot prompting in GPT-4, for common symptoms the sensitivity ranged from 0.944 (95% binCI: 0.846-0.988) to 1.000 (95% binCI: 0.897-0.981, 1.000), and the specificity from 0.985 (95% binCI: 0.946-0.998) to 1.000 (95% binCI: 0.965-0.986, 1.000).
- For less common symptoms, the sensitivity ranged from 0.625 (95% binCI: 0.245-0.915) to 1.000 (95% binCI: 0.541-0.824, 1.000), and the specificity from 0.976 (95% binCI: 0.952-0.990) to 1.000 (95% binCI: 0.987-0.988, 1.000).
Compared with GPT-3.5 under few-shot prompting, GPT-4 achieved higher sensitivity for almost all common and less common symptoms.
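As an illustration of the few-shot setup, worked examples can be prepended to the same instruction used in zero-shot runs. The exemplars below are invented for the sketch, not the paper's actual in-context examples, and `few_shot_messages` is a hypothetical helper.

```python
# Invented few-shot exemplars; the paper's actual in-context examples differ.
FEW_SHOT = [
    {"role": "user", "content": "1. 發燒咳嗽兩日"},
    {"role": "assistant", "content": "1: fever, cough"},
    {"role": "user", "content": "1. Runny nose, slight headache"},
    {"role": "assistant", "content": "1: runny nose, headache"},
]

def few_shot_messages(system_prompt: str, batch_text: str) -> list[dict]:
    """Prepend worked examples to the same instruction used in zero-shot runs."""
    return [
        {"role": "system", "content": system_prompt},
        *FEW_SHOT,
        {"role": "user", "content": batch_text},
    ]
```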