Brief Review — Artificial intelligence-powered chatbots in search engines: a cross-sectional study on the quality and risks of drug information for patients
AI Chatbot Study on Drug Info for Patients, by Friedrich-Alexander-Universität Erlangen-Nürnberg, GSK, Universitätsklinikum Erlangen
2024 BMJ Quality & Safety (Sik-Ho Tsang @ Medium)
Healthcare/Medical LLM Evaluation
2023 [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology] [LLM on Clinical Text Summarization] [Extract COVID-19 Symptoms Using ChatGPT & GPT-4] [ChatGPT on Patients Medication]
My Healthcare and Medical Related Paper Readings and Tutorials
- (As suggested by my leader due to project needs, I read this paper recently. It analyses the quality and risks of LLM answers to drug information queries.)
- In this study, Bing Copilot was queried with 10 frequently asked patient questions about the 50 most prescribed drugs in the US outpatient market.
- Readability of the chatbot answers was assessed using the Flesch Reading Ease Score. Completeness and accuracy were evaluated against the corresponding patient drug information in the pharmaceutical encyclopaedia drugs.com.
- The likelihood and extent of possible harm, should patients follow the chatbot’s recommendations, were also evaluated.
Outline
- Methodology
- Results
1. Methodology
- The 10 selected questions are referred to as ‘patient questions’. With 50 drugs included in the study, a total of 500 answers were generated and analysed.
1.1. Readability
- Readability of chatbot answers was assessed by calculating the ‘Flesch Reading Ease Score’. The Flesch Reading Ease Score is calculated based on total number of words, sentences, and syllables:
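Using the total numbers of words, sentences, and syllables, the standard Flesch Reading Ease formula is:

$$\text{FRES} = 206.835 - 1.015 \times \frac{\text{total words}}{\text{total sentences}} - 84.6 \times \frac{\text{total syllables}}{\text{total words}}$$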
- This score estimates the educational level readers need to understand a particular text.
- 0–30: very difficult to read, necessitating a college graduate level;
- 31–50: difficult to read; 51–60: fairly difficult; 61–70: standard difficulty; 71–80: fairly easy; 81–90: easy;
- 91–100: very easy to read, appropriate for fifth graders or 11-year-olds.
1.2. Completeness and Accuracy
- For the assessment of completeness and accuracy of chatbot answers, a reference database was created by a clinical pharmacist (WA) and a physician with expertise in pharmacology (HFN).
- All statements were initially evaluated by a medical student (SMS) trained in pharmacology and reviewed by a clinical pharmacist (WA) and a physician with expertise in pharmacology (HFN).
- Completeness and accuracy for a chatbot answer were defined as:
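A plausible reconstruction of the two definitions, consistent with the variable names explained below (the 0.5 credit given to partially true statements is my assumption), is:

$$\text{completeness} = \frac{\text{matching\_statements}}{\text{reference\_statements}} \times 100\%$$

$$\text{accuracy} = \frac{\text{true\_statements} + 0.5 \times \text{partially true\_statements}}{\text{true\_statements} + \text{partially true\_statements} + \text{false\_statements}} \times 100\%$$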
- where matching_statements is the number of matching statements between the chatbot statements and the statements in the reference database, reference_statements is the number of statements in the reference database, true_statements is the number of true statements, partially true_statements is the number of partially true statements and false_statements is the number of false statements.
- Chatbot answers for each patient question for all 50 drugs were evaluated by calculating median and IQRs as well as arithmetic mean and sample SD of completeness and accuracy.
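As an illustration only (not the authors’ code), this per-question aggregation could look as follows in Python, assuming the per-drug completeness or accuracy scores (in %) have already been computed; the example scores are hypothetical:

```python
import numpy as np

def summarize(scores):
    """Median, IQR, arithmetic mean and sample SD of one metric for one patient question."""
    scores = np.asarray(scores, dtype=float)
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    return {
        "median": median,
        "IQR": (q1, q3),
        "mean": scores.mean(),
        "SD": scores.std(ddof=1),  # sample SD
    }

# Hypothetical completeness scores of one patient question across 50 drugs
example_scores = [100.0] * 35 + [50.0] * 10 + [0.0] * 5
print(summarize(example_scores))
```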
1.3. Scientific Consensus and Possible Patient Harm
- For this survey, a subset of 20 chatbot answers was selected by three authors (WA, SMS and HFN) based on (1) low accuracy, (2) low completeness, or (3) a potential risk to patient safety.
- Consequently, 7 experts carried out 140 evaluations per criterion (20 answers × 7 experts).
- Experts rated scientific consensus as ‘aligned’ or ‘opposed to scientific consensus’ based on whether the chatbot’s answer was congruent with current scientific knowledge. In situations where evidence was lacking or recommendations were conflicting, ‘no consensus’ was chosen.
- For the extent of harm, scale levels based on the ‘Agency for Healthcare Research and Quality’ (AHRQ) harm scales for rating patient safety events were used, allowing for an evaluation of ‘no harm’, ‘mild or moderate harm’ and ‘death or severe harm’.
- Regardless of the extent of possible harm, the likelihood of possible harm resulting from chatbot answers was estimated by experts to be either ‘high’, ‘medium’, ‘low’ or ‘no harm’ at all.
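To make the rating scheme concrete, here is a minimal sketch of how a single expert evaluation could be represented; the class and field names are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass
from enum import Enum

class Consensus(Enum):
    ALIGNED = "aligned with scientific consensus"
    OPPOSED = "opposed to scientific consensus"
    NO_CONSENSUS = "no consensus"

class HarmExtent(Enum):  # categories follow the AHRQ harm scales mentioned above
    NO_HARM = "no harm"
    MILD_OR_MODERATE = "mild or moderate harm"
    DEATH_OR_SEVERE = "death or severe harm"

class HarmLikelihood(Enum):
    NO_HARM = "no harm"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class ExpertRating:
    answer_id: int            # one of the 20 selected chatbot answers
    expert_id: int            # one of the 7 experts
    consensus: Consensus
    extent: HarmExtent
    likelihood: HarmLikelihood
```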
2. Results
2.1. Readability
The overall mean (SD) Flesch Reading Ease Score of chatbot answers was 37.2 (17.7), indicating that a high educational level (college level) is required of the reader.
- The chatbot answers to patient question 10 had the lowest readability, with a mean (SD) of 26.5 (13.9), corresponding to a college graduate’s educational level.
- The highest readability of chatbot answers still required an educational level of 10th–12th grades (high school) and was observed for patient question 4, with a mean (SD) of 59.2 (16.3).
2.2. Completeness
Overall median completeness of chatbot answers was 100.0% (IQR 50.0–100.0%) with a mean (SD) of 76.7% (32.0%).
- In 11 of 495 (2.2%) chatbot answers, the presented statements did not match any statements in the reference database, resulting in a completeness of 0.0%. Of these, six chatbot answers related to question 6 and three to question 3.
2.3. Accuracy
Accuracy assessment for chatbot answers revealed an overall median accuracy of 100.0% (IQR 88.1–100.0%) with a mean (SD) of 88.7% (22.3%).
- Chatbot statements were inconsistent with the reference database in 126 of 484 (26.0%) chatbot answers, that is, had an accuracy <100.0%, and were fully inconsistent in 16 of 484 (3.3%) chatbot answers, that is, had an accuracy of 0.0%.
A substantial factor contributing to incompleteness or inaccuracy in chatbot answers was the chatbot’s inability to address the underlying intent of a patient question.
- Not all inaccuracies or missing information pose an immediate or significant risk.
2.4. Scientific Consensus and Possible Patient Harm
- Scientific Consensus: Only 54% (95% CI 35% to 70%) of the subset of chatbot answers were rated to be aligned with scientific consensus. Conversely, 39% (95% CI 25% to 55%) of these chatbot answers were found to oppose the scientific consensus, while for the remaining 6% (95% CI 0% to 15%) there was no established scientific consensus.
It is noted that alignment with scientific consensus was not necessarily correlated with the absence of potential harm. This might be attributed to missing information in chatbot answers.
- Likelihood of Harm: Possible harm resulting from a patient following the chatbot’s advice was rated to occur with a high likelihood in 3% (95% CI 0% to 10%) and with a medium likelihood in 29% (95% CI 10% to 50%) of the subset of chatbot answers. A further 34% (95% CI 15% to 50%) of chatbot answers each were judged as leading to possible harm with a low likelihood or as leading to no harm at all.
- Extent of Harm: 42% (95% CI 25% to 60%) of these chatbot answers were considered to lead to moderate or mild harm and 22% (95% CI 10% to 40%) to death or severe harm. Correspondingly, 36% (95% CI 20% to 55%) of chatbot answers were considered to lead to no harm.
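The paper reports 95% CIs for these proportions; the CI method is not stated in this summary, but as an illustration, an exact (Clopper-Pearson) interval could be computed as follows (the counts in the example are hypothetical):

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) CI for a binomial proportion (k successes out of n)."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# e.g. 11 of 20 answers rated as aligned with scientific consensus (hypothetical counts)
print(clopper_pearson(11, 20))
```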
2.5. Comparison with Existing Literature
- Some studies observed frequently incorrect and harmful answers when querying ChatGPT with drug-related questions.
- Some observed limited completeness as well as varying accuracy and consistency, especially with regard to patient medication education-related questions.
However, none of them explicitly analysed the patient’s perspective on the potential benefits and risks of chatbots and AI-powered search engines.
2.6. Technical Requirements and Implications for Clinical Practice
- When generating a chatbot answer, both reliable and unreliable sources can be cited.
- Even though the chatbot cites sources in its answers, only whole websites rather than specific paragraphs are referenced. This can lead to incorrectly merged information, creating statements that were never made in the primary source.
- The chatbot also lacks the capability to verify the currentness of its sources, possibly leading to obsolete information in its answers.
- An AI-powered chatbot that exhibits these technical flaws and is built solely on a stochastic model will always pose a risk of generating nonsensical or untruthful content, possibly leading to harm. Accordingly, a clear disclaimer should be mandatorily displayed, indicating that the information provided by a chatbot is not intended to replace professional advice.
An LLM processing the patient’s language, followed by a search and literal citation restricted to validated patient information databases or dictionaries, could represent a suitable approach for a patient-safe AI-powered search engine. Such an error-free AI-powered search engine could be especially beneficial to patients with limited access to medicines information.
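A minimal sketch of this retrieval-restricted idea, with all names, sources and helper functions being illustrative assumptions rather than anything described in the paper: the LLM only interprets the patient’s question, retrieval is limited to a whitelist of validated sources, and retrieved text is quoted literally with its source.

```python
from dataclasses import dataclass

# Hypothetical whitelist of validated patient information sources
VALIDATED_SOURCES = {"drugs.com", "medlineplus.gov"}

@dataclass
class Passage:
    source: str  # domain of the validated source
    text: str    # literal paragraph taken from that source

def interpret_question(question: str) -> str:
    """Use an LLM only to normalise the patient's wording into a search query
    (placeholder: a real system would call an LLM here)."""
    return question.lower().rstrip("?")

def search_validated(query: str) -> list[Passage]:
    """Placeholder retrieval step over an index built exclusively from VALIDATED_SOURCES."""
    return []  # a real system would return matching passages here

def answer(question: str) -> str:
    query = interpret_question(question)
    passages = [p for p in search_validated(query) if p.source in VALIDATED_SOURCES]
    if not passages:
        return "No validated information found. Please consult a healthcare professional."
    # Quote literally instead of paraphrasing, and always cite the exact source
    return "\n\n".join(f'"{p.text}" (source: {p.source})' for p in passages)
```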