Brief Review — Accuracy of a Chatbot in Answering Questions that Patients Should Ask Before Taking a New Medication

ChatGPT on Patient Medication

Sik-Ho Tsang
4 min read · Jun 18, 2024

ChatGPT on Patient Medication, by University of Arizona
2024 JAPH (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017–2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT-4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology] [LLM on Clinical Text Summarization] [Extract COVID-19 Symptoms Using ChatGPT & GPT-4]
==== My Other Paper Readings Are Also Over Here ====

  • I came across this paper because my project leaders recently shared it for review, as it is relevant to our project’s needs.
  • In this paper, the authors evaluate the accuracy of answers provided by a chatbot (ChatGPT) in response to questions that patients should ask before taking a new medication.


  1. ChatGPT on Patient Medication
  2. Results

1. ChatGPT on Patient Medication

1.1. Method

12 Questions from AHRQ
  • 12 questions obtained from the Agency for Healthcare Research and Quality (AHRQ) were asked to a chatbot (ChatGPT) for the top 20 drugs.
  • The top 20 drugs are listed below, with issues noted:
Top 20 Drugs (a sizable number target the “three hypers”: hypertension, hyperlipidemia, and hyperglycemia)

Therefore, 12 questions asked across 20 medications generated 240 individual responses from the model.
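The prompting design above is a simple cross of questions and drugs; a minimal sketch of how the 240 question–medication pairs could be enumerated (the labels are placeholders, not the actual AHRQ questions or drug names from the paper):

```python
from itertools import product

# Placeholder labels -- the actual 12 AHRQ questions and the top-20
# drug list are given in the paper, not reproduced here.
questions = [f"AHRQ question {i}" for i in range(1, 13)]  # 12 questions
drugs = [f"drug {j}" for j in range(1, 21)]               # top 20 drugs

# Each (question, drug) pair is one prompt sent to the chatbot.
prompts = list(product(questions, drugs))
print(len(prompts))  # 12 x 20 = 240 individual responses
```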

1.2. Correctness and Completeness

  • 2 reviewers independently evaluated and rated each response on a 6-point scale for correctness and a 3-point scale for completeness, with a completeness score of 2 considered adequate.
  • 6-point correctness scale (1 = completely incorrect; 2 = more incorrect than correct; 3 = approximately equal parts correct and incorrect; 4 = more correct than incorrect; 5 = nearly all correct; and 6 = completely correct). Accuracy was determined using clinical expertise and a drug information database.
  • 3-point completeness scale (1 = incomplete [addresses some aspects of the question, but significant parts are missing or incomplete]; 2 = adequate [addresses all aspects of the question and provides the minimum amount of information required to be considered complete]; and 3 = comprehensive [addresses all aspects of the question and provides additional information or context beyond what was expected]).
  • After the independent reviews, the 2 reviewers met to compare answers, discuss any discrepancies, and assign consensus scores for correctness and completeness.

1.3. Reproducibility

  • To assess reproducibility, responses scored less than 6 for correctness were reassessed 14 days later. On reassessment, a response could improve in correctness, show no change, or decrease in correctness.

2. Results

2.1. Correctness

Out of 240 responses, 222 (92.5%) were assessed as completely correct. Of the remaining 18 responses, 10 (4.2%) were nearly all correct, 5 (2.1%) were more correct than incorrect, 2 (0.8%) were equal parts correct and incorrect, 1 (0.4%) was more incorrect than correct, and none (0%) were completely incorrect.

2.2. Completeness

Of the 240 responses, 194 (80.8%) were comprehensively complete. A score of 2 was considered adequate, and 235 (97.9%) scored 2 or higher, indicating at least an adequate level of completeness; 5 (2.1%) were considered incomplete.
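The reported rates follow directly from the raw counts; a quick sketch that recomputes the percentages in Sections 2.1 and 2.2 (counts taken from the text above, with the adequate-completeness count derived as 240 − 194 − 5):

```python
# Correctness counts from Section 2.1 (scale score -> number of responses)
correctness = {6: 222, 5: 10, 4: 5, 3: 2, 2: 1, 1: 0}
total = sum(correctness.values())
assert total == 240

pct = {score: round(100 * n / total, 1) for score, n in correctness.items()}
print(pct[6])  # 92.5 -> completely correct

# Completeness counts from Section 2.2; the count of 41 for score 2 is
# derived (240 total - 194 comprehensive - 5 incomplete), not stated directly.
completeness = {3: 194, 2: 41, 1: 5}
adequate_or_better = completeness[3] + completeness[2]
print(round(100 * adequate_or_better / total, 1))  # 97.9
```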

2.3. Reproducibility

When the 18 responses that scored less than 6 for correctness were reassessed, 6 were scored the same as the initial query, 5 decreased in quality, and 7 improved in quality.

  • The median correctness score was 5 (IQR 4–5) with the initial query and 4.5 (IQR 2–6) with the repeat query (p = 0.64).
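Median and IQR are the right summaries here because the scores are ordinal. A small sketch of that computation with Python's standard library — the scores below are hypothetical, for illustration only; the paper reports only the summary statistics, not the raw 18 repeat-query scores:

```python
import statistics

# Hypothetical repeat-query correctness scores (18 values), NOT the
# study's actual data -- chosen only to illustrate median/IQR.
repeat_scores = [1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6]

median = statistics.median(repeat_scores)
q1, _, q3 = statistics.quantiles(repeat_scores, n=4)  # quartile cut points
print(median)        # middle of the sorted scores
print((q1, q3))      # interquartile range endpoints
```

With 18 values, `statistics.median` averages the 9th and 10th sorted scores, which is why a non-integer median like 4.5 can arise from an integer scale.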

2.4. Discussions & Limitations

This raises some concerns regarding the consistency of accuracy if chatbots are used in the clinical setting.

  • Pharmacists are uniquely trained to counsel patients on the most important aspects of a medication. Unlike ChatGPT, they do not require the patient to phrase a question correctly in order to cover an important counseling point.

If a chatbot is used as a source to inquire about medication information, specific, singular prompts may be required for accurate responses.

  • With only 2 investigators, there is a risk for personal bias and subjective interpretations.
  • An additional limitation to this study is that it was conducted using English only.
  • Using a chatbot to answer questions commonly asked by patients yields mostly accurate responses, but answers may include extraneous information or omit valuable information. Educating patients therefore remains important.


