Brief Review — MedAlpaca — An Open-Source Collection of Medical Conversational AI Models and Training Data

MedAlpaca, Fine-Tuned from Alpaca

Sik-Ho Tsang
3 min read · May 23, 2024
Alpacas: they are going to become MedAlpaca (not Mad Alpaca) (free image from Sarai Carrasco)

MedAlpaca — An Open-Source Collection of Medical Conversational AI Models and Training Data
, by University Hospital Aachen, Technical University of Munich, Berliner Hochschule für Technik (BHT), Universitätsmedizin Berlin
2023 arXiv v2, Over 110 Citations (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017–2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor] [DoctorGLM] [HuaTuo] 2024 [ChatGPT & GPT-4 on Dental Exam] [ChatGPT-3.5 on Radiation Oncology] [LLM on Clinical Text Summarization]
==== My Other Paper Readings Are Also Over Here ====

  • An innovative dataset is presented, which consists of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications.
  • Specifically, Alpaca is fine-tuned as MedAlpaca.


  1. MedAlpaca
  2. Results

1. MedAlpaca

Medical Datasets
  • The dataset consists of two main categories: a collection of established medical NLP tasks reformatted into instruction-tuning formats, and a crawl of various internet resources.
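The instruction-tuning format can be illustrated with a minimal sketch. The field names follow the common Alpaca convention, and the question-answer pair below is hypothetical, not taken from the paper's data:

```python
# Minimal sketch of an Alpaca-style instruction-tuning record.
# Field names follow the Alpaca convention; the QA pair is hypothetical.

def make_record(instruction: str, inp: str, output: str) -> dict:
    """Bundle one training example into the instruction/input/output format."""
    return {"instruction": instruction, "input": inp, "output": output}

record = make_record(
    instruction="Answer this question truthfully.",
    inp="What is the function of hemoglobin?",
    output="Hemoglobin transports oxygen from the lungs to peripheral tissues.",
)
```

Both dataset categories are ultimately normalized into records of this shape before fine-tuning.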

1.1. Dataset 1: Flash Cards Used by Medical Students

  • Medicine as a whole encompasses a wide range of subjects that medical students and graduates must master in order to practice effectively. This includes a profound understanding of basic medical sciences, clinical knowledge, and clinical skills.
  • Flashcards are leveraged as a source to create question-answer pairs for training. After excluding cards containing images, OpenAI’s GPT-3.5-Turbo is used to restructure the cards into coherent, contextually pertinent question-answer pairs.
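The rephrasing step can be sketched as prompt construction. The prompt wording and the example flashcard below are illustrative assumptions, not the paper's exact prompt, and the actual GPT-3.5-Turbo API call is omitted:

```python
# Hedged sketch: building a rephrasing prompt for one flashcard.
# Prompt wording and flashcard content are illustrative assumptions;
# the GPT-3.5-Turbo call that would consume this prompt is omitted.

def build_rephrase_prompt(front: str, back: str) -> str:
    """Ask a model to turn a front/back flashcard into a coherent QA pair."""
    return (
        "Rewrite this medical flashcard as a coherent, contextually "
        "relevant question-answer pair.\n"
        f"Front: {front}\nBack: {back}"
    )

prompt = build_rephrase_prompt(
    front="Most common cause of community-acquired pneumonia",
    back="Streptococcus pneumoniae",
)
```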

1.2. Dataset 2: Stackexchange Medical Sciences

  • The Stack Exchange dataset consists of 52,475 question-answer pairs obtained from 5 Stack Exchange forums related to biomedical sciences and related fields: Academia, Bioinformatics, Biology, Fitness, Health.
  • Only answers that received a minimum of 5 up-votes within the forum discussions are collected and paired with their corresponding questions.
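The up-vote filter described above can be sketched in a few lines; the record layout and example posts are hypothetical:

```python
# Sketch of the up-vote filter: keep only answers with >= 5 up-votes
# and pair them with their questions. Record layout is hypothetical.

MIN_UPVOTES = 5

def filter_qa_pairs(posts: list[dict]) -> list[tuple[str, str]]:
    """Return (question, answer) pairs for sufficiently up-voted answers."""
    return [
        (p["question"], p["answer"])
        for p in posts
        if p["upvotes"] >= MIN_UPVOTES
    ]

posts = [
    {"question": "Q1", "answer": "A1", "upvotes": 12},
    {"question": "Q2", "answer": "A2", "upvotes": 3},
]
pairs = filter_qa_pairs(posts)  # only the first post passes the threshold
```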

1.3. Dataset 3: Wikidoc

  • The Wikidoc platform has two main sub-sites, the “Living Textbook” and “Patient Information”.
  • The “Living Textbook” contains chapters for various medical specialties, which are crawled. GPT-3.5-Turbo is then used to rephrase each paragraph heading into a question, with the paragraph itself serving as the answer.
  • “Patient Information” is structured differently, in that each section subheading is already a question, making rephrasing unnecessary.

1.4. Dataset 4: Medical NLP Benchmarks

  • The COVID-19 Open Research Dataset Challenge (CORD-19).
  • Measuring Massive Multitask Language Understanding (MMLU).
  • Training data from the MedQA benchmark.
  • Training data from the PubMed Causal Benchmark.
  • Conversational data from medical forums.
  • The OpenAssistant dataset.

2. Results

Test set performance is evaluated on the United States Medical Licensing Examination (USMLE) Step 1, Step 2, and Step 3 self-assessment datasets.

Alpaca is fine-tuned from LLaMA on 52K instruction-following demonstrations.

Alpaca 7B and 13B are fine-tuned as MedAlpaca for 5 epochs.

  • LoRA is used for fine-tuning with low-rank matrices.
  • 8-bit matrix multiplication is used for the feed-forward and attention projection layers, along with an 8-bit optimizer.
  • All models trained with LoRA underwent 3 epochs.
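The LoRA idea behind the bullets above can be sketched numerically: instead of updating the full weight matrix W, a low-rank product BA is learned and added back, scaled by alpha/r. The shapes, rank, and scaling below are illustrative, not MedAlpaca's actual configuration:

```python
import numpy as np

# Toy numerical sketch of the LoRA update: W' = W + (alpha / r) * B @ A.
# Dimensions, rank, and scaling are illustrative, not the paper's config.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16           # model dim, LoRA rank, scaling factor

W = rng.standard_normal((d, d))  # frozen pre-trained weight (not updated)
B = np.zeros((d, r))             # B starts at zero, so the update starts at 0
A = rng.standard_normal((r, d))  # only A and B receive gradients during tuning

W_adapted = W + (alpha / r) * (B @ A)
```

Because only A and B (2·d·r parameters) are trained instead of the full d·d matrix, memory and compute drop sharply, which is why pairing LoRA with 8-bit weights and optimizers speeds up training, at the cost of some accuracy as noted below.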

Fine-tuned LLMs consistently surpassed the performance of their pre-trained-only counterparts. It is worth noting that while LoRA and 8-bit fine-tuning expedited the training process, employing these methods resulted in reduced accuracy.

  • A significant limitation is LLMs’ tendency to confabulate or generate text that appears plausible but is factually incorrect. This issue is especially concerning in the medical domain, where disseminating incorrect information can have serious implications for patient care and safety.


