Brief Review — DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task

DoctorGLM for Medical Chatbot in Chinese Language

Sik-Ho Tsang
4 min read · Feb 29, 2024
DoctorGLM (Image from Authors’ GitHub)

DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task
DoctorGLM, by ShanghaiTech University; Shanghai Jiao Tong University; United Imaging Intelligence; and Huashan Hospital, Fudan University
2023 arXiv v2, Over 50 Citations (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017 [LiveQA] … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT] [ChatDoctor]

  • Chinese medical dialogue datasets are collected with ChatGPT’s help, and several techniques are adopted to train an easy-to-deploy LLM.
  • Remarkably, ChatGLM-6B (from GLM, GLM-130B) is fine-tuned on a single A100 80G in 13 hours, which means having a healthcare-purpose LLM can be very affordable.


  1. DoctorGLM
  2. Results

1. DoctorGLM

1.1. Dataset with ChatGPT’s Help

  • Many high-quality medical datasets have been released in English, but Chinese datasets, which this study needs, are rare.
  • A set of raw English texts X = {x1, x2, …, xN} is first selected from the ChatDoctor dataset, and the corresponding high-quality translations Y = {y1, y2, …, yN} are obtained through the ChatGPT API.
  • Then, a BART-based pre-trained model is fine-tuned solely on paired X and Y without any additional datasets. In this way, the language model can distill the expert-level knowledge from ChatGPT and the refined small model can act as an acceptable alternative to LLMs.
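The data-preparation step above can be sketched as follows. This is a minimal illustration, not the authors' code: `translate_with_llm` is a hypothetical stand-in for a ChatGPT API call (here it only looks up a toy table), and `build_parallel_corpus` is an illustrative name. The resulting (x, y) pairs are the kind of data a BART-style sequence-to-sequence model would then be fine-tuned on.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a ChatGPT translation call. A real pipeline
# would send each English text to the API and return the Chinese output.
TOY_TRANSLATIONS = {
    "I have a headache.": "我头痛。",
    "What causes a fever?": "发烧的原因是什么？",
}


def translate_with_llm(text: str) -> str:
    """Return the (toy) Chinese translation of an English text."""
    return TOY_TRANSLATIONS[text]


def build_parallel_corpus(
    english_texts: List[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each raw English text x_n with its translation y_n."""
    return [(x, translate(x)) for x in english_texts]


X = list(TOY_TRANSLATIONS)  # raw English texts x_1..x_N
pairs = build_parallel_corpus(X, translate_with_llm)
for x, y in pairs:
    print(f"{x} -> {y}")
```

Passing the translator as a function keeps the pairing logic independent of which service produces Y, so the same code works once a small fine-tuned model replaces the API.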

1.2. Prompt Designer

  • Large language models have achieved remarkable performance in conversational tasks. However, their outputs may be unreliable and deceptive.
  • A prompt designer module is used that pre-processes the user’s input.

The prompt designer module extracts relevant keywords such as the name of the disease or symptoms from the user’s input. The module then utilizes the name of the most likely disease as a label and generates a brief description based on a professional disease knowledge library.

  • In particular, the knowledge library contains 3231 detailed disease documents, all sourced from the Merck Manual of Diagnosis and Therapy.
  • The prompt designer’s output includes information about the disease’s symptoms, diagnosis, treatment options, and preventive measures.

By providing a professionally generated prompt, the prompt designer expands the expertise and reliability of DoctorGLM for a particular disease. The generated prompt is integrated into the large language model, along with the original input, to improve the accuracy and reliability of DoctorGLM’s responses.
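The prompt designer flow described above (extract keywords, pick the most likely disease, look up a description, prepend it to the user's input) might look roughly like the sketch below. Everything here is an assumption for illustration: the two-entry library, the substring keyword matching, the count-based scoring, and the prompt template are placeholders, not the paper's implementation.

```python
from typing import Dict, List, Optional

# Toy stand-ins for the disease knowledge library (hypothetical entries).
DISEASE_KEYWORDS: Dict[str, List[str]] = {
    "influenza": ["fever", "cough", "sore throat"],
    "migraine": ["headache", "nausea", "light sensitivity"],
}
DISEASE_DESCRIPTIONS: Dict[str, str] = {
    "influenza": "<symptoms, diagnosis, treatment, prevention for influenza>",
    "migraine": "<symptoms, diagnosis, treatment, prevention for migraine>",
}


def extract_keywords(user_input: str) -> List[str]:
    """Return every known keyword that appears in the user's input."""
    text = user_input.lower()
    return [kw for kws in DISEASE_KEYWORDS.values() for kw in kws if kw in text]


def most_likely_disease(keywords: List[str]) -> Optional[str]:
    """Pick the disease with the most matched keywords, if any matched."""
    scores = {
        disease: sum(kw in keywords for kw in kws)
        for disease, kws in DISEASE_KEYWORDS.items()
    }
    best = max(scores, key=lambda d: scores[d])
    return best if scores[best] > 0 else None


def design_prompt(user_input: str) -> str:
    """Prepend a disease description to the input when one is found."""
    disease = most_likely_disease(extract_keywords(user_input))
    if disease is None:
        return user_input  # nothing recognized: pass the input through
    return f"Background ({disease}): {DISEASE_DESCRIPTIONS[disease]}\nPatient: {user_input}"


print(design_prompt("I have a fever and a bad cough."))
```

When no keyword matches, the sketch falls back to the raw input, so the language model still receives the question unchanged.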

1.3. Training

Training Datasets
  • The ChatGLM-6B (from GLM, GLM-130B) model is used as the base model.
  • The base model was trained on approximately 1 trillion tokens of Chinese and English corpus, followed by supervised fine-tuning, feedback bootstrap, and reinforcement learning with human feedback.
  • P-tuning [8] is used for fine-tuning. It optimizes only continuous prompts, which significantly reduces storage and memory usage per task, and it performs comparably to full fine-tuning while tuning only 0.1%-3% of the parameters. (Hope I can read P-tuning in the future.)
  • DoctorGLM’s training process can handle approximately 80,000 single question-and-answer pairs per hour per GPU.
  • Assuming three epochs are needed, fine-tuning a DoctorGLM on 100,000 QA pairs takes 3 × 100,000 / 80,000 = 3.75 hours. With a cloud A100 GPU at approximately 5 USD per hour, this translates to a cost of approximately 18.75 USD.
  • The inference process for DoctorGLM requires only about 13 GB of GPU memory.
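The cost estimate in the list above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is illustrative, and the epoch count of three is the assumption that makes the throughput of 80,000 pairs per hour agree with the quoted 3.75 hours for 100,000 QA pairs.

```python
from typing import Tuple


def fine_tuning_cost(
    num_pairs: int,
    epochs: int,
    pairs_per_hour: int,
    usd_per_hour: float,
) -> Tuple[float, float]:
    """Return (GPU-hours, USD) for fine-tuning on a single GPU."""
    hours = num_pairs * epochs / pairs_per_hour
    return hours, hours * usd_per_hour


# Figures quoted in the text; epochs=3 is the assumption noted above.
hours, cost = fine_tuning_cost(
    num_pairs=100_000,
    epochs=3,
    pairs_per_hour=80_000,
    usd_per_hour=5.0,
)
print(f"{hours:.2f} GPU-hours, {cost:.2f} USD")  # 3.75 GPU-hours, 18.75 USD
```

Changing `num_pairs` or `usd_per_hour` gives a quick estimate for other dataset sizes or cloud prices.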

2. Results

  • (More examples are shown in the paper.)


