Brief Review — HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge

HuaTuo (华驼) LLM

Sik-Ho Tsang
3 min read · Mar 12, 2024
HuaTuo (华驼)

HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge, by Harbin Institute of Technology
2023 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017 [LiveQA] … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM] [ExBEHRT]

  • HuaTuo (华驼), a LLaMA-based model, is proposed. It is obtained by supervised fine-tuning on generated QA (Question-Answer) instances.


  1. HuaTuo
  2. Results

1. HuaTuo

1.1. Dataset


LLaMA-7B model is adopted as the base model.

A Chinese medical knowledge graph, CMeKG (Odmaa et al., 2019), is used as the knowledge source. It provides medical knowledge about diseases, drugs, symptoms, etc. The above table shows several knowledge cases in the CMeKG knowledge base.

1.2. Instruction Data

  • Instruction tuning involves supervised fine-tuning on training instances, each paired with an instruction that describes the task in natural language.
  • Instruction data is generated based on the above medical knowledge.

However, for an LLM used in medical dialogue, inputs are mostly stated as questions, and the instructions would all look like “Answer the following question”. Therefore, the instructions are discarded and only the inputs are preserved for HuaTuo.

Thus, knowledge instances are first sampled from the knowledge graph, and then QA instances are generated from each sampled piece of knowledge with the OpenAI API (OpenAI, 2022).
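To make the sample-then-generate step concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the toy knowledge entries are placeholders (not real CMeKG content), the prompt wording is made up, and `fake_llm` is a hypothetical stand-in for the actual OpenAI API call, whose prompt and parameters are not described in this review.

```python
import json
import random

# Toy placeholder entries, NOT real CMeKG content (assumption for illustration).
TOY_KG = [
    {"entity": "disease_A", "relation": "symptom", "values": ["symptom_1", "symptom_2"]},
    {"entity": "disease_A", "relation": "drug", "values": ["drug_1"]},
    {"entity": "disease_B", "relation": "symptom", "values": ["symptom_3"]},
]


def sample_knowledge(kg, k, seed=0):
    """Sample k knowledge instances without replacement."""
    rng = random.Random(seed)
    return rng.sample(kg, k)


def build_prompt(instance):
    """Build a generation prompt around one knowledge instance (hypothetical wording)."""
    knowledge = json.dumps(instance, ensure_ascii=False)
    return (
        "Based on the following medical knowledge, write one patient question "
        "and a doctor's answer.\n"
        f"Knowledge: {knowledge}\n"
        'Return JSON with the keys "question" and "answer".'
    )


def generate_qa(instance, llm):
    """Ask the LLM for a QA pair grounded in the given knowledge instance."""
    pair = json.loads(llm(build_prompt(instance)))
    return {"question": pair["question"], "answer": pair["answer"]}


def fake_llm(prompt):
    """Hypothetical stand-in for a real API call; returns a fixed JSON string."""
    return json.dumps({"question": "placeholder question", "answer": "placeholder answer"})


samples = sample_knowledge(TOY_KG, k=2)
qa_pairs = [generate_qa(s, fake_llm) for s in samples]
```

Passing the LLM in as a plain callable keeps the sketch runnable offline; a real client call could be swapped in for `fake_llm` without touching the rest.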

Finally, over 8,000 instruction instances are collected (see the examples in Table 3) and used as training instances for supervised fine-tuning.
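As a rough illustration of what "input only, no instruction" could look like in practice, the sketch below drops the instruction field from an instruction-style record and serializes the rest for supervised fine-tuning. The field names, the `Question:`/`Answer:` template, and the `</s>` end-of-sequence marker are all assumptions; the paper's exact template is not given in this review.

```python
def drop_instruction(record):
    """Keep only the input and output fields of an instruction-style record."""
    return {"input": record["input"].strip(), "output": record["output"].strip()}


def to_sft_text(instance, eos_token="</s>"):
    """Return (prompt, full_text) for one training instance.

    The template is hypothetical. The prompt is returned separately so that a
    trainer could, if desired, compute the loss on the answer part only.
    """
    prompt = f"Question: {instance['input']}\nAnswer: "
    return prompt, prompt + instance["output"] + eos_token


record = {
    "instruction": "Answer the following question",  # the same for every sample
    "input": " placeholder patient question ",
    "output": "placeholder answer",
}
instance = drop_instruction(record)
prompt, full_text = to_sft_text(instance)
```

Since the instruction would be identical for every sample, removing it loses no information and shortens each training sequence.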

2. Results

2.1. SUS Metric & Test Set

  • For medical QA tasks, the SUS metric covers 3 dimensions: Safety, Usability, and Smoothness.
  1. Safety determines whether the response includes anything that can mislead the user into danger, such as wrong medicine recommendations.
  2. Usability reflects the medical expertise of a specific response.
  3. Smoothness represents the model's ability as a language model.
  • A test set of potential questions in Chinese dialogue scenarios is constructed.
  • 5 annotators with medical backgrounds are recruited to assess the randomly mixed responses of the models on a three-point scale for each dimension of Safety, Usability, and Smoothness (SUS), ranging from 1 (not acceptable) to 3 (good).
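The scoring procedure can be sketched as follows, assuming each model's per-dimension score is a simple average over all annotator ratings. That aggregation rule, the record layout, and the model names are assumptions for illustration, and the ratings below are made-up values, not results from the paper.

```python
from statistics import mean

DIMENSIONS = ("safety", "usability", "smoothness")


def validate(rating):
    """Each dimension must be rated on the three-point scale 1..3."""
    for dim in DIMENSIONS:
        if rating[dim] not in (1, 2, 3):
            raise ValueError(f"{dim} must be 1, 2, or 3, got {rating[dim]!r}")


def sus_scores(ratings):
    """Average each SUS dimension per model over all annotator ratings."""
    by_model = {}
    for rating in ratings:
        validate(rating)
        by_model.setdefault(rating["model"], []).append(rating)
    return {
        model: {dim: mean(r[dim] for r in rs) for dim in DIMENSIONS}
        for model, rs in by_model.items()
    }


# Made-up ratings for two hypothetical models (not from the paper).
ratings = [
    {"model": "model_x", "safety": 3, "usability": 1, "smoothness": 2},
    {"model": "model_x", "safety": 3, "usability": 2, "smoothness": 2},
    {"model": "model_y", "safety": 2, "usability": 3, "smoothness": 3},
]
scores = sus_scores(ratings)
```

In an actual blind evaluation, the responses would be shuffled before being shown to annotators, so that the `model` field is only attached back to each rating afterwards.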
SUS Score
  • Although LLaMA achieves the highest safety score, its responses are often uninformative, resulting in a low usability score.

On the other hand, the proposed HuaTuo model significantly improves knowledge usability without compromising safety much.


