Temporal modelling of a patient’s medical history, which takes into account the sequence of past events, can be used to predict future events such as a diagnosis of a new disorder or complication of a previous or existing disorder.
MedGPT, a novel Transformer-based pipeline, is proposed; it uses Named Entity Recognition (NER) and Linking tools (i.e., MedCAT) to structure and organize the free-text portion of EHRs and to anticipate a range of future medical events (initially disorders).
(Be aware when searching for MedGPT on the Internet: many products and services are called MedGPT nowadays.)
Outline
MedGPT
Datasets
Results
1. MedGPT
1.1. Model
MedGPT is built on top of GPT-2, which is trained with causal language modeling (CLM).
Given a corpus of patients U = {u_1, u_2, u_3, ...}, where each patient is defined as a sequence of tokens u_i = {w_1, w_2, w_3, ...} and each token is a medically relevant and temporally defined piece of patient data, the objective is the standard language modeling objective, i.e. maximizing L(U) = Σ_i Σ_j log P(w_j | w_1, ..., w_{j-1}; Θ) over all patients u_i.
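As a rough illustration (not the authors' code), this objective is equivalent to a next-token cross-entropy over each patient's token sequence; the sketch below assumes PyTorch tensors holding the model's logits and the patient's token ids.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, token_ids):
    """Standard causal LM objective: predict token t from tokens < t.

    logits:    (batch, seq_len, vocab_size) output of the Transformer
    token_ids: (batch, seq_len) medically relevant tokens w_1..w_n per patient
    """
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = token_ids[:, 1:]
    # Negative log-likelihood = -sum_j log P(w_j | w_1..w_{j-1}; Theta)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```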
1.2. 8 Model Variants
Eight different approaches, built on top of the base GPT-2 model, are tried.
2. Datasets
Two EHR datasets were used: King’s College Hospital (KCH) NHS Foundation Trust, UK, and MIMIC-III [10].
No preprocessing or filtering was done on the MIMIC-III clinical notes, and all 2,083,179 free-text documents were used directly.
At KCH, a total of 18,436,789 documents was collected; after the filtering step, 13,084,498 documents remained.
In brief, the Medical Concept Annotation Toolkit (MedCAT [11]) was used to extract disorder concepts from free text and link them to the SNOMED-CT concept database.
(Please refer to the paper for details of the more sophisticated disorder-extraction procedure.)
The concepts were then grouped by patient and only the first occurrence of a concept was kept.
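A minimal sketch of this step, assuming MedCAT's model-pack API (which may differ slightly between versions) and a hypothetical model pack path; filtering to disorder-type concepts and the authors' more sophisticated extraction logic are omitted.

```python
from medcat.cat import CAT

# Hypothetical SNOMED-CT model pack; the path is a placeholder.
cat = CAT.load_model_pack("snomed_medcat_modelpack.zip")

def extract_first_occurrences(documents):
    """documents: iterable of (patient_id, timestamp, text), ordered by time.

    Returns, per patient, the chronologically first occurrence of each
    SNOMED-CT concept (CUI) found in the free text.
    """
    history = {}  # patient_id -> {cui: first timestamp}
    for patient_id, timestamp, text in documents:
        entities = cat.get_entities(text).get("entities", {}).values()
        for ent in entities:
            cui = ent["cui"]
            patient = history.setdefault(patient_id, {})
            if cui not in patient:  # keep only the first occurrence
                patient[cui] = timestamp
    return history
```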
Without any filtering, there were 1,121,218 patients at KCH and 42,339 in MIMIC-III; after removing all disorders with frequency < 100 and all patients with fewer than 5 tokens, 582,548 and 33,975 patients remained, respectively. The length of each sample/patient is limited to 50 tokens.
The resulting dataset was then split into train/test sets with an 80/20 ratio. The train set was further split into train/validation sets with a 90/10 ratio.
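The filtering and splitting described above could look roughly as follows (a sketch under the assumption that `patients` maps each patient id to its chronologically ordered list of first-occurrence concept tokens; not the authors' code).

```python
import random
from collections import Counter

MIN_CONCEPT_FREQ = 100   # drop disorders seen fewer than 100 times
MIN_TOKENS = 5           # drop patients with fewer than 5 tokens
MAX_TOKENS = 50          # limit each patient sequence to 50 tokens

def prepare_splits(patients, seed=42):
    """patients: dict of patient_id -> list of concept tokens (first occurrences)."""
    freq = Counter(tok for seq in patients.values() for tok in seq)
    kept = []
    for pid, seq in patients.items():
        seq = [tok for tok in seq if freq[tok] >= MIN_CONCEPT_FREQ]
        if len(seq) >= MIN_TOKENS:
            kept.append((pid, seq[:MAX_TOKENS]))

    random.Random(seed).shuffle(kept)
    n_test = int(0.2 * len(kept))          # 80/20 train/test split
    test, train = kept[:n_test], kept[n_test:]
    n_val = int(0.1 * len(train))          # 90/10 train/validation split
    val, train = train[:n_val], train[n_val:]
    return train, val, test
```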
3. Results
3.1. Model Variants
Performance of Model Variants
The combined “GLU+Rotary” MedGPT achieves the best results.
Finally, the MedGPT model, which consists of the GPT-2 base model with the GLU+Rotary extension, is tested on the two datasets, KCH and MIMIC-III.
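For context, below is a rough sketch of the two extensions named above, using standard formulations of a gated linear unit (GLU) feed-forward block and rotary position embeddings; the paper's exact variant (e.g. gating activation, head-wise wiring) may differ.

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Feed-forward block with a gated linear unit instead of a plain MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(dim, hidden_dim * 2)
        self.out = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        value, gate = self.proj(x).chunk(2, dim=-1)
        return self.out(value * torch.sigmoid(gate))

def rotary_embed(x, base=10000):
    """Apply rotary position embeddings to queries/keys of shape (..., seq, dim).

    dim is assumed even; positions are encoded by rotating pairs of channels.
    """
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```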
3.2. Performance Comparison
Comparison with BoC SVM and LSTM
MedGPT outperforms both the bag-of-concepts (BoC) SVM and the LSTM baselines.
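For contrast with MedGPT, a bag-of-concepts SVM baseline discards the temporal order of past disorders and treats prediction of a given future disorder as plain binary classification. The sketch below assumes scikit-learn and hypothetical `histories`/`labels` inputs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Each patient history is reduced to an unordered bag of SNOMED-CT concept ids,
# e.g. "44054006 38341003 73211009"; the label says whether the target disorder
# occurs next. Temporal ordering, which MedGPT exploits, is discarded here.
def train_boc_svm(histories, labels):
    model = make_pipeline(
        CountVectorizer(token_pattern=r"\S+"),  # one feature per concept id
        LinearSVC(),
    )
    model.fit(histories, labels)
    return model
```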
3.3. Qualitative Analysis
Example 1: A simple binary task on which MedGPT performed well, consistent with the medical literature.
Example 2: The background (cerebral aneurysm) provided the contextual cue for the rarer diagnosis, which MedGPT successfully discerned.
Example 3: Used to test longer-range attention; MedGPT successfully handled the required indirect inference.
Example 4: Similar to Example 3, attention in the presence of distractors was tested by intermixing historical diseases.
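To make the qualitative examples concrete, next-disorder prediction at inference time amounts to feeding the patient's ordered concept tokens to the trained model and ranking the most probable next tokens. The sketch below assumes a Hugging Face-style causal LM whose output object exposes `.logits`; the token ids are hypothetical.

```python
import torch

@torch.no_grad()
def predict_next_disorders(model, history_token_ids, top_k=5):
    """Rank candidate next disorders given a patient's ordered concept history.

    model:             a trained causal LM returning logits of shape (1, seq, vocab)
    history_token_ids: list of ints encoding the patient's past disorder concepts
    """
    input_ids = torch.tensor([history_token_ids])
    logits = model(input_ids).logits      # assumes an HF-style output object
    next_token_logits = logits[0, -1]     # distribution over the next concept
    probs = torch.softmax(next_token_logits, dim=-1)
    top = torch.topk(probs, top_k)
    return list(zip(top.indices.tolist(), top.values.tolist()))
```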