# Review — G-BERT: Pre-training of Graph Augmented Transformers for Medication Recommendation

## G-BERT, Using Ontology Information and **BERT**

Pre-training of Graph Augmented Transformers for Medication Recommendation, by IQVIA, IBM Research AI, and Georgia Institute of Technology

G-BERT, 2019 IJCAI, Over 190 Citations (Sik-Ho Tsang @ Medium)

Medical LLM: 2020 [BioBERT] [BEHRT], 2021 [MedGPT], 2023 [Med-PaLM]

==== My Other Paper Readings Are Also Over Here ====

# Outline

1. **Problem Formulation**
2. **G-BERT**
3. **Results**

# 1. Problem Formulation

## 1.1. Longitudinal EHR Data

- In longitudinal Electronic Health Record (EHR) data, **each patient can be represented as a sequence of multivariate observations**, where *n* ranges from 1 to *N*, *N* is the **total number of patients**, and *T*(*n*) is the **number of visits of the *n*-th patient**. **Two main medical codes** are chosen to represent **each visit** of a patient, which is the **union** set *Xt* = *Ctd* ∪ *Ctm* of the corresponding **diagnosis codes** *Ctd* ⊂ *Cd* and **medication codes** *Ctm* ⊂ *Cm*.
- For simplicity, *Ct** is used as the unified notation for either type of medical code, and the superscript (*n*) is dropped for a single patient whenever it is unambiguous. *C** denotes the **medical code set**, |*C**| is the size of the code set, and *c** ∈ *C** is a **medical code**.

## 1.2. Medical Codes

**Medical codes**are usually categorized according to a**tree-structured classification system**such as**ICD-9 ontology for diagnosis**and**ATC ontology is for medication.***Od*,*Om*to denote the ontology for diagnosis and medication.*O****unified definition for different type of medical codes.**- Two functions
are defined, which accept target medical code and return*pa*(.),*ch*(.)**ancestors’ code****set**and**direct child code set.**
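To make the notation concrete, here is a minimal sketch (not from the paper) of a tree-structured code ontology with *pa*(·) and *ch*(·); the `Ontology` class and the ICD-9-like codes are hypothetical.

```python
# Minimal sketch (hypothetical): a tree-structured code ontology with
# pa(.) returning the ancestor set and ch(.) the direct child set.
class Ontology:
    def __init__(self, parent):
        # parent: dict mapping each code to its parent code (roots map to None)
        self.parent = parent
        self.children = {}
        for code, p in parent.items():
            if p is not None:
                self.children.setdefault(p, set()).add(code)

    def pa(self, code):
        """All ancestors of `code`, walking up to the root."""
        out = set()
        p = self.parent.get(code)
        while p is not None:
            out.add(p)
            p = self.parent.get(p)
        return out

    def ch(self, code):
        """Direct children of `code` (empty set for leaf codes)."""
        return self.children.get(code, set())


# Tiny ICD-9-like fragment: 250 (diabetes) -> 250.0 -> 250.00
onto = Ontology({"250": None, "250.0": "250", "250.00": "250.0"})
print(sorted(onto.pa("250.00")))  # ['250', '250.0']
print(sorted(onto.ch("250.0")))   # ['250.00']
```

The leaf codes (e.g. `250.00`) are what appear in raw EHR data; the internal nodes exist only in the ontology.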

## 1.3. Problem Definition (Medication Recommendation)

- Given the **diagnosis codes** *Ctd* of the visit at time *t* and the **patient history** *X*1:*t*, we want to recommend multiple medications by generating a multi-label output ŷ*t* ∈ {0, 1}^|*Cm*|.

# 2. G-BERT

Conceptually, an ontology embedding enhanced by a GNN is input to BERT for pre-training. BERT is then fine-tuned for the downstream medication recommendation task.

## 2.1. Ontology Embedding

- **Ontology embedding is constructed from the diagnosis ontology *Od* and the medication ontology *Om*.** Since the **medical codes** in raw EHR data can be considered as **leaf nodes in these ontology trees**, the medical code embeddings **can be enhanced using graph neural networks (GNNs)** to **integrate the ancestors’ information** of these codes.
- A **two-stage procedure** is performed with a specially designed GNN for ontology embedding.
- To start, an **initial embedding vector** is assigned to every medical code *c** ∈ *O** with a **learnable embedding matrix** *We*, where *d* is the embedding dimension.

**Stage 1**: For each non-leaf node *c**, its **enhanced medical embedding** is *hc** = *g*(*c**, *ch*(*c**), *We*):

- where *g*(·, ·, ·) is an aggregation function which accepts the target medical code *c**, its direct child codes *ch*(*c**), and the initial embedding matrix *We*.

Intuitively, the aggregation function can **pass and fuse information** into the target node from its direct children, which makes the ancestor code’s embedding more related to its child codes’ embeddings.

**Stage 2**: After obtaining the enhanced embeddings, the enhanced embedding matrix *He* is used to **pass information back** and get the **ontology embedding for leaf codes** *oc** = *g*(*c**, *pa*(*c**), *He*):

- *g*(·, ·, ·) can be as simple as a sum or mean. Here, *g*(·, ·, ·) is defined with an attention mechanism as follows (taking Stage 2 as an example):
- where **||** is **concatenation**, which enables the multi-head attention mechanism, **σ** is the **activation function**, *Wk* is the weight matrix of the *k*-th head, and *αki,j* is the *k*-th normalized attention coefficient, computed as:
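As a rough illustration, the attention-based aggregation can be sketched in NumPy as a single-head, GAT-style version of *g*(·, ·, ·); the shapes, the LeakyReLU slope, and the parameterization are assumptions, not the paper’s exact formulation.

```python
import numpy as np

def g(target, neighbors, H, W, a, act=np.tanh):
    """GAT-style single-head aggregation (hypothetical shapes):
      target    - index of the code c* being updated
      neighbors - indices attended over, e.g. {c*} plus ch(c*) in Stage 1
      H         - (num_codes, d) embedding matrix (We or He)
      W         - (d, d') projection;  a - (2*d',) attention vector
    """
    hw = H @ W                                           # project embeddings
    pair = np.concatenate([np.tile(hw[target], (len(neighbors), 1)),
                           hw[neighbors]], axis=1)       # [W h_i || W h_j]
    scores = pair @ a
    scores = np.where(scores > 0, scores, 0.2 * scores)  # LeakyReLU
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                                  # normalized coefficients
    return act((alpha[:, None] * hw[neighbors]).sum(axis=0))

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))          # 5 codes, d = 8
W = rng.normal(size=(8, 8))
a = rng.normal(size=(16,))
h_star = g(0, [0, 1, 2], H, W, a)    # enhance code 0 from itself + children 1, 2
```

Multi-head attention would run several such heads with separate *Wk* and concatenate (**||**) their outputs.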

## 2.2. Pretraining G-BERT

- The model takes the above **ontology embedding** as **input** and derives the **visit embedding *vt*** for a patient at the *t*-th visit:
- where [CLS] is a special token as in BERT, put in the first position of each visit of type *.
- One big difference between language sentences and EHR sequences is that the medical codes within the same visit generally do not have an order, so the **position embedding is removed**.
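A simplified stand-in for the visit encoder: prepend a [CLS] row to the visit’s code embeddings, run position-free self-attention, and read off the [CLS] output. Because no position embedding is added, permuting the codes within a visit leaves the visit embedding unchanged, which the example checks. All names and shapes here are illustrative, and a single attention layer stands in for the full BERT stack.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One single-head scaled dot-product self-attention layer."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V

def visit_embedding(code_embs, cls_emb, Wq, Wk, Wv):
    """Prepend [CLS] to the visit's code (ontology) embeddings, attend with
    NO position embedding, and take the [CLS] row as the visit embedding."""
    X = np.vstack([cls_emb, code_embs])   # [CLS] in the first position
    return self_attention(X, Wq, Wk, Wv)[0]

rng = np.random.default_rng(1)
d = 6
codes = rng.normal(size=(4, d))           # ontology embeddings of 4 codes
cls = rng.normal(size=(d,))               # learned [CLS] embedding
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
v1 = visit_embedding(codes, cls, Wq, Wk, Wv)
v2 = visit_embedding(codes[[2, 0, 3, 1]], cls, Wq, Wk, Wv)
print(np.allclose(v1, v2))  # True: codes within a visit are an unordered set
```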

## 2.2.1. Self-Prediction Task

- This task is to **recover from the visit embedding** *v** what it is made of, i.e., the input medical codes *Ct** for each visit as follows:
- The **binary cross-entropy loss** *Ls* is used, where *f*(*v**) denotes the visit embedding transformed by a fully connected neural network *f*(·) with one hidden layer.
- Similar to BERT, **15%** of the codes in *C** are **masked randomly**.
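A hedged sketch of the two ingredients above, random 15% masking and a binary cross-entropy objective over the code vocabulary; function names and shapes are assumptions.

```python
import numpy as np

def mask_codes(codes, mask_rate=0.15, rng=None):
    """Randomly replace ~15% of a visit's codes with a [MASK] token,
    analogous to BERT's masked language modeling."""
    rng = rng or np.random.default_rng()
    return [c if rng.random() >= mask_rate else "[MASK]" for c in codes]

def self_prediction_loss(logits, multi_hot):
    """Binary cross-entropy between f(v*) (logits over the code vocabulary,
    f being a one-hidden-layer network) and the visit's true multi-hot codes."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    return -np.mean(multi_hot * np.log(p + eps)
                    + (1 - multi_hot) * np.log(1 - p + eps))

# Confident predictions of the true codes give a near-zero loss.
loss = self_prediction_loss(np.array([8.0, -8.0, 8.0]),
                            np.array([1.0, 0.0, 1.0]))
masked = mask_codes(["428.0", "250.00", "401.9"], rng=np.random.default_rng(0))
```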

## 2.2.2. Dual-Prediction Task

- Note again that the **ICD-9 ontology is for diagnosis (d)** and the **ATC ontology is for medication (m)**.
- In medication recommendation, multiple medications can be predicted given only the diagnosis codes. Inversely, unknown diagnoses can also be predicted given the medication codes.

## 2.2.3. Overall Loss Function

- Finally, the loss below is used to train on EHR data from all patients who have only one hospital visit:

## 2.3. Fine-Tuning G-BERT

- **The known diagnosis codes** *Ctd* at the prediction time *t* are also represented using the same model as the visit embedding *vt**.
- Concatenating the mean of the previous diagnosis visit embeddings and medication visit embeddings, together with the last diagnosis visit embedding, an **MLP-based prediction layer** is built on top to **predict the recommended medication codes** as:

- Given the true labels *yt* at each time stamp *t*, the loss function for the whole EHR sequence (i.e. a patient) is:
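The fine-tuning head described above could be sketched as follows; the hidden size, activation, and all parameter names are assumptions, not the paper’s exact layer.

```python
import numpy as np

def predict_medications(diag_hist, med_hist, diag_now, W1, b1, W2, b2):
    """Hedged sketch of the fine-tuning head: concatenate the mean of the
    previous diagnosis visit embeddings, the mean of the previous medication
    visit embeddings, and the current diagnosis visit embedding, then apply
    a one-hidden-layer MLP with a sigmoid to get multi-label scores y_hat
    over |Cm| medication codes."""
    x = np.concatenate([diag_hist.mean(axis=0),
                        med_hist.mean(axis=0),
                        diag_now])
    h = np.tanh(x @ W1 + b1)              # hidden layer
    logits = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))  # y_hat in (0, 1)^|Cm|

rng = np.random.default_rng(2)
d, n_meds = 8, 10
y_hat = predict_medications(rng.normal(size=(3, d)),  # 3 past diagnosis visits
                            rng.normal(size=(3, d)),  # 3 past medication visits
                            rng.normal(size=(d,)),    # current diagnosis visit
                            rng.normal(size=(3 * d, 16)), np.zeros(16),
                            rng.normal(size=(16, n_meds)), np.zeros(n_meds))
```

Thresholding `y_hat` (e.g. at 0.5) would then yield the recommended medication set, and the sequence loss is the binary cross-entropy of `y_hat` against the true labels *yt* summed over time stamps.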

# 3. Results

## 3.1. Dataset & Some Training Details

- EHR data from **MIMIC-III** [Johnson et al., 2016] is used. The drug coding is transformed from NDC to ATC Third Level in order to use the ontology information. The dataset is split into **training, validation and testing** sets in a **0.6:0.2:0.2** ratio.
- GNN: input embedding dimension is 75, number of attention heads is 4. BERT: hidden dimension is 300, dimension of the position-wise feed-forward networks is 300, 2 hidden layers with 4 attention heads each.
- Notably, the authors **alternated** the **pre-training (5 epochs)** and **fine-tuning (5 epochs)** procedures for **15 rounds** to stabilize training. (This training procedure is quite special to me.)
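The alternating schedule can be sketched as a simple loop; `pretrain_epoch` and `finetune_epoch` are hypothetical callbacks standing in for the real training steps.

```python
# Hedged sketch of the alternating schedule: 15 rounds of
# (5 pre-training epochs, then 5 fine-tuning epochs).
def alternate_training(pretrain_epoch, finetune_epoch,
                       rounds=15, epochs_per_phase=5):
    log = []
    for _ in range(rounds):
        for _ in range(epochs_per_phase):
            pretrain_epoch()              # masked/dual prediction objectives
            log.append("pretrain")
        for _ in range(epochs_per_phase):
            finetune_epoch()              # medication recommendation objective
            log.append("finetune")
    return log

log = alternate_training(lambda: None, lambda: None)
print(len(log))  # 150 epochs in total
```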

## 3.2. Results

- **G-**: uses medical embedding without ontology information. **P-**: no pre-training.
- Comparing the last 4 rows, **ontology information and pre-training are both important.**

The final model **G-BERT** is **better** than the attention-based model **RETAIN** and the recently published state-of-the-art model **GAMENet**. Specifically, even with the extra information of DDI knowledge and procedure codes, GAMENet still performs worse than G-BERT.