Brief Review — Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction

Modifying BERT for Disease Prediction Using EHR

Sik-Ho Tsang
Nov 1, 2023

Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, by University of Texas Health Science Center at Houston and Peng Cheng Laboratory
2021 npj Digit. Med., Over 330 Citations (Sik-Ho Tsang @ Medium)

Medical LLM
2020 [BioBERT] [BEHRT] 2021 [MedGPT] 2023 [Med-PaLM]

  • Med-BERT is proposed, which adapts the BERT framework originally developed for the text domain to the structured EHR domain.
  • It is pretrained on a structured EHR dataset of 28,490,650 patients.
  • It is then fine-tuned on two disease prediction tasks from two clinical databases.


  1. Pretraining Cohort
  2. Med-BERT
  3. Results

1. Pretraining Cohort

Selection Pipeline
Cohort Details
  • The cohorts are from two databases: Cerner Health Facts® (version 2017) (Cerner) and Truven Health MarketScan® (Truven):
  1. Cerner is a de-identified EHR database covering over 600 hospitals and clinics in the United States. It represents over 68 million unique patients and includes longitudinal data from 2000 to 2017.
  2. Truven (version 2015) is a de-identified patient-level claims dataset. It represents over 170 million commercially insured patients from 2011 to 2015.
Example of Structured EHR Data

The structured EHR data of each patient is defined as a sequence of visits, each as a list of codes. The codes within a visit can be either ordered or unordered. If unordered, the EHR data for each patient can be reduced to a sequence of sets.

  • The order can come from the priority that billers assign to the diagnosis codes, e.g., the primary diagnosis usually gets the first priority, followed by the second most important diagnosis, and so on. In this case, the codes within a visit are ordered.
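
To make the data layout concrete, here is a minimal sketch of one patient as a sequence of visits, each a list of codes. The helper names (to_sets, serialize) and the ICD-10 codes are my own illustrative assumptions, not taken from the paper.

```python
from typing import FrozenSet, List, Tuple

# One patient = a sequence of visits; one visit = a list of diagnosis codes.
# Hypothetical patient with illustrative ICD-10 codes; the first code in
# each visit is treated as the primary (highest-priority) diagnosis.
patient: List[List[str]] = [
    ["E11.9", "I10"],
    ["I50.9", "E11.9", "N18.3"],
]


def to_sets(visits: List[List[str]]) -> List[FrozenSet[str]]:
    """Unordered view: reduce the patient to a sequence of code sets."""
    return [frozenset(codes) for codes in visits]


def serialize(visits: List[List[str]]) -> List[Tuple[str, int, int]]:
    """Ordered view: flatten to (code, visit_index, order_in_visit), 1-based."""
    triples = []
    for visit_idx, codes in enumerate(visits, start=1):
        for order, code in enumerate(codes, start=1):
            triples.append((code, visit_idx, order))
    return triples


unordered = to_sets(patient)
ordered = serialize(patient)
```

The ordered view keeps the within-visit priority and the visit position for every code, which is the information an order-aware model would need.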

2. Med-BERT

2.1. Model Architecture


Three types of embeddings are taken as inputs to Med-BERT. They are projected from the diagnosis codes, the order of the codes within each visit, and the position of each visit, and are named code embeddings, serialization embeddings, and visit embeddings, respectively.

  • Code embeddings are the low-dimensional representations of each diagnosis code; serialization embeddings denote the relative order (here, the priority order) of each code within a visit; and visit embeddings distinguish each visit in the sequence.
  • The special tokens [CLS] and [SEP] are not used, because the visit embeddings already separate the visits well.
  • Next sentence prediction is not used.
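
The three input embeddings described above can be sketched as simple table lookups. This is only a toy illustration: I assume the three embeddings are combined by element-wise summation as in the original BERT, and the random tables, the dimension DIM, and the function names are hypothetical stand-ins for learned parameters.

```python
import random

DIM = 4  # toy embedding size, far smaller than in a real model


def make_table(keys, dim, rng):
    """Random lookup table standing in for a learned embedding matrix."""
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for k in sorted(keys)}


def embed(triples, code_tab, order_tab, visit_tab):
    """For each (code, visit_index, order) position, sum the three embeddings."""
    inputs = []
    for code, visit_idx, order in triples:
        vec = [
            c + s + v
            for c, s, v in zip(code_tab[code], order_tab[order], visit_tab[visit_idx])
        ]
        inputs.append(vec)
    return inputs


rng = random.Random(0)
# Hypothetical input: (diagnosis code, visit index, priority order in visit).
triples = [("E11.9", 1, 1), ("I10", 1, 2), ("I50.9", 2, 1)]
code_tab = make_table({code for code, _, _ in triples}, DIM, rng)
order_tab = make_table({1, 2}, DIM, rng)   # serialization embeddings
visit_tab = make_table({1, 2}, DIM, rng)   # visit embeddings
inputs = embed(triples, code_tab, order_tab, visit_tab)
```

Because every position carries its own visit embedding, two codes from different visits are distinguishable without any separator token.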

For prediction, a feed-forward layer (FFL) or an RNN prediction layer can be added on top of the sum of the outputs of all the codes within the visits.
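
As a rough sketch of the FFL variant: sum the per-code output vectors, then apply a single linear layer. The sigmoid output, the weights, and the toy output vectors below are my assumptions for a binary disease prediction setting, not values or details from the paper.

```python
import math


def ffl_predict(code_outputs, weights, bias):
    """Sum the per-code output vectors, then apply a linear layer and a sigmoid."""
    dim = len(weights)
    pooled = [sum(vec[i] for vec in code_outputs) for i in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical final-layer outputs for three codes (2-dimensional for brevity).
outputs = [[0.2, -0.1], [0.4, 0.3], [-0.1, 0.5]]
risk = ffl_predict(outputs, weights=[1.0, -1.0], bias=0.0)
```

Swapping the linear layer for a GRU or Bi-GRU over the same outputs would give the RNN variants evaluated in the results.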

2.2. Comparisons

Comparisons with BEHRT and G-BERT
  • BEHRT reports the area under the receiver operating characteristic curve (AUC) using a non-standard definition, which makes its results difficult to compare.
  • G-BERT’s inputs are all single-visit samples, which are insufficient to capture long-term contextual information in EHR.
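
For reference, the standard AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. Below is a minimal pairwise sketch of that standard definition; the labels and scores are made up for illustration, and the function name roc_auc is my own.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)  # 3 of the 4 positive/negative pairs are ranked correctly
```

The quadratic loop is fine for a toy example; for large cohorts a rank-based computation gives the same value much faster.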

3. Results

  • Evaluations are performed on two disease prediction tasks, using three cohorts from two databases. The two tasks are the prediction of heart failure in diabetes patients (DHF) and the prediction of pancreatic cancer (PaCa). Cerner is used for both tasks, forming the DHF-Cerner and PaCa-Cerner cohorts; Truven is used only for the pancreatic cancer prediction task, forming the PaCa-Truven cohort, to evaluate generalizability.
Average AUC and Standard Deviations

For DHF-Cerner, Bi-GRU + Med-BERT obtains the best results.

For PaCa-Cerner, similar trends are observed.

On the PaCa-Truven cohort, performance gains of 1.96–3.78% are still observed, although the average AUC improvements are somewhat smaller than those on PaCa-Cerner.

For PaCa-Cerner, adding Med-BERT to GRU and Bi-GRU yields large improvements at almost all training sizes.

Dependency Visualization

The above figure is an example of the attention pattern in the fourth layer of the Med-BERT model fine-tuned on the PaCa-Cerner dataset, capturing the relevant correlation between diagnosis codes.


