Brief Review — ExBEHRT: Extended Transformer for Electronic Health Records

ExBEHRT Outperforms BEHRT

Sik-Ho Tsang
5 min read · Feb 20, 2024

ExBEHRT: Extended Transformer for Electronic Health Records, by Novartis Oncology AG
2023 ICLR Workshop TML4H (Sik-Ho Tsang @ Medium)

Medical/Clinical/Healthcare NLP/LLM
2017 [LiveQA] … 2023 [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [GPT-4 in Radiology] [ChatGPT & GPT‑4 on USMLE] [Regulatory Oversight of LLM]
==== My Other Paper Readings Are Also Over Here ====

  • ExBEHRT, an extended version of BEHRT (BERT applied to electronic health record data), is proposed.
  • While BEHRT only considers diagnoses and patient age, ExBEHRT extends the feature space to several multi-modal records, namely demographics, clinical characteristics, vital signs, smoking status, diagnoses, procedures, medications and lab tests by applying a novel method to unify the frequencies and temporal dimensions of the different features.


  1. ExBEHRT
  2. Pretraining Cohort
  3. Results


1. ExBEHRT

1.1. Data Format

ExBEHRT Example (D: Diagnosis, P: Procedure, SEP: Separator, PAD: Padding, CLS: Classification)

ExBEHRT is an extension of BEHRT where medical concepts are not concatenated into one long vector, but grouped into separate, learnable embeddings per concept type.

  • This prevents input lengths from exploding when new medical features are added, and gives the model the opportunity to learn which concepts it should focus on.
  • Therefore, the maximum length of the patient journey is defined by the number of diagnosis codes of a patient, regardless of the number of other concepts added to the model.
  • The figure above illustrates three cases:
  1. The number of procedures equals the number of horizontal slots available in the visit (visit 1: two of each). The procedures can therefore be represented as a 1D vector.
  2. The number of procedures exceeds the number of slots available in the visit (visit 2: one diagnosis, two procedures). Here, the procedures fill up the horizontal slots line by line until there are no more procedures left, resulting in a 2D vector of dimensions #slots × ⌈#procedures / #slots⌉.
  3. The number of procedures is smaller than the number of slots available (visit 3: one diagnosis, no procedures). The procedures are represented as a 1D vector and then padded to the number of horizontal slots available.
ExBEHRT Input Example

After reshaping, all procedures and labs of all patients are padded to the same number of rows n to enable batch processing. Before passing the inputs to the model, each token is embedded into a 288-dimensional vector and all tokens are summed vertically.

  • The figure above shows an example from the appendix of the paper.
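The slot-filling and padding rules described above can be sketched in Python. This is a minimal illustration with hypothetical names, not the authors' code:

```python
def fill_slots(procedures, n_slots, pad_token=0):
    """Pack a visit's procedure codes into rows of width n_slots.

    Codes fill the available horizontal slots line by line (case 2),
    and the trailing row is padded when codes run out (cases 1 and 3).
    """
    if not procedures:
        return [[pad_token] * n_slots]  # case 3: no codes, one fully padded row
    rows = []
    for i in range(0, len(procedures), n_slots):
        row = procedures[i:i + n_slots]
        row += [pad_token] * (n_slots - len(row))  # pad the last row if needed
        rows.append(row)
    return rows

# Visit 1: two diagnoses -> two slots, two procedures fit in one row (1D)
print(fill_slots([101, 102], n_slots=2))  # [[101, 102]]
# Visit 2: one diagnosis -> one slot, two procedures wrap onto two rows (2D)
print(fill_slots([101, 102], n_slots=1))  # [[101], [102]]
# Visit 3: one diagnosis, no procedures -> one padded row
print(fill_slots([], n_slots=1))          # [[0]]
```

All patients' rows can then be padded to a common row count n, as described above, before embedding.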

1.2. Model Training

ExBEHRT uses the same model architecture as BEHRT.

  • For pre-training, the standard MLM procedure as in BERT is used.
  • The same number of attention layers (6) and heads (12), as well as the same embedding dimension (288), is used as in BEHRT.

An additional pre-training objective, PLOS, as introduced by Med-BERT, is also applied; this variant is called ExBEHRT+P. PLOS is a binary classification of whether a patient had at least one prolonged length of stay in hospital (>7 days) during their journey.
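The PLOS label itself is simple to derive from a patient's hospitalisation records. A minimal sketch, where the data layout and helper name are assumptions rather than details from the paper:

```python
from datetime import date

def plos_label(stays, threshold_days=7):
    """PLOS pretraining label: 1 if any hospital stay exceeded threshold_days.

    `stays` is a list of (admission_date, discharge_date) tuples;
    this structure is illustrative, not taken from the paper.
    """
    return int(any((end - start).days > threshold_days for start, end in stays))

stays = [(date(2020, 1, 1), date(2020, 1, 4)),    # 3-day stay
         (date(2020, 5, 10), date(2020, 5, 20))]  # 10-day stay -> prolonged
print(plos_label(stays))  # 1
```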

In a second step, the models are fine-tuned on two prediction tasks: Death of a patient within six months after the first cancer diagnosis and re-admission into hospital within 30 or fewer days after heart failure.

  • For these two prediction tasks, the cohorts consist of 437,902 patients (31.67% deceased within 6 months after first cancer diagnosis) for Death in 6M and 503,161 patients (28.24% readmitted within 30 days) for HF readmit.

2. Pretraining Cohort

Pretraining Cohort Statistics
  • Optum de-identified EHR database is used. It is derived from healthcare provider organizations in the United States, which include more than 57 contributing sources and 111,000 sites of care, treating more than 106 million patients.
  • Demographics, medications prescribed and administered, immunizations, allergies, lab results (including microbiology), vital signs and other observable measurements, clinical and hospitalisation administrative data, and coded diagnoses and procedures, are included.
  • The population in Optum EHR is geographically diverse, spanning all 50 US states.
  • Only data points during hospitalisations are collected to ensure data quality and consistency.
  • Each patient must have at least 5 visits with valid ICD-9 or ICD-10 diagnosis codes to ensure sufficient temporal context.

Considering these criteria, the final pre-training cohort consisted of 5.4 million individual patients divided into training (80%), validation (10%) and testing (10%) groups.
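A patient-level 80/10/10 split as described could be sketched as follows. Splitting by patient (rather than by visit) keeps each patient's whole journey inside one partition; the seed and helper name are illustrative:

```python
import random

def split_patients(patient_ids, seed=42):
    """Shuffle patient IDs and split them 80/10/10 into train/val/test.

    Ratios follow the paper; the seed value is an arbitrary choice here.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_patients(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```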

3. Results

  • The metrics used for evaluation are the area under the receiver operating characteristic curve (AUROC), average precision score (APS) as well as the precision at the 0.5 threshold.
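For a small example, these three metrics can be computed from scratch. This is a self-contained sketch (ties in scores are ignored for brevity); in practice one would use a library such as scikit-learn:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney) formulation; assumes no tied scores."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(y_true, scores):
    """APS: mean of the precision values at each true positive's rank."""
    order = np.argsort(-np.asarray(scores))
    y_sorted = np.asarray(y_true)[order]
    hits = np.cumsum(y_sorted)
    return (hits[y_sorted == 1] / (np.flatnonzero(y_sorted) + 1)).mean()

def precision_at_half(y_true, scores):
    """Precision among samples predicted positive at the 0.5 threshold."""
    pred = np.asarray(scores) >= 0.5
    return (np.asarray(y_true)[pred] == 1).mean()

y, s = [0, 0, 1, 1], [0.1, 0.6, 0.4, 0.8]
print(auroc(y, s))             # 0.75
print(average_precision(y, s)) # 0.8333...
print(precision_at_half(y, s)) # 0.5
```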

In all but one metric in one task, ExBEHRT outperforms BEHRT, Med-BERT and other conventional algorithms such as Logistic Regression (LR) and XGBoost when evaluated on this hold-out dataset.

Absolute Sum of Expected Gradients
  • The expected gradients for each of the input features are summed.

For this patient, the diagnoses and procedures (treatments & medications) were by far the most important features. With this visualisation, we can also assess basic biases. For example, gender was not considered an important characteristic.
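Summing absolute attributions per concept type, as in this visualisation, can be sketched as follows. Shapes, names, and the attribution values are illustrative assumptions, not from the paper:

```python
import numpy as np

def attribution_by_concept(attrib, concept_ids, concept_names):
    """Sum absolute expected-gradient attributions per concept type.

    `attrib` holds one attribution value per input token (e.g. from
    expected gradients); `concept_ids` maps each token to its concept
    type (diagnosis, procedure, gender, ...).
    """
    attrib, concept_ids = np.asarray(attrib), np.asarray(concept_ids)
    return {name: float(np.abs(attrib[concept_ids == cid]).sum())
            for cid, name in enumerate(concept_names)}

names = ["diagnosis", "procedure", "gender"]
attrib = [0.5, -0.5, 0.25, 0.25, -0.125]  # per-token attribution values
concepts = [0, 0, 1, 1, 2]                # token -> concept type index
print(attribution_by_concept(attrib, concepts, names))
# {'diagnosis': 1.0, 'procedure': 0.5, 'gender': 0.125}
```

A low total for a demographic feature like gender, as reported for this patient, is what allows the basic bias check described above.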

Absolute Sum of Expected Gradients Along Time

Unsurprisingly, the cancer code C81 had the greatest influence on the result. However, earlier codes such as J40 or 71020 also contribute to the model’s prediction, indicating that the model is able to incorporate information from the entire patient journey into its results.

  • (Please kindly read the paper for more detailed results.)


