Review — BEHRT: Transformer for Electronic Health Records

BEHRT, BERT for Electronic Health Records (EHR)

Sik-Ho Tsang
7 min read · Sep 29


BEHRT: Transformer for Electronic Health Records, by University of Oxford
2020 Nature Sci. Rep., Over 300 Citations (Sik-Ho Tsang @ Medium)

Medical LLM

  • Early indication and detection of diseases can provide patients and carers with the chance of early intervention, better disease management, and efficient allocation of healthcare resources.
  • BEHRT (BERT for EHR) is introduced, which is a deep neural sequence transduction model for electronic health records (EHR), capable of simultaneously predicting the likelihood of 301 conditions in one’s future visits.
  • Recently, an extended version of BEHRT, ExBEHRT, was published in the 2023 ICLR Workshop TML4H. (Hope I can read it in the future.)


  1. Dataset
  2. BEHRT (BERT for EHR)
  3. Results

1. Dataset

1.1. CPRD & HES Data Source

Clinical Practice Research Datalink (CPRD) is used as the source. It contains longitudinal primary care data from a network of 674 GP (general practitioner) practices in the UK, and is linked to secondary care data (i.e., Hospital Episode Statistics, or HES).

  • Around 1 in 10 GP practices (and nearly 7% of the population) in the UK contribute data to CPRD; it covers 35 million patients, among whom nearly 10 million are currently registered patients.
  • HES contains data on hospitalisations, outpatient visits, accident and emergency for all admissions to National Health Service (NHS) hospitals in England. Approximately 75% of the CPRD GP practices in England (58% of all UK CPRD GP practices) participate in patient-level record linkage with HES, which is performed by the Health and Social Care Information Centre.

In this study, the authors only considered data from GP practices that consented to (and hence have) record linkage with HES.

  • The importance of primary care at the centre of the national health system in the UK, the additional linkages, and all the aforementioned properties, make CPRD one of the most suitable EHR datasets in the world for data-driven clinical/medical discovery and machine learning.

1.2. Data Filtering

Data Filtering
  • CPRD: Start with about 8 million patients.
  • Only included patients that are eligible for linkage to HES and meet CPRD’s quality standards, etc., and only kept individuals who have at least 5 visits in their EHR.

At the end of this process, P = 1.6 million patients remain to train and evaluate BEHRT.
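The visit-count criterion above can be sketched in a few lines of Python. This is only a toy illustration, not the authors' pipeline: the `records` dictionary, the patient ids, and the placeholder codes are all hypothetical, and the HES-linkage and quality checks are not modelled.

```python
# Hypothetical toy records: patient id -> list of visits,
# where each visit is a list of diagnosis codes.
MIN_VISITS = 5  # threshold stated in the review


def filter_patients(records, min_visits=MIN_VISITS):
    """Keep only patients whose history has at least `min_visits` visits."""
    return {
        pid: visits
        for pid, visits in records.items()
        if len(visits) >= min_visits
    }


records = {
    "p1": [["d1"], ["d2"], ["d3"], ["d4"], ["d5"]],  # 5 visits: kept
    "p2": [["d1"], ["d2"]],                          # 2 visits: dropped
}
kept = filter_patients(records)
print(sorted(kept))
```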

1.3. Data Labeling

For the data labels, both ICD-10 codes (at level 4) and Read codes are mapped to Caliber codes, using an expert-checked mapping dictionary from University College London. This results in a total of G = 301 diagnosis codes, i.e., a multi-label classification problem over 301 conditions.

  • All these diseases are collected in the set D = {d_i}, i = 1, …, G, where d_i denotes the i-th disease code.
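To make the labeling step concrete, here is a minimal sketch of mapping raw codes to a shared disease vocabulary and building a multi-hot label vector. The mapping entries, code strings, and disease names below are made up for illustration; the real mapping dictionary is not reproduced here, and skipping unmapped codes is my assumption.

```python
# Hypothetical mapping entries: (coding system, raw code) -> unified disease code.
TO_CALIBER = {
    ("icd10", "X00.0"): "disease_a",
    ("read", "R0001"): "disease_a",   # two source codes can map to one disease
    ("read", "R0002"): "disease_b",
}

# Stands in for D = {d_1, ..., d_G}; the real G is 301.
DISEASES = ["disease_a", "disease_b", "disease_c"]
INDEX = {d: i for i, d in enumerate(DISEASES)}


def to_multi_hot(raw_codes):
    """Return a 0/1 vector of length G with a 1 for every mapped disease."""
    vec = [0] * len(DISEASES)
    for code in raw_codes:
        caliber = TO_CALIBER.get(code)
        if caliber is not None:       # assumption: unmapped codes are skipped
            vec[INDEX[caliber]] = 1
    return vec


label = to_multi_hot([("icd10", "X00.0"), ("read", "R0002"), ("read", "UNKNOWN")])
print(label)
```

Several entries of the vector can be 1 at once, which is what makes the task multi-label.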

1.4. Final Data

Input Data Format Into BEHRT

In the final data, for each patient p ∈ {1, 2, …, P} the medical history consists of a sequence of visits to GP and hospitals; each visit can contain concepts such as diagnoses, medications, measurements and more.

  • In this paper, each patient’s EHR is denoted as V_p = {v_p^1, v_p^2, …, v_p^{n_p}},
  • where n_p denotes the number of visits in patient p’s EHR.
  • v_p^j contains the diagnoses in the j-th visit, which can be a list of m_p^j diagnoses: v_p^j = {d_1, …, d_{m_p^j}}.
  • For the input data in BERT, there are CLS and SEP special tokens for NLP. In BEHRT’s case, CLS marks the start of the medical history, and SEP separates consecutive visits.
  • After defining the input data, this data can be fed to BERT for training.
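The input format described above can be sketched as follows. This is a toy illustration with placeholder codes; placing a SEP after every visit, including the last one, is my assumption about the exact layout.

```python
def build_tokens(visits):
    """Flatten a patient's visits into one token sequence.

    visits: list of visits, each a list of diagnosis codes.
    CLS marks the start of the history; SEP closes each visit
    (assumption: the last visit is also followed by SEP).
    """
    tokens = ["CLS"]
    for visit in visits:
        tokens.extend(visit)
        tokens.append("SEP")
    return tokens


# Hypothetical patient with three visits.
patient_visits = [["d1", "d2"], ["d3"], ["d4", "d5"]]
tokens = build_tokens(patient_visits)
print(tokens)
```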


2. BEHRT (BERT for EHR)

BEHRT Model Architecture
  • Given patient’s past EHR, BEHRT predicts his/her future diagnoses (if any), as a multi-label classification problem.

Since BERT is a Transformer-based model, it is well suited to four challenges of EHR data: (C.1) complex and nonlinear interactions among past, present and future concepts; (C.2) long-term dependencies among concepts (e.g., diseases occurring early in a patient’s history affecting events far later in the future); (C.3) the difficulty of representing multiple heterogeneous concepts of variable sizes and forms; and (C.4) the irregular intervals between consecutive visits.

  • By treating diagnoses as words, each visit as a sentence, and a patient’s entire medical history as a document, the multi-head self-attention, positional encoding, and masked language model (MLM) of BERT carry over naturally to EHR.

2.1. Embedding Layers

The embedding layer in BEHRT, as shown in Fig. 3 above, learns the evolution of one’s EHR through a combination of four embeddings: disease, “position”, age, and “visit segment” (or “segment”, for short). This combination enables BEHRT to define a representation that can capture one’s EHR in as much detail as possible.

  • Age and visit segment are two embeddings that are unique to BEHRT.
  • Visit segment can be either A or B, two symbols that represent two trainable vectors in the segment embeddings; it alternates between consecutive visits.
  • That is, within a given visit, the position, age, and segment embeddings are identical for every token; this makes BEHRT order-invariant for intra-visit concepts.
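As a rough sketch of how the auxiliary inputs line up with the disease tokens, the snippet below builds position, segment, and age sequences that are constant within a visit. The function name, the integer ages, and the choice to give CLS the values of the first visit are my assumptions for illustration, not details taken from the paper.

```python
def build_aux_ids(visits, ages):
    """Build per-token position, segment, and age sequences.

    visits: list of visits, each a list of diagnosis codes.
    ages:   the patient's age at each visit (same length as visits).
    Every token of visit j (including its closing SEP) shares the same
    position j, the same segment (A/B alternating), and the same age.
    Assumption: CLS takes the values of the first visit.
    """
    tokens, positions, segments, age_ids = ["CLS"], [0], ["A"], [ages[0]]
    for j, (visit, age) in enumerate(zip(visits, ages)):
        segment = "A" if j % 2 == 0 else "B"
        for tok in visit + ["SEP"]:
            tokens.append(tok)
            positions.append(j)
            segments.append(segment)
            age_ids.append(age)
    return tokens, positions, segments, age_ids


tokens, positions, segments, age_ids = build_aux_ids(
    [["d1", "d2"], ["d3"]], ages=[60, 61]
)
for row in zip(tokens, positions, segments, age_ids):
    print(row)
```

Because "d1" and "d2" receive identical position, segment, and age values, swapping them inside the visit changes nothing except the order of two otherwise indistinguishable slots, which is the order-invariance mentioned above.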

2.2. Masked Language Model (MLM) Pretraining

  • Disease, age, and segment embeddings are initialized randomly, and the positional encoding stems from a pre-determined encoding of position.
  • When training the network, and specifically the embeddings, on the MLM task, 86.5% of the disease words are left unchanged, 12% are replaced with [mask], and the remaining 1.5% are replaced with randomly chosen disease words.
  • The average precision score is calculated over all labels and over all patients. We see in Fig. 3b that the MLM classifier maps the tokens T1…TN to the masked words.

If the model can predict the masked words, it has plausibly learnt the structure of EHR data well.
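The masking proportions quoted above (12% masked, 1.5% replaced at random, 86.5% left as is) can be sketched like this. It is a simplified illustration: leaving CLS and SEP untouched, and treating only masked or replaced positions as prediction targets, are my assumptions.

```python
import random


def mask_tokens(tokens, vocab, rng, p_mask=0.12, p_random=0.015):
    """Apply MLM-style corruption to a token sequence.

    Returns (inputs, targets): targets[i] is the original token where the
    input was masked or randomly replaced, and None elsewhere.
    Assumption: special tokens CLS/SEP are never corrupted.
    """
    inputs, targets = [], []
    for tok in tokens:
        if tok in ("CLS", "SEP"):
            inputs.append(tok)
            targets.append(None)
            continue
        r = rng.random()
        if r < p_mask:                      # 12%: replace with [mask]
            inputs.append("[mask]")
            targets.append(tok)
        elif r < p_mask + p_random:         # 1.5%: replace with a random disease
            inputs.append(rng.choice(vocab))
            targets.append(tok)
        else:                               # 86.5%: leave unchanged
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


rng = random.Random(0)  # seeded only to make the toy run repeatable
vocab = ["d1", "d2", "d3", "d4"]
inputs, targets = mask_tokens(["CLS", "d1", "d2", "SEP", "d3", "SEP"], vocab, rng)
print(inputs)
print(targets)
```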

2.3. Disease Prediction Tasks

  • To provide a comprehensive evaluation of BEHRT, the authors assess it on 3 predictive tasks: prediction of diseases in the next visit (T1), prediction of diseases in the next 6 months (T2), and prediction of diseases in the next 12 months (T3).
  • There are 699K, 391K, and 342K patients for T1, T2, and T3, respectively. 3 BEHRT models are trained separately for T1, T2, and T3.
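To illustrate how the three targets differ, here is a small sketch that splits a visit history at a cut point and collects the diseases to predict. Representing time as a month number, the `cut` argument, and measuring the window from the last input visit are all my assumptions; the review does not spell out how the authors pick the split.

```python
def future_labels(visits, cut, window_months=None):
    """Collect the diseases to predict after the first `cut` visits.

    visits: list of (month, [codes]) tuples, sorted by month.
    window_months=None -> T1: diseases in the next visit only.
    window_months=6/12 -> T2/T3: diseases in all visits within that many
    months after the last input visit (assumed definition).
    """
    history, future = visits[:cut], visits[cut:]
    if window_months is None:
        target = future[:1]
    else:
        last_month = history[-1][0]
        target = [v for v in future if v[0] - last_month <= window_months]
    return sorted({code for _, codes in target for code in codes})


# Hypothetical patient: visits at months 0, 2, 5, 10 and 20.
visits = [(0, ["d1"]), (2, ["d2"]), (5, ["d3"]), (10, ["d4"]), (20, ["d5"])]
t1 = future_labels(visits, cut=2)
t2 = future_labels(visits, cut=2, window_months=6)
t3 = future_labels(visits, cut=2, window_months=12)
print(t1, t2, t3)
```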

2.4. Model Parameters

  • Using Bayesian optimization, the best architecture found has 6 layers, 12 attention heads, an intermediate layer size of 512, and a hidden size of 288.
  • Pretraining runs for 100 epochs, after which the model reaches a precision score of 0.6597.
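For readers unfamiliar with the metric, the snippet below is a minimal sketch of average precision for a single ranked list of predictions. It follows the common definition (mean of the precision values at the ranks of the true positives) and may differ in detail from how the authors average over labels and patients.

```python
def average_precision(y_true, scores):
    """Average precision for one list of binary labels and predicted scores.

    Items are ranked by descending score; the precision at each rank that
    holds a true positive is averaged over the number of positives.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


# Toy example: positives are ranked 1st and 3rd.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))
```

Here the two positives sit at ranks 1 and 3, giving precisions 1/1 and 2/3, so the average is 5/6.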

3. Results

3.1. t-SNE Visualization

t-SNE Visualization
  • The colours in Fig. 4 represent the original Caliber disease chapters.
  • For instance, diseases that are unique to women (e.g., endometriosis, dysmenorrhea, menorrhagia, …) are quite distant from those that are unique to men (e.g., erectile dysfunction, primary malignancy of prostate, …).

The clinical researcher notes that while many of the most similar associations had clear overlap in symptomatology, some were graded to be poor disease associations. Overall, the researcher concludes that BEHRT has a strong ability to understand the latent characteristics of the disease, without them being explicitly given to it.
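One simple way to find the "most similar" disease for a given embedding is a nearest-neighbour lookup. The sketch below assumes cosine similarity as the measure and uses made-up 2-dimensional vectors; the real embeddings have the model's hidden size and are learned during pretraining.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name, embeddings):
    """Return the other disease whose embedding is closest to `name`."""
    others = (other for other in embeddings if other != name)
    return max(others, key=lambda o: cosine(embeddings[name], embeddings[o]))


# Hypothetical toy embeddings, for illustration only.
embeddings = {
    "disease_a": [1.0, 0.0],
    "disease_b": [0.9, 0.1],
    "disease_c": [0.0, 1.0],
}
nearest = most_similar("disease_a", embeddings)
print(nearest)
```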

3.2. Attention and Interpretability

Attention Correlation Analysis

For patient A in Fig. 5, for example, the self-attention mechanism has shown strong connections between rheumatoid arthritis and enthesopathies and synovial disorders (far in the future of the patient). This is a great example of where attention can go beyond recent events and find long-range dependencies among diseases.
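To give a feel for what such an attention plot shows, here is a minimal sketch of scaled dot-product attention weights for one query token. The vectors are toy values; in the model the queries and keys are learned projections of the token representations, and there are 12 heads per layer.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: the query aligns best with the first key.
weights = attention_weights([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
print([round(w, 3) for w in weights])
```

Nothing in this computation depends on how far apart two tokens are in the sequence, which is why a strong weight can link a diagnosis to one many visits away.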

3.3. Disease Prediction

  • The three models compared on the subsequent supervised prediction tasks (BEHRT, Deepr and RETAIN) were trained (fine-tuned) for 15–20 epochs.

The above table demonstrates BEHRT’s superior predictive power compared to two of the most successful approaches in the literature (i.e., Deepr and RETAIN).

Disease-Wise Precision

BEHRT is able to make predictions with relatively high precision and recall for diseases such as epilepsy (0.016), primary malignancy prostate (0.011), polymyalgia rheumatica (0.013), hypo or hyperthyroidism (0.047), and depression (0.0768).

  • Lastly, for the subset of diseases in the label that occur for the first time (first incidence), the predictive performance of the three models is reported in Table 2.

BEHRT shows superior predictive performance in all three tasks with respect to RETAIN and Deepr.


