Review — CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison

CheXpert, A Large Chest X-Ray Dataset

Sik-Ho Tsang
Jul 18, 2022

CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison, by Stanford University
2019 AAAI, Over 1000 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Medical Image Classification

  • A large dataset, CheXpert (Chest eXpert), is collected, which contains 224,316 chest radiographs of 65,240 patients.
  • A labeler is designed to automatically detect the presence of 14 observations in radiology reports, capturing the uncertainties inherent in radiograph interpretation.
  • Different approaches using the uncertainty labels are investigated for training convolutional neural networks.


  1. CheXpert (Chest eXpert) Dataset
  2. Label Extraction from Radiology Reports
  3. Uncertainty Approaches
  4. Results

1. CheXpert (Chest eXpert) Dataset

The CheXpert task is to predict the probability of different observations from multi-view chest radiographs
  • The CheXpert task is to predict the probability of 14 different observations from multi-view chest radiographs.
The CheXpert dataset consists of 14 labeled observations
  • This dataset consists of 224,316 chest radiographs of 65,240 patients labeled for the presence of 14 observations as positive, negative, or uncertain, as shown above.
  • The studies were collected from Stanford Hospital and were performed between October 2002 and July 2017 in both inpatient and outpatient centers.
  • “Pneumonia”, despite being a clinical diagnosis, was included as a label. The “No Finding” observation was intended to capture the absence of all pathologies.

2. Label Extraction from Radiology Reports

  • An automated rule-based labeler is developed to extract observations from the free text radiology reports to be used as structured labels for the images.
  • The labeler is set up in three distinct stages: mention extraction, mention classification, and mention aggregation.
Output of the labeler when run on a report sampled from the dataset
  1. Mention Extraction: The labeler extracts mentions from a list of observations from the Impression section of radiology reports, which summarizes the key findings in the radiographic study.
  2. Mention Classification: After extracting mentions of observations, the aim is to classify them as negative (“no evidence of pulmonary edema, pleural effusions or pneumothorax”), uncertain (“diffuse reticular pattern may represent mild interstitial pulmonary edema”), or positive (“moderate bilateral effusions and bibasilar opacities”). The ‘uncertain’ label can capture both the uncertainty of a radiologist in the diagnosis as well as ambiguity inherent in the report (“heart size is stable”).
  3. Mention Aggregation: The classifications of all mentions of an observation are aggregated to arrive at a final label for each of the 14 observations, which consist of 12 pathologies as well as the “Support Devices” and “No Finding” observations, as shown in the table above.
  • (Please feel free to read the paper directly for the details.)
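To make the three stages concrete, here is a toy sketch in Python. It is not the authors' labeler: the phrase lists, the cue lists, and the function names are all hypothetical and far smaller than the expert-curated rules in the paper, and classification is done per sentence rather than per mention. The aggregation rule (positive wins over uncertain, uncertain wins over negative, no mention gives a blank) reflects my reading of the paper and should be treated as an assumption.

```python
import re

# Hypothetical, heavily simplified phrase and cue lists (assumptions of this sketch).
OBSERVATION_PHRASES = {
    "Edema": ["edema"],
    "Pleural Effusion": ["effusion", "effusions"],
    "Pneumothorax": ["pneumothorax"],
}
NEGATION_CUES = ["no evidence of", "no", "without"]
UNCERTAINTY_CUES = ["may represent", "possible", "cannot exclude"]


def _contains(text, phrase):
    """Whole-word, case-insensitive phrase match."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text.lower()) is not None


def extract_mentions(sentence):
    """Stage 1: list the observations mentioned in a sentence."""
    return [obs for obs, phrases in OBSERVATION_PHRASES.items()
            if any(_contains(sentence, p) for p in phrases)]


def classify_mention(sentence):
    """Stage 2: classify at sentence level as 0 (negative), 'u' or 1 (positive)."""
    if any(_contains(sentence, cue) for cue in NEGATION_CUES):
        return 0
    if any(_contains(sentence, cue) for cue in UNCERTAINTY_CUES):
        return "u"
    return 1


def label_report(impression):
    """Stage 3: aggregate mention classes into one label per observation."""
    classes = {obs: [] for obs in OBSERVATION_PHRASES}
    for sentence in re.split(r"[.;]\s*", impression):
        if not sentence:
            continue
        for obs in extract_mentions(sentence):
            classes[obs].append(classify_mention(sentence))
    labels = {}
    for obs, found in classes.items():
        if 1 in found:
            labels[obs] = 1
        elif "u" in found:
            labels[obs] = "u"
        elif 0 in found:
            labels[obs] = 0
        else:
            labels[obs] = None  # blank: observation never mentioned
    return labels
```

Run on the example sentences quoted above, the negated sentence yields 0 for all three observations, while the additional “may represent ... edema” sentence turns Edema into u.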

3. Uncertainty Approaches

  • The training labels in the dataset for each observation are either 0 (negative), 1 (positive), or u (uncertain).
  • Multiple uncertainty approaches are investigated as below.

3.1. Ignoring (U-Ignore)

  • A simple approach to handling uncertainty is to ignore the u labels during training, which serves as a baseline.
  • The sum of the masked binary cross-entropy losses over the observations is optimized.
  • Formally, the loss for an example X is given by:
    $$L(X, y) = -\sum_{o} \mathbb{1}\{y_o \neq u\} \left[ y_o \log p(Y_o = 1 \mid X) + (1 - y_o) \log p(Y_o = 0 \mid X) \right]$$
  • where X is the input image, y is the vector of labels of length 14 for the study, and the sum is taken over all 14 observations.
  • Ignoring the uncertainty label is analogous to the listwise (complete case) deletion method for imputation.

This approach ignores a large proportion of labels, reducing the effective size of the dataset.
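As a minimal sketch of the masking idea, the function below sums the binary cross-entropy terms over the observations of one study and simply skips every observation labeled u. The function name, the use of the string "u" as the uncertainty marker, and the clipping constant eps are assumptions made for illustration; a real training loop would compute this on tensors inside a deep learning framework.

```python
import math

U = "u"  # marker for an uncertain label (an assumption of this sketch)


def u_ignore_loss(probs, labels, eps=1e-12):
    """Masked BCE for one study.

    probs  -- predicted probability of the positive class per observation
    labels -- 0, 1 or "u" per observation
    """
    loss = 0.0
    for p, y in zip(probs, labels):
        if y == U:
            continue  # masked out: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss
```

Because masked terms contribute no loss, they also contribute no gradient, which is exactly why the effective amount of supervision shrinks.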

3.2. Binary Mapping (U-Zeroes, U-Ones)

  • All instances of u are either mapped to 0 (U-Zeroes model), or all to 1 (U-Ones model).

It is expected that this approach can distort the decision making of classifiers and degrade their performance.
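The binary mapping itself is a one-line preprocessing step. The helper below is a hypothetical illustration (the name map_uncertain and the "u" marker are my own choices): value=0 gives the U-Zeroes labels and value=1 gives the U-Ones labels, after which an ordinary, unmasked binary cross-entropy can be used.

```python
def map_uncertain(labels, value):
    """Replace every "u" label with `value` (0 for U-Zeroes, 1 for U-Ones)."""
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return [value if y == "u" else y for y in labels]
```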

3.3. Self-Training (U-SelfTrained)

  • One framework for approaching uncertainty labels is to treat them as unlabeled examples, which lends itself to semi-supervised learning.
  • A self-training approach (U-SelfTrained) is considered. In this approach, a model is first trained using the U-Ignore approach.
  • Then, the model is used to make predictions that re-label each of the uncertainty labels.
  • When the prediction is above a certain threshold, those samples are used as labeled data for training. This process repeats until convergence.
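One relabeling round of such a self-training scheme could look like the sketch below. It follows the threshold description above rather than the paper's exact procedure, which may differ. The function name, the threshold of 0.9, and the symmetric treatment of confident negatives are assumptions. Existing 0 and 1 labels are never overwritten.

```python
def relabel_uncertain(labels, probs, threshold=0.9):
    """Turn "u" labels into 0/1 where the current model is confident.

    labels -- 0, 1 or "u" per observation
    probs  -- predicted positive-class probabilities from the current model
    """
    relabeled = []
    for y, p in zip(labels, probs):
        if y != "u":
            relabeled.append(y)        # keep the original 0/1 label
        elif p >= threshold:
            relabeled.append(1)        # confident positive
        elif p <= 1.0 - threshold:
            relabeled.append(0)        # confident negative (assumed symmetric rule)
        else:
            relabeled.append("u")      # still uncertain: stays masked
    return relabeled
```

A full loop would alternate between training on the current labels and calling this function until the labels stop changing.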

3.4. 3-Class Classification (U-MultiClass)

  • In this approach (U-MultiClass model), for each observation, the model outputs the probability of each of the 3 possible classes {p0, p1, pu}.
  • The loss is set up as the mean of the multi-class cross-entropy losses over the observations.
  • At test time, for the probability of a particular observation, the probability of the positive label is output after applying a softmax restricted to the positive and negative classes.
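The test-time step can be sketched as follows: given the three logits of one observation, the uncertain logit is dropped and a softmax is taken over the negative and positive logits only. The function name and argument order are assumptions for illustration.

```python
import math


def positive_probability(logit_neg, logit_pos, logit_unc):
    """Positive-class probability from a softmax restricted to {negative, positive}.

    logit_unc is accepted to mirror the 3-class output but is deliberately unused.
    """
    m = max(logit_neg, logit_pos)          # subtract the max for numerical stability
    e_neg = math.exp(logit_neg - m)
    e_pos = math.exp(logit_pos - m)
    return e_pos / (e_neg + e_pos)
```

Note that the result does not depend on how much probability mass the model puts on the uncertain class.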

4. Results

4.1. Labeler Results

Performance of the labeler of NIH and the proposed labeler on the report evaluation set on tasks of mention extraction, uncertainty detection, and negation detection, as measured by the F1 score
  • The proposed labeling algorithm significantly outperforms the NIH labeler on Atelectasis and Cardiomegaly, and achieves notably better performance on Consolidation and Pneumonia.

4.2. Performance of Uncertainty Approaches

AUROC scores on the validation set of the models trained using different approaches to using uncertainty labels
  • ResNet-152, DenseNet-121, Inception-v4, and SE-ResNeXt-101 are tried. DenseNet-121 produces the best results, so it is used for all experiments.
  • Images are fed into the network with size 320×320 pixels.
  • Batches are sampled using a fixed batch size of 16 images. Models are trained for 3 epochs, and checkpoints are saved every 4800 iterations.
  • U-Ones obtains the best performance on Atelectasis while U-MultiClass obtains the best performance on Cardiomegaly.

However, none of the approaches achieves significantly better results on the remaining Consolidation, Edema and Pleural Effusion.

4.3. Comparison to Radiologists

Performance of 3 radiologists is compared to the model against the test set ground truth in both the ROC and the PR space
  • The labeler's uncertain labels are converted to binary labels in two ways: an upper bound on the labels' performance is computed by assigning each uncertain label its ground-truth value, and a lower bound by assigning it the opposite of the ground-truth value.
  • The two operating points on the curves, denoted LabelU and LabelL respectively, are plotted.
  • The model achieves the best AUC on Pleural Effusion (0.97), and the worst on Atelectasis (0.85). The AUCs of all other observations are at least 0.9.
  • The model achieves the best AUPRC on Pleural Effusion (0.91) and the worst on Consolidation (0.44).
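The LabelU and LabelL conversion described above can be sketched in a few lines. The function name and the "u" marker are assumptions. Each uncertain label is resolved in the most favorable way (equal to the ground truth) for the upper bound and in the least favorable way (opposite of the ground truth) for the lower bound.

```python
def bound_labels(labeler_output, ground_truth):
    """Return (upper, lower) binarizations of labels that may contain "u".

    labeler_output -- 0, 1 or "u" per study for one observation
    ground_truth   -- binary ground-truth values (0 or 1) for the same studies
    """
    upper = [g if y == "u" else y for y, g in zip(labeler_output, ground_truth)]
    lower = [1 - g if y == "u" else y for y, g in zip(labeler_output, ground_truth)]
    return upper, lower
```

Sensitivity and specificity computed from the two resulting label vectors give the two operating points.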

On Cardiomegaly, Edema, and Pleural Effusion, the model achieves higher performance than all 3 radiologists but not their majority vote.

4.4. Visualizations

The final model localizes findings in radiographs using Gradient-weighted Class Activation Mappings
  • Grad-CAM is used to visualize the areas of the radiograph which the model predicts to be most indicative of each observation, as shown above.
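For readers unfamiliar with Grad-CAM, the core computation is small: each feature map of the last convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and a ReLU keeps only the positively contributing regions. The sketch below uses plain nested lists so it runs without any framework. In practice the activations and gradients would come from hooks on the trained network, and the heatmap would be upsampled to the image size. The function name and the input layout are assumptions.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from K feature maps and their gradients.

    activations, gradients -- non-empty lists of K maps, each an H x W nested list
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for fmap, grad in zip(activations, gradients):
        # Channel weight: global average pooling of the gradient map.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU: keep only locations that increase the class score.
    return [[max(v, 0.0) for v in row] for row in heatmap]
```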


