Brief Review — CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays

CheXbreak, Misclassification Identifier

Sik-Ho Tsang
5 min read · Dec 30, 2022


MLHC (Images from MLHC)

CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays,
CheXbreak, by Stanford University,
2021 MLHC (Sik-Ho Tsang @ Medium)
Medical Image Analysis, Medical Imaging, Image Classification, CheXpert

  • An obstacle to integrating deep learning models for chest x-ray interpretation into clinical settings is the lack of understanding of their failure modes. Some patient groups are easily misclassified.
  • It is found that patient age and the radiographic finding of lung lesion, pneumothorax or support devices are statistically relevant features for predicting misclassification for some chest x-ray models.
  • Misclassification predictors are developed on chest x-ray models using their model outputs and clinical features.
  • This is a paper from Prof. Andrew Ng's research group.


  1. CheXbreak Misclassification Identifier
  2. Analytic Results
  3. Flipping Predicted Misclassifications Results

1. CheXbreak

1.1. Dataset & Models

  • CheXpert is used as the dataset.
  • 10 CheXpert-trained models (or CheXpert models) are used in the experiments, each selected sequentially from the top of the leaderboard.

1.2. Pipeline

CheXbreak: Misclassification identifiers based on outputs of top CheXpert models and clinical features available for each study.
  • Misclassification identifiers are built based on outputs of top CheXpert models and clinical features available for each study.
  • These identifiers are then used to selectively flip results for performance improvement.

1.3. Model Misclassification Based on Clinical Features

Example of Misclassification Ground Truth for a single study.
  • Even though clinical features such as patient age and sex are widely available, developers may not use them when training their models.

The authors seek to determine whether clinical features are valuable for predicting misclassification.

  • Logistic regression models are developed on available clinical features {age, sex, presence of lateral view, number of AP views, number of PA views} to predict the probability of misclassification and evaluate which features provide new meaningful information to the models.
  • A logistic regression model is constructed for each of the 10 models and for each of the 5 tasks, resulting in a total of 50 statistical models.
  • The statistical significance of each feature is evaluated over all the statistical models.
  • A clinical feature is considered statistically significant when its p-value is less than 0.05. Odds ratios are averaged across models of the same task, and their respective 95% confidence intervals are calculated.
  • To obtain the misclassification ground truth, 504 studies from the combined CheXpert validation and test sets are randomly selected as a “training” set, and a prediction threshold that maximizes Youden’s index is found for each disease. This threshold is then used to binarize the outputs of all models across all 700 studies.
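As a rough illustration of the last step, the sketch below picks the threshold that maximizes Youden's index (J = sensitivity + specificity − 1) and then marks each study as misclassified or not. The function names and the toy scores are made up for this example; this is not the authors' code, only a minimal pure-Python version of the idea under those assumptions.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A study is predicted positive when its score is >= threshold.
    Hypothetical helper for illustration only.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to compute Youden's index")

    best_j, best_t = -1.0, None
    # Every distinct score is a candidate threshold.
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


def misclassification_labels(y_true, y_score, threshold):
    """1 where the binarized prediction disagrees with the label, else 0."""
    return [int((s >= threshold) != bool(y)) for y, s in zip(y_true, y_score)]


# Toy data (assumed): labels and model probabilities for one disease.
labels = [0, 0, 1, 1, 1]
scores = [0.1, 0.6, 0.5, 0.7, 0.9]
threshold, j = youden_threshold(labels, scores)
misclassified = misclassification_labels(labels, scores, threshold)
```

In the setting described above, the threshold would be fitted on the 504 "training" studies and then reused to binarize all 700 studies; the toy example fits and applies it on the same five points only to keep the sketch short.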

2. Analytic Results

2.1. Clinical Features

Average odds ratios for models that use a given feature to predict one of the 5 diseases.

Age is a significant predictor of misclassification for Atelectasis on five models, for Cardiomegaly on seven models, for Pleural Effusion on five models, for Consolidation on five models and for Edema on two models.

  • An odds ratio greater than 1 indicates that as age increases, so does the incidence of misclassification.
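To make the odds-ratio reading concrete: in a logistic regression, the odds ratio of a feature is the exponential of its fitted coefficient. The sketch below converts a coefficient and its standard error into an odds ratio, a Wald 95% confidence interval, and a two-sided Wald p-value. The function names and the example numbers are assumptions for illustration; the review does not say exactly how the paper computes its intervals.

```python
import math


def odds_ratio_with_ci(beta, std_err, z=1.96):
    """Odds ratio exp(beta) with a Wald confidence interval.

    z=1.96 corresponds to a 95% interval under a normal approximation.
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * std_err)
    upper = math.exp(beta + z * std_err)
    return odds_ratio, lower, upper


def wald_p_value(beta, std_err):
    """Two-sided p-value for H0: beta == 0 under a normal approximation."""
    z_stat = abs(beta / std_err)
    return math.erfc(z_stat / math.sqrt(2))


# Hypothetical coefficient for age (per year) and its standard error.
age_or, age_lo, age_hi = odds_ratio_with_ci(beta=0.02, std_err=0.008)
age_p = wald_p_value(beta=0.02, std_err=0.008)
```

With these made-up numbers the odds ratio is slightly above 1 and the interval excludes 1, which is the pattern described above: higher age goes with higher odds of misclassification.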

The presence of lateral views is a significant predictor of misclassification by p-value for two models on Edema.

  • An additional lateral view greatly decreases the rate of misclassification on the few models for which it is statistically significant.

2.2. Other Features

Average odds ratios for models that use a given feature to predict one of the 5 diseases.
  • A procedure nearly identical to the clinical features experiment is used to determine the association between the presence of other diseases and misclassification by a given model on a given task.

The presence or absence of Support Devices is a statistically significant predictor of misclassification for three models detecting Cardiomegaly, for four models detecting Pleural Effusion, and for two models detecting Consolidation.

2.3. Model Outputs + Clinical Features

Misclassification Identifier Performance, reported as 95% confidence intervals averaged over 10 CheXpert models.
  • LightGBM classifiers are trained with three different types of input data and their performance is evaluated on the task of predicting misclassification.
  • The first is trained on clinical features only (“clinical only”); the second on clinical features and the model output value of the disease of interest (“same label”); and the third on clinical features and the model output of all diseases (“all labels”).
  • The “clinical only” identifier performs the worst.
  • The “same label” and “all labels” identifiers perform very similarly and score the highest AUROC across all diseases, with a mean of 0.881 and 0.880, respectively.
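The three input variants differ only in which columns are fed to the classifier. The sketch below builds one feature row per variant from a study record. The dictionary layout, the field names, and the choice to use the five tasks named in this review for "all labels" are assumptions for illustration; the resulting rows could then be passed to a gradient-boosting classifier such as LightGBM, which is not shown here.

```python
# Clinical features listed in Section 1.3; the key names are hypothetical.
CLINICAL = ["age", "sex", "has_lateral", "num_ap", "num_pa"]

# The five tasks named in this review (assumed to be the "all labels" set).
TASKS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pleural Effusion"]


def build_features(study, task, variant):
    """Return the feature row for one study, one task, and one input variant."""
    row = [study["clinical"][name] for name in CLINICAL]
    if variant == "clinical only":
        return row
    if variant == "same label":
        # Add only the model output for the disease of interest.
        return row + [study["outputs"][task]]
    if variant == "all labels":
        # Add the model outputs for every task.
        return row + [study["outputs"][t] for t in TASKS]
    raise ValueError(f"unknown variant: {variant}")


# Toy study record (made-up values).
study = {
    "clinical": {"age": 63, "sex": 1, "has_lateral": 0, "num_ap": 1, "num_pa": 0},
    "outputs": {
        "Atelectasis": 0.31,
        "Cardiomegaly": 0.12,
        "Consolidation": 0.44,
        "Edema": 0.71,
        "Pleural Effusion": 0.58,
    },
}
same_label_row = build_features(study, "Edema", "same label")
```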

3. Flipping Predicted Misclassifications Results

A confusion matrix on flipping can be broken down into four sub-matrices based on presence of disease and partition.
  • Studies predicted as misclassified have their predictions flipped, while the rest are treated as correctly classified and left unchanged. The intuition is to:

1) flip more wrong model predictions than correct ones and 2) increase the number of true-positive disease predictions as much as possible.
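The flipping step itself is simple and can be sketched as follows: wherever the misclassification identifier flags a study, the binary prediction is inverted, and F1 is compared before and after. This is a minimal sketch with made-up data and hypothetical function names; the paper's actual corrective algorithm (for example, how the identifier's output is thresholded) may differ.

```python
def flip_predictions(preds, flagged):
    """Invert the binary prediction wherever the identifier flags the study."""
    return [1 - p if f else p for p, f in zip(preds, flagged)]


def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (assumed values).
y_true = [1, 0, 1, 1, 0]
preds = [0, 0, 1, 0, 1]    # binarized chest x-ray model predictions
flagged = [1, 0, 0, 1, 0]  # identifier output: 1 = predicted misclassified

flipped = flip_predictions(preds, flagged)
f1_before = f1_score(y_true, preds)
f1_after = f1_score(y_true, flipped)
```

In this toy case both flagged studies were false negatives, so flipping them adds two true positives and raises F1. A flagged study that was actually correct would be turned into an error, which is why flipping more wrong predictions than correct ones matters.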

F1 change for CheXpert models after flipping on Consolidation prediction with the “same label” (left) and on Edema with “all labels” (right) identifier.
F1 change averaged over all models (with 95% confidence interval) after flipping.
  • When flipping with the “same label” misclassification identifier, we see a statistically significant F1 improvement on Consolidation.

Overall, the results suggest that model outputs can be improved by building misclassification predictors based on the logits and clinical features and then following a corrective algorithm.


