Brief Review — CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays
CheXbreak, Misclassification Identifier
CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays,
CheXbreak, by Stanford University,
2021 MLHC (Sik-Ho Tsang @ Medium)
Medical Image Analysis, Medical Imaging, Image Classification, CheXpert
- An obstacle to the integration of deep learning models for chest x-ray interpretation into clinical settings is the lack of understanding of their failure modes: some patient groups are easily misclassified.
- It is found that patient age and the radiographic findings of lung lesion, pneumothorax, or support devices are statistically relevant features for predicting misclassification for some chest x-ray models.
- Misclassification predictors are developed on chest x-ray models using their model outputs and clinical features.
- This is a paper from Prof. Andrew Ng's research group.
Outline
- CheXbreak Misclassification Identifier
- Analytic Results
- Flipping Predicted Misclassifications Results
1. CheXbreak
1.1. Dataset & Models
- CheXpert is used as the dataset.
- 10 CheXpert-trained models (or CheXpert models) are used in the experiments, each selected sequentially from the top of the leaderboard.
1.2. Pipeline
- Misclassification identifiers are built based on outputs of top CheXpert models and clinical features available for each study.
- These identifiers are then used to selectively flip results for performance improvement.
1.3. Model Misclassification Based on Clinical Features
- Although such clinical features are widely available, they may not be used by developers when training their models.
The authors seek to determine whether clinical features are valuable for predicting misclassification.
- Logistic regression models are developed on available clinical features {age, sex, presence of lateral view, number of AP views, number of PA views} to predict the probability of misclassification and evaluate which features provide new meaningful information to the models.
- A logistic regression model is constructed for each of the 10 models and for each of the 5 tasks, resulting in a total of 50 statistical models.
- The statistical significance of each feature is evaluated over all the statistical models.
- Statistical significance is demonstrated when a clinical feature exhibits a p-value less than 0.05. The average odds ratio is computed across models of the same task, along with its 95% confidence interval.
- To obtain the misclassification ground truth, 504 studies of the combined CheXpert validation and test sets are randomly selected as a “training” set, and a prediction threshold is found that maximizes Youden’s Index for each disease. This threshold is then used to binarize the outputs of all models across all 700 studies, as sketched below.
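- A minimal sketch of this thresholding step, assuming scikit-learn-style arrays (variable and function names are mine, not from the paper’s code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(val_labels, val_probs):
    """Pick the probability cutoff maximizing Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(val_labels, val_probs)
    return thresholds[np.argmax(tpr - fpr)]

def misclassification_labels(labels, probs, threshold):
    """Ground truth for the identifier: 1 wherever the binarized model
    prediction disagrees with the radiologist label."""
    preds = (probs >= threshold).astype(int)
    return (preds != labels).astype(int)

# Usage: fit the threshold on the 504 "training" studies for one disease,
# then binarize that model's outputs on all 700 studies.
# t = youden_threshold(train_labels, train_probs)
# y_misclf = misclassification_labels(all_labels, all_probs, t)
```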
2. Analytic Results
2.1. Clinical Features
Age is a significant predictor of misclassification for Atelectasis on five models, for Cardiomegaly on seven models, for Pleural Effusion on five models, for Consolidation on five models and for Edema on two models.
- An odds ratio greater than 1 indicates that as age increases, so does the incidence of misclassification.
The presence of lateral views is a significant predictor of misclassification by p-value for two models on Edema.
- An additional lateral view greatly decreases the rate of misclassification on those few models for which it is statistically significant.
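- To make the odds-ratio reading concrete, below is a sketch of one such per-model, per-task logistic regression on synthetic stand-in data (statsmodels is my tooling choice; the paper does not specify its implementation):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 700
# Synthetic stand-in for the studies of one (model, task) pair.
df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "sex": rng.integers(0, 2, n),
    "lateral_view": rng.integers(0, 2, n),
    "n_ap_views": rng.integers(0, 3, n),
    "n_pa_views": rng.integers(0, 3, n),
}).astype(float)
# Toy outcome loosely tied to age so the fit is non-degenerate.
df["misclassified"] = (rng.random(n) < 0.1 + 0.004 * (df["age"] - 20)).astype(int)

X = sm.add_constant(df[["age", "sex", "lateral_view", "n_ap_views", "n_pa_views"]])
fit = sm.Logit(df["misclassified"], X).fit(disp=0)

odds_ratios = np.exp(fit.params)   # OR > 1: odds of misclassification rise with the feature
or_ci = np.exp(fit.conf_int())     # 95% confidence interval for each OR
significant = fit.pvalues < 0.05   # the paper's p < 0.05 significance criterion
print(pd.DataFrame({"OR": odds_ratios, "significant": significant}))
```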
2.2. Other Features
- A procedure nearly identical to the clinical-features experiment is replicated to determine the association between the presence of other diseases and the misclassification of a given model on a given task.
The presence or absence of Support Devices is a statistically significant predictor of misclassification for three models detecting Cardiomegaly, for four models detecting Pleural Effusion, and for two models detecting Consolidation.
2.3. Model Outputs + Clinical Features
- LightGBM classifiers are trained with three different types of input data and their performance is evaluated on the task of predicting misclassification.
- The first is trained on clinical features only (“clinical only”); the second on clinical features and the model output value of the disease of interest (“same label”); and the third on clinical features and the model output of all diseases (“all labels”).
- The “clinical only” identifier performs the worst.
- The “same label” and “all labels” identifiers perform very similarly and score the highest AUROC across all diseases, with a mean of 0.881 and 0.880, respectively.
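- A sketch of how the three identifier variants could be trained and compared, with synthetic stand-ins for the real features (hyperparameters and variable names are illustrative, not the paper’s):

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 700                                  # combined validation + test studies
clinical = rng.random((n, 5))            # stand-in for the 5 clinical features
outputs = rng.random((n, 5))             # stand-in for model outputs on the 5 tasks
y = (rng.random(n) < 0.2).astype(int)    # stand-in misclassification labels

def identifier_auroc(X, y):
    """Train a LightGBM misclassification identifier and report its AUROC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

auc_clinical = identifier_auroc(clinical, y)                           # "clinical only"
auc_same = identifier_auroc(np.hstack([clinical, outputs[:, :1]]), y)  # "same label"
auc_all = identifier_auroc(np.hstack([clinical, outputs]), y)          # "all labels"
```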
3. Flipping Predicted Misclassifications Results
- Studies predicted as misclassified by the identifier have their outputs flipped, while all other predictions are kept unchanged (see the sketch below). The intuition is to:
1) flip more wrong model predictions than correct ones, and 2) increase the true positives of the disease predictions as much as possible.
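- A minimal sketch of the corrective flip, assuming a fitted identifier as in the previous section (the 0.5 flip threshold is my assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

def flip_predicted_misclassifications(preds, misclf_probs, flip_threshold=0.5):
    """Flip the binary disease prediction wherever the identifier predicts
    a misclassification; leave all other predictions unchanged."""
    flagged = misclf_probs > flip_threshold
    return np.where(flagged, 1 - preds, preds)

# Usage (hypothetical arrays): compare F1 before and after flipping.
# f1_before = f1_score(labels, preds)
# f1_after = f1_score(labels, flip_predicted_misclassifications(preds, misclf_probs))
```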
- When flipping with the “same label” misclassification identifier, we see a statistically significant F1 improvement on Consolidation.
Overall, the results suggest that we can improve model outputs by building misclassification predictors based on the logits and clinical features, and then following a corrective algorithm.
Reference
[2021 MLHC] [CheXbreak]
CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays
4.1. Biomedical Image Classification
2017 [ChestX-ray8] 2019 [CheXpert] 2020 [VGGNet for COVID-19] [Dermatology] [Deep-COVID] [Zeimarani ACCESS’20] 2021 [CheXternal] [CheXtransfer] [CheXbreak]