Review — Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis
Swin UNETR with Self-Supervised Learning
Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis,
Self-Supervised Swin UNETR, by Vanderbilt University and NVIDIA,
2022 CVPR, Over 70 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net, Transformer, Vision Transformer, ViT
- 3D Swin UNEt TRansformers (Swin UNETR) is proposed with a hierarchical encoder for self-supervised pre-training.
- Tailored proxy tasks are designed for learning the underlying patterns of human anatomy.
Outline
- Brief Review of Swin UNETR
- Self-Supervised Swin UNETR
- Results
1. Brief Review of Swin UNETR
- (Please skip this part if you know Swin UNETR well, or please read Swin UNETR for more details.)
1.1. Input
- The Swin UNETR creates non-overlapping patches of the input data and uses a patch partition layer to create windows with a desired size for computing the self-attention.
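As a rough illustration, the patch partition can be realized as a strided 3D convolution. The sketch below assumes the paper's 2×2×2 patch size and 48-dimensional embedding; the class name and shapes are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch: a 3D patch partition as a strided Conv3d, as commonly
# done for Swin-style encoders. Patch size 2x2x2 and embed_dim=48 follow
# the Swin UNETR paper; names here are illustrative.
class PatchPartition3D(nn.Module):
    def __init__(self, in_channels=1, embed_dim=48, patch_size=2):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):             # x: (B, 1, H, W, D)
        return self.proj(x)           # (B, C, H/2, W/2, D/2)

tokens = PatchPartition3D()(torch.randn(1, 1, 96, 96, 96))
print(tokens.shape)                   # torch.Size([1, 48, 48, 48, 48])
```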
1.2. Encoder
- The Swin UNETR encoder has 4 stages, each comprising 2 Transformer blocks.
- The two Transformer blocks use W-MSA and SW-MSA, which refer to the regular and shifted window partitioning multi-head self-attention modules, respectively.
- (Please feel free to read about Swin Transformer for the shifted-window self-attention.)
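As a minimal sketch of the W-MSA / SW-MSA alternation (assuming the standard Swin formulation, with illustrative names): self-attention is computed inside local 3D windows, and every other block cyclically shifts the volume by half a window so that neighboring windows exchange information.

```python
import torch

def window_partition_3d(x, window_size):
    # x: (B, H, W, D, C) -> (num_windows*B, ws, ws, ws, C)
    B, H, W, D, C = x.shape
    ws = window_size
    x = x.view(B, H // ws, ws, W // ws, ws, D // ws, ws, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, ws, ws, ws, C)

def shift_for_sw_msa(x, window_size):
    # SW-MSA: roll by half a window before partitioning
    # (the shift is reversed after attention)
    s = window_size // 2
    return torch.roll(x, shifts=(-s, -s, -s), dims=(1, 2, 3))
```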
1.3. Decoder
- The encoded feature representations in the Swin Transformer are fed to a CNN decoder via skip connections at multiple resolutions.
- At each stage, the output feature representations are reshaped and fed into a residual block comprising two 3×3×3 convolutional layers that are normalized by instance normalization.
- The resolution of the feature maps is increased by a factor of 2 using a deconvolutional layer, and the outputs are concatenated with the outputs of the previous stage.
1.4. Output
- The final segmentation outputs are computed by using a 1×1×1 convolutional layer and a sigmoid activation function.
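A minimal sketch of one decoder stage and the output head, following the description above; channel sizes (e.g., 14 output classes for BTCV) and all names are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

# Residual block of two 3x3x3 convolutions with instance normalization,
# as described above; a 1x1x1 projection matches channels for the residual.
class ResBlock3D(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, padding=1)
        self.norm1 = nn.InstanceNorm3d(out_ch)
        self.norm2 = nn.InstanceNorm3d(out_ch)
        self.proj = nn.Conv3d(in_ch, out_ch, 1)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        h = self.act(self.norm1(self.conv1(x)))
        h = self.norm2(self.conv2(h))
        return self.act(h + self.proj(x))

# One decoder stage: deconvolution x2 upsampling, concatenation with the
# skip connection, then the residual block.
class DecoderStage(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.res = ResBlock3D(out_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                            # x2 spatial resolution
        return self.res(torch.cat([x, skip], dim=1))

# Output head: 1x1x1 convolution + sigmoid (14 classes is illustrative)
seg_head = nn.Sequential(nn.Conv3d(48, 14, kernel_size=1), nn.Sigmoid())
```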
2. Self-Supervised Swin UNETR
- Input CT images are randomly cropped into sub-volumes and augmented with random inner cutout and rotation, then fed to the Swin UNETR encoder as input.
- Masked volume inpainting, contrastive learning, and rotation prediction are used as proxy tasks for learning contextual representations of input images.
2.1. Masked Volume Inpainting
- The cutout augmentation randomly masks out ROIs in the sub-volume X with a volume ratio of s.
- A transpose convolution layer is attached to the encoder as the reconstruction head, and its output is denoted X̂_M.
- The reconstruction objective is defined by an L1 loss between X and X̂_M: L_inpaint = ‖X̂_M − X‖₁.
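A minimal sketch of this objective, assuming a hypothetical `encoder` and `recon_head` (transpose-conv reconstruction head) whose output shape matches the input sub-volume; the cutout ratio and box sizes are illustrative defaults.

```python
import torch
import torch.nn.functional as F

# Randomly zero out cuboid ROIs covering roughly a ratio `s` of the volume.
def random_cutout(x, s=0.3, max_boxes=8):
    x = x.clone()
    B, _, H, W, D = x.shape
    target = s * H * W * D
    for b in range(B):
        masked = 0
        for _ in range(max_boxes):
            if masked >= target:
                break
            h = torch.randint(H // 8, H // 2, (1,)).item()
            w = torch.randint(W // 8, W // 2, (1,)).item()
            d = torch.randint(D // 8, D // 2, (1,)).item()
            i = torch.randint(0, H - h, (1,)).item()
            j = torch.randint(0, W - w, (1,)).item()
            k = torch.randint(0, D - d, (1,)).item()
            x[b, :, i:i+h, j:j+w, k:k+d] = 0
            masked += h * w * d
    return x

def inpainting_loss(encoder, recon_head, x):
    x_masked = random_cutout(x)
    x_hat = recon_head(encoder(x_masked))   # X_hat_M (shape assumed to match x)
    return F.l1_loss(x_hat, x)              # L1 reconstruction objective
```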
2.2. Image Rotation
- For simplicity, R = 4 rotation classes (0°, 90°, 180°, 270°) are employed along the z-axis.
- An MLP classification head is used for predicting the softmax probabilities ŷ_r of the rotation categories.
- Given the ground truth y_r, a cross-entropy loss is used for the rotation prediction task: L_rot = −Σ_r y_r log(ŷ_r).
- The 3D rotation and cutout also serve simultaneously as augmentation transformations for contrastive learning.
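A minimal sketch of the rotation proxy task, assuming a hypothetical `encoder` that returns a pooled feature vector and a 4-way MLP `rot_head`:

```python
import torch
import torch.nn.functional as F

def rotate_z(x, k):
    # x: (B, C, H, W, D); rotate k*90 degrees in the axial (H, W) plane
    return torch.rot90(x, k, dims=(2, 3))

def rotation_loss(encoder, rot_head, x):
    # Sample a random rotation class per sample (ground truth y_r)
    k = torch.randint(0, 4, (x.shape[0],))
    x_rot = torch.stack([rotate_z(x[i:i+1], int(k[i]))[0]
                         for i in range(x.shape[0])])
    logits = rot_head(encoder(x_rot))        # y_hat_r (pre-softmax)
    return F.cross_entropy(logits, k)        # cross-entropy over R=4 classes
```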
2.3. Contrastive Coding
- Given a batch of augmented sub-volumes, contrastive coding enables better representation learning by maximizing the mutual information between positive pairs (augmented samples from the same sub-volume) while minimizing it between negative pairs (views from different sub-volumes).
- Similar to SimCLR, the 3D contrastive coding loss between a pair of embeddings v_i and v_j is defined as: L_contrast(v_i, v_j) = −log [ exp(sim(v_i, v_j)/t) / Σ_{k≠i} exp(sim(v_i, v_k)/t) ], where sim(·,·) is the cosine similarity and t is the temperature.
- The contrastive learning loss function strengthens the intra-class compactness as well as the inter-class separability.
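A minimal sketch of this SimCLR-style (NT-Xent) loss; the temperature value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

# v_i and v_j are projected embeddings of two augmented views of the same
# batch of sub-volumes; matching rows are positives, all others negatives.
def contrastive_loss(vi, vj, t=0.5):
    B = vi.shape[0]
    z = F.normalize(torch.cat([vi, vj], dim=0), dim=1)   # (2B, dim)
    sim = z @ z.t() / t                                  # cosine similarities
    sim.fill_diagonal_(float('-inf'))                    # exclude self-pairs
    # the positive of sample k is its other view, at index (k + B) mod 2B
    targets = torch.arange(2 * B, device=vi.device)
    targets = (targets + B) % (2 * B)
    return F.cross_entropy(sim, targets)
```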
2.4. Total Loss
- The total loss is: L_total = λ1·L_inpaint + λ2·L_contrast + λ3·L_rot.
- A grid-search hyper-parameter optimization estimated the optimal values as λ1 = λ2 = λ3 = 1.
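For completeness, a tiny sketch of combining the three objectives with these weights, reusing the hypothetical loss functions and heads sketched above:

```python
# lam1 = lam2 = lam3 = 1.0 per the grid search reported in the paper.
lam1 = lam2 = lam3 = 1.0

def total_loss(encoder, heads, x, vi, vj):
    return (lam1 * inpainting_loss(encoder, heads['recon'], x)
            + lam2 * contrastive_loss(vi, vj)
            + lam3 * rotation_loss(encoder, heads['rot'], x))
```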
3. Results
3.1. BTCV Multi-organ Segmentation Challenge
Compared with other top submissions, the proposed Swin UNETR achieves the best performance.
The representative samples demonstrate that Swin UNETR successfully identifies organ details.
3.2. Segmentation Results on MSD
Overall, Swin UNETR presents the best average Dice of 78.68% across all ten tasks and achieves the top ranking in the MSD leaderboard.
- The detailed numbers for each task are shown above.
The proposed Swin UNETR achieves state-of-the-art performance in Task01 BrainTumour, Task06 Lung, Task07 Pancreas, and Task10 Colon.
The results are comparable for Task02 Heart, Task03 Liver, Task04 Hippocampus, Task05 Prostate, Task08 HepaticVessel and Task09 Spleen.
Swin UNETR with self-supervised pre-training demonstrates visually better segmentation results in the CT tasks.
3.3. Ablation Study
- A comparison of all MSD CT tasks using the pre-trained model against training from scratch can be observed above.
Distinct improvements can be observed for Task03 Liver (Dice of 77.77% compared to 75.27%) and Task08 HepaticVessel (68.52% against 64.63%). Task10 Colon shows the largest improvement, from 34.83% to 43.38%.
The proposed approach can reduce the annotation effort by at least 40% for BTCV task.
- The fine-tuning results are obtained from pre-training on 100, 3,000, and 5,000 scans.
The proposed model benefits from larger pre-training datasets, improving as the amount of unlabeled data increases.
- On the BTCV test set, using pre-trained weights from inpainting achieves the highest improvement among single proxy tasks.
Overall, employing all proxy tasks achieves the best Dice of 84.72%.