Review — Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images
Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images,
Swin UNETR, by NVIDIA, and Vanderbilt University,
2021 BrainLes, Over 70 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net, Swin Transformer, UNETR
- Swin UNEt TRansformers (Swin UNETR) is proposed, in which the task of 3D brain tumor semantic segmentation is reformulated as a sequence-to-sequence prediction problem: multi-modal input data is projected into a 1D sequence of embeddings and used as input to a hierarchical Swin Transformer encoder.
- The Swin Transformer encoder extracts features at five different resolutions by using shifted windows to compute self-attention, and is connected to a CNN-based decoder at each resolution via skip connections.
1. Swin Transformer Block
- Input (Left): The input X to the Swin UNETR model has size H×W×D×S, where S is the number of input channels.
- A patch partition layer is used to create a sequence of 3D tokens and project them into an embedding space of dimension C.
- W-MSA and SW-MSA are the regular and shifted window partitioning multi-head self-attention modules, respectively.
- W-MSA (Middle): At layer l, the 3D tokens are partitioned into non-overlapping windows, and self-attention is computed within each window.
- SW-MSA (Right): At layer l+1, the partitioning windows are shifted before self-attention is computed.
- (Please feel free to read about Swin Transformer for the shifted window self attention.)
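To make the difference between the two partitions concrete, here is a minimal pure-Python sketch. It assumes a cubic window of edge M, a cyclic shift of M // 2, and a token grid divisible by M; the function name `assign_windows` and the toy 4×4×4 grid are illustrative only and are not taken from the paper's code. The attention masking that a real implementation applies to tokens wrapped around by the cyclic shift is not modeled.

```python
from itertools import product


def assign_windows(grid, window, shift=0):
    """Map every 3D token coordinate to the index of the window it falls into.

    grid   : (H', W', D') size of the token grid (assumed divisible by `window`)
    window : window edge length M
    shift  : cyclic shift applied before partitioning (0 for the regular partition)
    """
    assignment = {}
    for coord in product(*(range(n) for n in grid)):
        # Cyclically shift the token position, then find the window it lands in.
        shifted = tuple((c - shift) % n for c, n in zip(coord, grid))
        assignment[coord] = tuple(s // window for s in shifted)
    return assignment


grid, M = (4, 4, 4), 2
w_msa = assign_windows(grid, M)                  # layer l: regular partition
sw_msa = assign_windows(grid, M, shift=M // 2)   # layer l+1: shifted partition

a, b = (1, 1, 1), (2, 2, 2)    # neighbours on opposite sides of a window border
print(w_msa[a] == w_msa[b])    # False: separate windows, no attention between them
print(sw_msa[a] == sw_msa[b])  # True: the shift places them in the same window
```

Alternating the two partitions is what lets information flow across window borders while each attention computation stays local to a window.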
2. Swin UNEt TRansformers (Swin UNETR)
- The Swin UNETR creates non-overlapping patches of the input data and uses a patch partition layer to create windows with a desired size for computing the self-attention.
- The Swin UNETR encoder has a patch size of 2×2×2 and a feature dimension of 2×2×2×4 = 32, taking into account the multi-modal MRI images with 4 channels. The size of the embedding space C is 48.
- The Swin UNETR encoder has 4 stages, each comprising 2 transformer blocks. Hence, the total number of layers in the encoder is L=8.
- A patch merging layer is utilized to decrease the resolution of feature representations by a factor of 2 at the end of each stage. In addition, it groups patches with resolution 2×2×2 and concatenates them, resulting in a 4C-dimensional feature embedding.
- The feature size of the representations is subsequently reduced to 2C with a linear layer.
- The encoded feature representations from the Swin Transformer are fed to a CNN decoder via skip connections at multiple resolutions.
- At each stage, the output feature representations are reshaped and fed into a residual block comprising two 3×3×3 convolutional layers that are normalized by instance normalization.
- The resolution of the feature maps is increased by a factor of 2 using a deconvolutional layer, and the outputs are concatenated with the outputs of the previous stage.
- The final segmentation output consists of 3 channels corresponding to the enhancing tumor (ET), whole tumor (WT) and tumor core (TC) sub-regions.
- The final segmentation outputs are computed by using a 1×1×1 convolutional layer and a sigmoid activation function.
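The encoder bookkeeping described above (patch size 2, 4 input channels, C = 48, and a halving of resolution with a doubling of channels at the end of each of the 4 stages) can be traced with a short sketch. The function name `encoder_shapes` and the 128×128×128 input crop are assumptions chosen for illustration; the sketch only tracks shapes and does not implement attention or the decoder.

```python
def encoder_shapes(input_size, in_channels=4, patch=2, embed_dim=48, stages=4):
    """Return the raw patch feature dimension and the (resolution, channels)
    of the feature map after patch partition and after each encoder stage."""
    # Each patch flattens patch^3 voxels across all input channels.
    patch_dim = patch ** 3 * in_channels

    # Patch partition: resolution drops by the patch size, features map to C.
    resolution = tuple(n // patch for n in input_size)
    channels = embed_dim
    shapes = [(resolution, channels)]

    # Patch merging at the end of each stage: resolution / 2, channels x 2.
    for _ in range(stages):
        resolution = tuple(n // 2 for n in resolution)
        channels *= 2
        shapes.append((resolution, channels))
    return patch_dim, shapes


patch_dim, shapes = encoder_shapes((128, 128, 128))
print(patch_dim)  # 32, i.e. 2 x 2 x 2 x 4
for resolution, channels in shapes:
    print(resolution, channels)
# (64, 64, 64) 48
# (32, 32, 32) 96
# (16, 16, 16) 192
# (8, 8, 8) 384
# (4, 4, 4) 768
```

The five printed feature maps correspond to the five resolutions at which the encoder is connected to the decoder through skip connections.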
2.4. Loss Function
- Soft Dice loss is used, computed voxel-wise between the predicted probabilities and the one-hot encoded ground truth.
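As a rough illustration, a commonly used form of the soft Dice loss averages, over the J output classes, twice the voxel-wise product of prediction and ground truth divided by the sum of their squared values, and subtracts that average from 1. The sketch below assumes this squared-denominator form; the function name `soft_dice_loss`, the nested-list inputs, and the small `eps` stabilizer are illustrative assumptions and not the authors' implementation.

```python
def soft_dice_loss(pred, target, eps=1e-8):
    """Soft Dice loss averaged over classes.

    pred   : list of J lists, predicted probabilities per voxel for each class
    target : list of J lists, one-hot ground truth per voxel for each class
    eps    : assumed stabilizer that avoids division by zero for empty classes
    """
    num_classes = len(pred)
    dice_sum = 0.0
    for y, g in zip(pred, target):
        overlap = sum(yi * gi for yi, gi in zip(y, g))
        denom = sum(yi * yi for yi in y) + sum(gi * gi for gi in g)
        dice_sum += 2.0 * overlap / (denom + eps)
    return 1.0 - dice_sum / num_classes


# Toy example with 3 classes (e.g. ET, WT, TC) over 4 voxels.
target = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
perfect = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 0.0]]
uncertain = [[0.5] * 4 for _ in range(3)]

print(soft_dice_loss(perfect, target) < 1e-6)                           # True
print(soft_dice_loss(uncertain, target) > soft_dice_loss(perfect, target))  # True
```

Because each class is scored independently, this form pairs naturally with per-channel sigmoid outputs for overlapping sub-regions.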
The proposed Swin UNETR model outperforms all competing approaches, including nnU-Net, across all 5 folds and on average for all semantic classes (ET, WT and TC).
The proposed benchmarks (Team: NVOptNet) rank among the top methodologies across more than 2000 submissions during the validation phase, making Swin UNETR the first transformer-based model to place competitively in BraTS challenges.
On the testing set, the segmentation performance for ET and WT is very similar to that of the validation benchmarks. However, the segmentation performance for TC decreases by 0.9%.
Consistent with quantitative benchmarks, the segmentation outputs are well-delineated for all three sub-regions.