Review — TransBTS: Multimodal Brain Tumor Segmentation Using Transformer
3D-CNN for Encoder & Decoder, Transformer for Bottleneck
TransBTS: Multimodal Brain Tumor Segmentation Using Transformer (TransBTS), by University of Science and Technology, University of Central Florida, and Scoop Medical, 2021 MICCAI, Over 200 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net, Transformer
- TransBTS, based on an encoder-decoder structure, is proposed. It is the first work to exploit a Transformer within a 3D CNN for MRI brain tumor segmentation.
- To capture local 3D context information, the encoder first uses a 3D CNN to extract volumetric spatial feature maps. The feature maps are then reshaped into tokens that are fed into the Transformer for global feature modeling.
- The decoder leverages the features embedded by the Transformer and performs progressive upsampling to predict the detailed segmentation map.
1.1. 3D CNN Encoder
- Given an input MRI scan X of size C×H×W×D with a spatial resolution of H×W, a depth dimension of D (# of slices) and C channels (# of modalities), a 3D CNN is used to extract volumetric feature maps.
- 3×3×3 convolution blocks with downsampling (strided convolution with stride=2) are stacked to gradually encode the input images into a low-resolution/high-level feature representation F of size K×H/8×W/8×D/8 (K=128), i.e. 1/8 of the input size along each of H, W and D.
In this way, rich local 3D context features are effectively embedded in F. Then, F is fed into the Transformer encoder to further learn long-range correlations with a global receptive field.
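As a rough check of the sizes above, the following sketch traces the spatial shape through three stride-2 3×3×3 convolutions. The padding of 1 and the 4×128×128×128 example input are assumptions made for illustration; they are not stated in this review.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Output length along one axis after a single strided convolution.

    padding=1 is an assumption; it makes each stride-2 layer halve the axis.
    """
    return (n + 2 * padding - kernel) // stride + 1


def encoder_shape(c, h, w, d, k=128, num_down=3):
    """Shape of F after `num_down` stride-2 downsampling convolutions.

    `c` (number of modalities) only affects the first layer's input
    channels, so it does not appear in the output shape.
    """
    for _ in range(num_down):
        h, w, d = conv_out(h), conv_out(w), conv_out(d)
    return (k, h, w, d)


# Hypothetical input: 4 MRI modalities, 128x128x128 crop.
print(encoder_shape(4, 128, 128, 128))  # (128, 16, 16, 16)
```

Three halvings give the factor of 8 quoted for F, so K×H/8×W/8×D/8 becomes 128×16×16×16 for this example input.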
1.2. Transformer Encoder
- A linear projection (a 3×3×3 convolutional layer) is used to increase the channel dimension from K=128 to d=512.
- The spatial and depth dimensions are collapsed into one dimension, resulting in a d×N (N=H/8×W/8×D/8) feature map f, which can also be regarded as N d-dimensional tokens.
- The learnable position embeddings PE are added to the feature map f, creating the feature embeddings z0:
$$z_0 = f + \mathrm{PE}$$
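- The Transformer encoder is composed of L Transformer layers, each of which consists of a Multi-Head Attention (MHA) block and a Feed Forward Network (FFN). With layer normalization (LN) applied before each block and a residual connection around it, the output of the l-th Transformer layer can be calculated by:
$$z'_l = \mathrm{MHA}(\mathrm{LN}(z_{l-1})) + z_{l-1}$$
$$z_l = \mathrm{FFN}(\mathrm{LN}(z'_l)) + z'_l$$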
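The tokenization and position-embedding steps of this subsection can be illustrated with a small pure-Python sketch. It uses toy sizes (d=2 channels on a 2×2×2 grid) and a constant stand-in for the learnable PE; the function names and the row-major flattening order are assumptions chosen for illustration, not details taken from the paper.

```python
def flatten_to_tokens(feat):
    """feat[c][i][j][k] with shape d x h x w x depth -> N tokens of length d.

    Tokens are emitted in row-major order over (i, j, k); this ordering is
    an assumption, any fixed ordering works with learned position embeddings.
    """
    d = len(feat)
    h, w, dep = len(feat[0]), len(feat[0][0]), len(feat[0][0][0])
    return [
        [feat[c][i][j][k] for c in range(d)]
        for i in range(h) for j in range(w) for k in range(dep)
    ]


def add_position_embedding(tokens, pe):
    """Element-wise z0 = f + PE over N tokens of dimension d."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pe)]


# Toy feature map: value encodes (channel, position) so the layout is visible.
feat = [
    [[[c * 100 + i * 4 + j * 2 + k for k in range(2)]
      for j in range(2)] for i in range(2)]
    for c in range(2)
]
tokens = flatten_to_tokens(feat)      # 8 tokens, each 2-dimensional
pe = [[0.5] * 2 for _ in range(8)]    # stand-in for the learnable PE
z0 = add_position_embedding(tokens, pe)
print(len(z0), len(z0[0]))            # 8 2

# With an assumed 128x128x128 input, N = (128 // 8) ** 3.
print((128 // 8) ** 3)                # 4096
```

Each token gathers the d channel values at one location of the downsampled grid, so the Transformer sees one token per H/8×W/8×D/8 position.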
1.3. Network Decoder
- A 3D CNN decoder is used to perform feature upsampling and pixel-level segmentation.
- The output sequence of Transformer zL of size d×N is first reshaped to d×H/8×W/8×D/8.
- To reduce the computational complexity of the decoder, a convolution block is employed to reduce the channel dimension from d to K.
- The feature map Z, which has the same dimension as F in the feature encoding part, is obtained.
- After the feature mapping, cascaded upsampling operations and convolution blocks are applied to Z to gradually recover a full resolution segmentation result R.
- Skip connections are employed to fuse the encoder features with the decoder counterparts by concatenation for finer segmentation masks with richer spatial details.
- Softmax Dice loss is employed as the loss function, with weight decay for regularization.
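A minimal sketch of a softmax Dice loss is given below, written in plain Python over a flat list of voxels. The exact formulation used by TransBTS (for example squared versus plain sums in the denominator, or how classes are averaged) is not given in this review, so the choices here, including the `eps` smoothing term, are assumptions. Weight decay is normally applied by the optimizer and is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one voxel's class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def softmax_dice_loss(logits, labels, num_classes, eps=1e-5):
    """1 - mean per-class soft Dice.

    logits: list of per-voxel score lists (length num_classes each)
    labels: list of integer class indices, one per voxel
    """
    probs = [softmax(v) for v in logits]
    dice_sum = 0.0
    for c in range(num_classes):
        p = [pv[c] for pv in probs]
        g = [1.0 if y == c else 0.0 for y in labels]
        inter = sum(pi * gi for pi, gi in zip(p, g))
        dice_sum += (2.0 * inter + eps) / (sum(p) + sum(g) + eps)
    return 1.0 - dice_sum / num_classes


# Two voxels, two classes: confident-and-correct vs confident-and-wrong.
good = softmax_dice_loss([[10.0, -10.0], [-10.0, 10.0]], [0, 1], 2)
bad = softmax_dice_loss([[10.0, -10.0], [-10.0, 10.0]], [1, 0], 2)
print(good < bad)  # True
```

The loss approaches 0 when the softmax output overlaps the ground truth for every class and approaches 1 when there is no overlap.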
1.4. Differences From TransUNet
2.1. SOTA Comparisons
- TTA: Test-Time Augmentation, used to further improve performance.
In five-fold cross-validation on the BraTS 2019 training set, TransBTS achieves average Dice scores of 78.69%, 90.98% and 82.85% for enhancing tumor (ET), whole tumor (WT) and tumor core (TC), respectively.
On the BraTS 2019 validation set, TransBTS with TTA achieves Dice scores of 78.93%, 90.00% and 81.94% on ET, WT and TC, respectively, which are comparable to or higher than those of previous SOTA 3D methods.
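Test-time augmentation can be sketched as averaging predictions over transformed copies of the input. This review does not say which augmentations TransBTS uses; the single-axis flip below is an assumption, and `toy_model` is a hypothetical stand-in for a trained network operating on a 1D list.

```python
def tta_predict(model, volume):
    """Average the prediction on the input and on its flipped copy.

    The flipped prediction is flipped back so both are aligned before
    averaging. Flip-only TTA along one axis is an assumed choice.
    """
    pred = model(volume)
    flipped_pred = model(volume[::-1])[::-1]
    return [(a + b) / 2 for a, b in zip(pred, flipped_pred)]


def toy_model(volume):
    """Hypothetical model with a position-dependent bias (not flip-equivariant)."""
    return [x + 0.1 * i for i, x in enumerate(volume)]


print(tta_predict(toy_model, [0.0, 1.0, 2.0]))
```

Because the toy model's bias depends on position, the plain and flipped predictions differ, and averaging them evens out that bias across positions.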
- On the BraTS 2020 validation set, TransBTS achieves Dice scores of 78.73%, 90.09%, 81.73% and Hausdorff distances (HD) of 17.947mm, 4.964mm, 9.769mm on ET, WT, TC.
Compared with 3D U-Net, V-Net and Residual 3D U-Net, TransBTS improves significantly on both metrics.
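For reference, the per-region Dice scores quoted above can be computed as in the sketch below. It assumes the usual BraTS label convention (1 = necrotic/non-enhancing core, 2 = edema, 4 = enhancing tumor), which is not spelled out in this review, and operates on flat lists of voxel labels. The Hausdorff distance is not covered here.

```python
# Assumed BraTS label convention: regions are unions of voxel labels.
REGIONS = {"ET": {4}, "TC": {1, 4}, "WT": {1, 2, 4}}


def dice_score(pred, target, region_labels):
    """Dice = 2|P ∩ G| / (|P| + |G|) for one tumor region."""
    p = [y in region_labels for y in pred]
    g = [y in region_labels for y in target]
    inter = sum(a and b for a, b in zip(p, g))
    total = sum(p) + sum(g)
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


pred = [0, 1, 2, 4, 4, 0]
target = [0, 1, 2, 2, 4, 4]
scores = {name: round(dice_score(pred, target, labels), 3)
          for name, labels in REGIONS.items()}
print(scores)  # {'ET': 0.5, 'TC': 0.667, 'WT': 0.889}
```

Because WT contains TC, which contains ET, a single mislabeled voxel can affect the three scores differently, as the toy example shows.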
2.2. Qualitative Results
TransBTS delineates brain tumors more accurately and generates better segmentation masks by modeling long-range dependencies across the volume.
2.3. Ablation Study
Increasing the token sequence length, by adjusting the output stride (OS) from 16 to 8, leads to a significant improvement in performance.
When the OS drops further from 8 to 4 without a corresponding increase in sequence length, the performance does not improve and can even get worse.
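To give a feel for the scale involved, the sketch below computes the sequence length for each output stride under two assumptions that are not stated in this review: a cubic 128×128×128 input and one token per feature-map location. It also prints N², the number of token pairs that self-attention compares.

```python
def seq_len(side, output_stride):
    """Tokens for a cubic input when each feature-map location is one token."""
    return (side // output_stride) ** 3


# Assumed 128^3 input; N*N is the size of the attention matrix per head.
for os_ in (16, 8, 4):
    n = seq_len(128, os_)
    print(f"OS={os_:2d}  N={n:6d}  N^2={n * n}")
```

Under these assumptions, going from OS=16 to OS=8 raises N from 512 to 4096. One token per location at OS=4 would give 32768 tokens and 64 times as many attention pairs as OS=8, whereas the ablation above keeps the sequence length unchanged at OS=4.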
The setting d=512 and L=4 achieves the best scores for ET and WT.
L=4 is a “sweet spot” for the Transformer in terms of performance and complexity.
- As one variant, skip connections are attached to the first three Transformer layers.
Following the traditional U-Net design of skip connections, considerable gains (3.96% and 1.23%) are achieved for the important ET and TC, thanks to the recovery of low-level spatial detail.