Review — LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation
LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation,
LeViT-UNet, by Wuhan Institute of Technology and Huazhong University of Science and Technology,
2021 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)
- LeViT-UNet is proposed, which integrates a LeViT Transformer module into the U-Net architecture, for fast and accurate medical image segmentation.
- Specifically, LeViT is used as the encoder of the LeViT-UNet, which better trades off the accuracy and efficiency of the Transformer block.
- Moreover, multi-scale feature maps from Transformer blocks and convolutional blocks of LeViT are passed into the decoder via skip-connection, which can effectively reuse the spatial information of the feature maps.
1.1. Overall Architecture
1.2. LeViT as Encoder
- The LeViT encoder consists of two main components: convolutional blocks and Transformer blocks.
- Specifically, the convolutional blocks contain 4 layers of 3×3 convolutions with stride 2, each of which halves the resolution.
- Depending on the number of channels fed into the first Transformer block, three types of LeViT encoder are designed, named LeViT-128s, LeViT-192 and LeViT-384, respectively.
- The features from the convolution layers and the Transformer blocks are concatenated in the last stage of the encoder, so that both local and global features at various scales are leveraged.
- The Transformer block can be formulated as MLP then MSA:
- where self-attention is computed as follows:
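A hedged LaTeX sketch of the two formulas, assuming each sub-layer is wrapped in a residual connection with batch normalization (BN) applied first, and that attention carries a learned bias term as in LeViT. The symbols $z^{n}$, $\hat{z}^{n}$, $d$ and $B$ are assumed notation, not taken from the text above:

```latex
\begin{align}
\hat{z}^{n} &= \mathrm{MLP}\big(\mathrm{BN}(z^{n-1})\big) + z^{n-1}, \\
z^{n} &= \mathrm{MSA}\big(\mathrm{BN}(\hat{z}^{n})\big) + \hat{z}^{n}, \\
\mathrm{Attention}(Q, K, V) &= \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) V
\end{align}
```

Here $z^{n-1}$ is the input to the $n$-th block, $Q$, $K$, $V$ are the query, key and value matrices, $d$ is the key dimension, and $B$ is the assumed attention bias.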
1.3. CNNs as Decoder
- Similar to U-Net, the features from the encoder are concatenated with the decoder features via skip connections. A cascaded upsampling strategy is used to recover the resolution from the previous layer using CNNs.
- Each block consists of two 3×3 convolution layers, a batch normalization layer, a ReLU layer, and an upsampling layer.
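To make the resolution bookkeeping concrete, here is a minimal Python sketch that traces the spatial size through the four stride-2 3×3 convolutions of the encoder and back up through the cascaded decoder. The padding of 1, the 224-pixel input, and the choice of four 2× decoder blocks are assumptions for illustration; the function names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula).

    padding=1 is an assumption; the review only states 3x3 kernels, stride 2.
    """
    return (size + 2 * padding - kernel) // stride + 1


def encoder_stem_sizes(size, num_layers=4):
    """Sizes after each stride-2 3x3 convolution in the convolutional blocks."""
    sizes = []
    for _ in range(num_layers):
        size = conv_out(size)
        sizes.append(size)
    return sizes


def decoder_sizes(size, num_blocks=4, scale=2):
    """Sizes after each decoder block, assuming every block upsamples by 2x."""
    sizes = []
    for _ in range(num_blocks):
        size *= scale
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    down = encoder_stem_sizes(224)  # 224 is an assumed input side length
    up = decoder_sizes(down[-1])
    print(down)  # [112, 56, 28, 14]
    print(up)    # [28, 56, 112, 224]
```

Under these assumptions the convolutional blocks reduce the input to 1/16 of its side length, and the decoder resolutions line up with the encoder ones, which is what allows feature maps to be concatenated across skip connections.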
2.1. SOTA Comparisons on Synapse
LeViT-UNet-384 achieves the best average HD of 16.84 mm, an improvement of about 14.8 mm and 4.7 mm over recent SOTA methods.
Compared with Transformer-based methods such as TransUNet and Swin-Unet, and convolution-based methods such as U-Net and Attention U-Net (Att-UNet), LeViT-UNet still achieves competitive results in terms of DSC.
While the other three methods are more likely to under-segment or over-segment the organs, the LeViT-UNet outputs are relatively smoother, which indicates an advantage in boundary prediction.
2.3. Fast Segmentation Comparisons
LeViT-UNet-384 achieves 78.53% mDSC and 16.84 mm mHD, which is the best among all methods. It is also much faster than TransUNet.
ENet (141 fps) and FPENet (160 fps) are slightly faster than LeViT-UNet-128s (114 fps), yet their HD still needs improvement. Therefore, it can be concluded that LeViT-UNet is competitive with current efficient pure-CNN segmentation methods while delivering better performance.
2.4. Ablation Study
- The “1-skip” setting applies a single skip-connection at the 1/2 resolution scale. “2-skip”, “3-skip” and “4-skip” additionally insert skip-connections at the 1/4, 1/8 and 1/16 scales, respectively.
- Adding more skip-connections results in better performance. Moreover, the performance gain on smaller organs is more pronounced.
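The skip settings can be written out as a small lookup. This is a sketch that assumes a cumulative reading (each setting keeps the skips of the previous one and adds the next coarser scale) and a square 224-pixel input; both are assumptions, and the function names are hypothetical.

```python
from fractions import Fraction

# Resolution scales named in the ablation, ordered from finest to coarsest.
SKIP_SCALES = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)]


def skip_scales(n_skip):
    """Scales carrying a skip-connection in the "n-skip" setting (cumulative reading)."""
    if not 0 <= n_skip <= len(SKIP_SCALES):
        raise ValueError("n_skip must be between 0 and 4")
    return SKIP_SCALES[:n_skip]


def skip_feature_sizes(n_skip, input_size=224):
    """Side length of each skipped feature map for a square input (assumed 224)."""
    return [int(input_size * s) for s in skip_scales(n_skip)]


if __name__ == "__main__":
    for n in range(1, 5):
        print(f"{n}-skip:", [str(s) for s in skip_scales(n)], skip_feature_sizes(n))
```

For example, under these assumptions “3-skip” connects feature maps of side 112, 56 and 28.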
LeViT-UNet-128s and LeViT-UNet-192 achieve higher DSC without pre-training. For LeViT-UNet-384, however, pre-training is helpful.