Review — LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation

LeViT-UNet, LeViT as Encoder, CNN as Decoder

Sik-Ho Tsang
4 min read · Apr 6, 2023

LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation,
LeViT-UNet, by Wuhan Institute of Technology, and Huazhong University of Science and Technology,
2021 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)

Biomedical Image Segmentation
2015 … 2022
[UNETR] [Half-UNet] [BUSIS] [RCA-IUNet] 2023 [DCSAU-Net]

  • LeViT-UNet is proposed, which integrates a LeViT Transformer module into the U-Net architecture, for fast and accurate medical image segmentation.
  • Specifically, LeViT is used as the encoder of the LeViT-UNet, which better trades off the accuracy and efficiency of the Transformer block.
  • Moreover, multi-scale feature maps from Transformer blocks and convolutional blocks of LeViT are passed into the decoder via skip-connection, which can effectively reuse the spatial information of the feature maps.


  1. LeViT-UNet
  2. Results

1. LeViT-UNet

1.1. Overall Architecture

The architecture of LeViT-UNet, which is composed of an encoder (LeViT blocks), a decoder, and skip connections. Here, the encoder is built on the LeViT module.

The LeViT module is applied in the encoder to extract features, and the decoder is kept the same as in U-Net.

1.2. LeViT as Encoder

Block diagram of LeViT-192 architecture. A sampling is applied before transformation in the second and third Trans-Block, respectively.
  • It consists of two main components: convolutional blocks and Transformer blocks.
  • Specifically, the convolutional blocks contain 4 layers of 3×3 convolutions with stride 2, which perform the resolution reduction.
  • Depending on the number of channels fed into the first Transformer block, three types of LeViT encoder are designed, named LeViT-128s, LeViT-192 and LeViT-384, respectively.
  • The features from convolution layers and Transformer blocks are concatenated in the last stage of the encoder, which could fully leverage the local and global features in various scales.
  • The Transformer block can be formulated as MLP then MSA:
  • where self-attention is computed as follows:
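The two formulas referred to above are not reproduced in this text. As a hedged reconstruction only, assuming residual connections, batch normalization (BN) before each sub-layer, the MLP-then-MSA order stated above, and a learned attention bias B in the style of LeViT, they might be written as follows. The symbols z^n, ẑ^n, d and B are notation assumed here, not taken from this post.

```latex
\hat{z}^{n} = \mathrm{MLP}\big(\mathrm{BN}(z^{n-1})\big) + z^{n-1}, \qquad
z^{n} = \mathrm{MSA}\big(\mathrm{BN}(\hat{z}^{n})\big) + \hat{z}^{n}

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) V
```

Here z^{n-1} and z^n would be the input and output of the n-th block, Q, K and V the query, key and value matrices, and d the key dimension.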

1.3. CNNs as Decoder

  • Similar to U-Net, the features from the encoder are concatenated with skip connection. The cascaded upsampling strategy is used to recover the resolution from the previous layer using CNNs.
  • Each block consists of two 3×3 convolution layers, a batch normalization layer, a ReLU layer, and an upsampling layer.
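To make the resolution bookkeeping concrete, here is a minimal sketch that traces spatial sizes through the four stride-2 3×3 convolutions of the encoder and back up through cascaded ×2 upsampling. The 224×224 input, the padding of 1, and the ×2 factor per decoder block are assumptions for illustration; the function names are hypothetical, and the extra subsampling inside the Transformer stages is ignored.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size=224, num_convs=4):
    """Sizes after each stride-2 3x3 convolution (padding=1 is assumed)."""
    sizes = [input_size]
    for _ in range(num_convs):
        sizes.append(conv_out(sizes[-1]))
    return sizes


def decoder_sizes(start_size, num_upsamples=4, scale=2):
    """Sizes after each cascaded upsampling step (x2 per block is assumed)."""
    sizes = [start_size]
    for _ in range(num_upsamples):
        sizes.append(sizes[-1] * scale)
    return sizes


enc = encoder_sizes()
dec = decoder_sizes(enc[-1])
print(enc)  # [224, 112, 56, 28, 14]
print(dec)  # [14, 28, 56, 112, 224]
```

Under these assumptions the convolutional blocks reduce the input to 1/16 of its resolution, and the intermediate sizes (1/2, 1/4, 1/8) are the natural places for skip connections to meet decoder features of matching size.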

2. Results

2.1. SOTA Comparisons on Synapse

Segmentation accuracy (average DSC% and average HD in mm, and DSC for each organ) of different methods on the Synapse multi-organ CT dataset.

LeViT-UNet-384 achieves the best average HD, 16.84 mm, an improvement of about 14.8 mm and 4.7 mm over the recent SOTA methods.

Compared with Transformer-based methods such as TransUNet and Swin-Unet, and with convolution-based methods such as U-Net and Attention U-Net (Att-UNet), LeViT-UNet still achieves competitive results in terms of DSC.
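For readers unfamiliar with the two metrics, the following is a minimal sketch of the Dice similarity coefficient (DSC) and a plain symmetric Hausdorff distance (HD) on tiny 2D binary masks. It is illustrative only: the helper names are hypothetical, distances are in pixel units (multiplying by the voxel spacing would give mm), and the evaluation behind the reported numbers may use a different HD variant, such as a percentile-based one.

```python
import math


def foreground(mask):
    """Set of (row, col) coordinates where the binary mask is non-zero."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred, truth):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not pred and not truth:
        return 1.0  # both empty: treated as perfect agreement (a convention)
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))


truth = foreground([[0, 1, 1, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0]])
pred = foreground([[0, 1, 1, 1],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]])

print(dice(pred, truth))       # 0.75
print(hausdorff(pred, truth))  # 1.0
```

DSC rewards overlap and is higher when better, while HD measures the worst boundary disagreement and is lower when better, which is why a method can be competitive on one and clearly ahead on the other.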

2.2. Visualizations

Qualitative comparison of various methods by visualization. From left to right: Ground Truth, LeViT-UNet-384, TransUNet, U-Net, and DeepLabv3+.

While the other three methods are more likely to under-segment or over-segment the organs, the LeViT-UNet outputs are relatively smoother, which indicates an advantage in boundary prediction.

2.3. Fast Segmentation Comparisons

Mean DSC and HD of the proposed LeViT-UNet compared to other state-of-the-art semantic segmentation methods on the Synapse dataset, together with the number of parameters and the inference speed in FPS (frames per second). Numbers of parameters are listed in millions.

LeViT-UNet-384 achieves 78.53% mDSC and 16.84mm mHD, which is the best among all methods. It is much faster than TransUNet.

ENet (141 fps) and FPENet (160 fps) are slightly faster than LeViT-UNet-128s (114 fps), yet their HD still needs improvement. Therefore, it can be concluded that LeViT-UNet is competitive with current efficient pure-CNN segmentation methods while delivering better performance.

2.4. Ablation Study

Ablation study w/o Transformer blocks.

Adding Transformer blocks leads to a better segmentation performance in terms of DSC and HD. The Transformer block could improve performance owing to its innate global self-attention mechanisms.
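To illustrate what "global" means here, below is a minimal pure-Python sketch of scaled dot-product self-attention, in which every output token is a weighted sum over all input tokens. The optional additive `bias` argument is an assumption modeled on a LeViT-style attention bias; the function names are hypothetical, and a real implementation would use multiple heads and a tensor library.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, bias=None):
    """Single-head attention: Softmax(Q K^T / sqrt(d) + bias) V.

    Q, K, V are lists of token vectors; bias (optional, assumed) is a
    len(Q) x len(K) matrix added to the scores before the softmax.
    """
    d = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if bias is not None:
            scores = [s + bias[i][j] for j, s in enumerate(scores)]
        w = softmax(scores)  # one weight per input token, all positive
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights


# A zero query scores every key equally, so the output is the mean of V.
out, w = attention([[0.0, 0.0]],
                   [[1.0, 0.0], [0.0, 1.0]],
                   [[2.0, 0.0], [0.0, 4.0]])
print(w)    # [[0.5, 0.5]]
print(out)  # [[1.0, 2.0]]
```

Because the softmax assigns a non-zero weight to every token, each position can draw on the whole feature map in one step, whereas a 3×3 convolution only sees its immediate neighborhood.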

Ablation study on the number of skip connections in LeViT-UNet (’_N’ denotes the number of skip connections).
  • The “1-skip” setting applies a single skip connection at the 1/2 resolution scale. “2-skip”, “3-skip” and “4-skip” additionally insert skip connections at the 1/4, 1/8 and 1/16 scales, respectively.
  • Adding more skip connections results in better performance. Moreover, the performance gain is more pronounced for smaller organs.

Ablation study on the influence of the pre-training strategy (’-N’ means without pre-training on ImageNet).

For LeViT-UNet-128s and LeViT-UNet-192, DSC is higher without pre-training. For LeViT-UNet-384, however, pre-training is helpful.

2.5. SOTA Comparisons on ACDC

Segmentation performance of different methods on the ACDC dataset.

Compared with Swin-Unet and TransUNet, LeViT-UNet achieves comparable DSC; for instance, LeViT-UNet-192 and LeViT-UNet-384 achieve 90.08% and 90.32% DSC, respectively.


