Review — UNETR: Transformers for 3D Medical Image Segmentation

UNETR, ViT as Encoder, CNN as Decoder

Sik-Ho Tsang
5 min read · Mar 5


UNETR consists of a Transformer encoder that directly utilizes 3D patches and is connected to a CNN-based decoder via skip connections.

UNETR: Transformers for 3D Medical Image Segmentation,
UNETR, by NVIDIA and Vanderbilt University,
2022 WACV, Over 340 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net, Transformer, Vision Transformer, ViT

4.2. Biomedical Image Segmentation
2015 … 2021
[Expanded U-Net] [3-D RU-Net] [nnU-Net] [TransUNet] [CoTr] [TransBTS] [Swin-Unet]

  • The task of volumetric (3D) medical image segmentation is reformulated as a sequence-to-sequence prediction problem.
  • UNEt TRansformers (UNETR) is introduced. It uses a Transformer as the encoder to learn sequence representations of the input volume and capture global multi-scale information, while following the “U-shaped” network design for the encoder and decoder.


  1. UNEt TRansformers (UNETR)
  2. Results

1. UNEt TRansformers (UNETR)

1.1. 3D ViT as Backbone

  • A 3D input volume x of size H×W×D×C, with resolution (H, W, D) and C input channels, is converted into a 1D sequence by dividing it into flattened, uniform, non-overlapping patches xv of size N×(P³·C), where (P, P, P) denotes the resolution of each patch and N=(H×W×D)/P³ is the length of the sequence. The patch resolution is P=16.
  • Subsequently, a linear layer is used to project the patches into a K dimensional embedding space.
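  • A 1D learnable positional embedding Epos of size N×K is added to the patch embeddings, which are obtained with the projection matrix E of size (P³·C)×K:
      $$z_0 = [x_v^1 E;\ x_v^2 E;\ \ldots;\ x_v^N E] + E_{pos}$$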
  • The learnable [class] token is not added. Embedding size K=768.
  • A stack of Transformer blocks is used, each comprising multi-head self-attention (MSA) and multilayer perceptron (MLP) sublayers:
      $$z'_i = \mathrm{MSA}(\mathrm{Norm}(z_{i-1})) + z_{i-1}, \qquad z_i = \mathrm{MLP}(\mathrm{Norm}(z'_i)) + z'_i, \qquad i = 1, \ldots, L$$
  • where Norm() denotes layer normalization, the MLP comprises two linear layers with GELU activation functions, i is the intermediate block identifier, and L is the number of Transformer layers.
  • The Transformer-based encoder follows ViT-B/16, with L=12.
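  • An MSA sublayer comprises n parallel self-attention (SA) heads. The attention weights (A) are computed by measuring the similarity between two elements in z and their key-value pairs:
      $$A = \mathrm{Softmax}\left(\frac{q k^\top}{\sqrt{K_h}}\right), \qquad \mathrm{SA}(z) = A v, \qquad \mathrm{MSA}(z) = [\mathrm{SA}_1(z);\ \mathrm{SA}_2(z);\ \ldots;\ \mathrm{SA}_n(z)]\, W_{msa}$$
  • where q, k and v are the query, key and value representations of z, K_h = K/n is the per-head dimension used as the scaling factor, and Wmsa represents the multi-headed trainable parameter weights.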
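
To make the attention step concrete, here is a minimal pure-Python sketch of a single SA head, assuming the standard scaled dot-product form A = softmax(q kᵀ / √K_h) followed by the weighted sum A v. The function names and the toy numbers are hypothetical and not taken from the paper; a real implementation would use batched tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q, k, v):
    """One SA head: A = softmax(q k^T / sqrt(K_h)), output = A v.

    q, k, v are lists of token vectors; q and k vectors have length K_h.
    """
    k_h = len(q[0])
    scale = math.sqrt(k_h)
    # Attention weights: one row per query token, one column per key token.
    attn = [
        softmax([sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in k])
        for q_row in q
    ]
    # Each output token is a weighted average of the value vectors.
    out = [
        [sum(a * v_row[d] for a, v_row in zip(a_row, v)) for d in range(len(v[0]))]
        for a_row in attn
    ]
    return attn, out

# Toy example: N = 3 tokens, K_h = 2 (hypothetical numbers, not from the paper).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attn, out = self_attention(q, k, v)
```

Every row of attn sums to 1, so each output token is a convex combination of all value vectors. This is what lets every patch attend to every other patch in the volume from the first layer onward.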

1.2. U-Net-Like Encoder Decoder Architecture

Overview of UNETR architecture.
  • Similar to U-Net, features from multiple resolutions of the encoder are merged with the decoder. A sequence representation zi (i ∈ {3,6,9,12}) of size (H×W×D/P³)×K is extracted from the Transformer and reshaped into a tensor of size (H/P)×(W/P)×(D/P)×K.
  • At each resolution, the reshaped tensors are projected from the embedding space into the input space by utilizing consecutive 3×3×3 convolutional layers that are followed by batch normalization layers.
  • At the bottleneck of the encoder (i.e. the output of the last Transformer layer), a deconvolutional layer is applied to the transformed feature map to increase its resolution by a factor of 2.
  • The resized feature map is concatenated with the feature map of the previous Transformer output (e.g. z9) and fed into consecutive 3×3×3 convolutional layers, and the output is upsampled using a deconvolutional layer. This process is repeated for all subsequent layers up to the original input resolution.
  • The final output is fed into a 1×1×1 convolutional layer with a softmax activation function to generate voxel-wise semantic predictions.
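
The resolution bookkeeping of the decoder can be sketched as follows. This is only shape arithmetic under stated assumptions: the 96×96×96 input size and the function name are hypothetical, and the convolutions, normalization and concatenations themselves are not modelled. Starting from the (H/P)×(W/P)×(D/P) token grid at the bottleneck, each deconvolution doubles every axis until the input resolution is reached.

```python
def decoder_resolutions(input_size, patch_size=16):
    """Spatial sizes seen while upsampling by 2x from the bottleneck to the input.

    input_size: (H, W, D) of the input volume; each must be divisible by patch_size.
    """
    if patch_size < 1 or patch_size & (patch_size - 1):
        raise ValueError("patch size must be a power of two for exact 2x upsampling")
    if any(s % patch_size for s in input_size):
        raise ValueError("each input dimension must be divisible by the patch size")
    size = tuple(s // patch_size for s in input_size)  # token grid at the bottleneck
    sizes = [size]
    while size[0] < input_size[0]:
        size = tuple(2 * s for s in size)  # one deconvolution doubles every axis
        sizes.append(size)
    return sizes

# Hypothetical 96x96x96 input volume with the P = 16 patches used in the review.
schedule = decoder_resolutions((96, 96, 96))
for step, size in enumerate(schedule):
    print(step, size)
```

With P=16, four doublings (log2 of 16) are needed to get from the bottleneck back to the input resolution, which matches the four encoder taps z3, z6, z9 and z12.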

1.3. Loss Function

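  • The loss function is a combination of soft dice loss and cross-entropy loss:
      $$\mathcal{L}(G, Y) = 1 - \frac{2}{J} \sum_{j=1}^{J} \frac{\sum_{i=1}^{I} G_{i,j} Y_{i,j}}{\sum_{i=1}^{I} G_{i,j}^2 + \sum_{i=1}^{I} Y_{i,j}^2} - \frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} G_{i,j} \log Y_{i,j}$$
  • where I is the number of voxels, J is the number of classes, and Y_{i,j} and G_{i,j} denote the probability output and the one-hot encoded ground truth for class j at voxel i.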
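
A minimal pure-Python sketch of such a combined loss is given below. It assumes the soft Dice term is averaged over classes with squared terms in the denominator and the cross-entropy term is averaged over voxels. The function name, the eps smoothing constant and the list-of-lists layout are illustrative assumptions, not the authors' code.

```python
import math

def dice_ce_loss(probs, onehot, eps=1e-8):
    """Soft Dice loss plus cross-entropy for I voxels and J classes.

    probs[i][j]  : predicted probability of class j at voxel i (rows sum to 1).
    onehot[i][j] : 1.0 if voxel i belongs to class j, else 0.0.
    """
    num_voxels = len(probs)
    num_classes = len(probs[0])

    # Soft Dice term, averaged over classes, with squared terms in the denominator.
    dice = 0.0
    for j in range(num_classes):
        inter = sum(onehot[i][j] * probs[i][j] for i in range(num_voxels))
        denom = sum(onehot[i][j] ** 2 + probs[i][j] ** 2 for i in range(num_voxels))
        dice += 2.0 * inter / (denom + eps)  # eps (assumed) avoids division by zero
    dice_loss = 1.0 - dice / num_classes

    # Cross-entropy term, averaged over voxels.
    ce_loss = -sum(
        onehot[i][j] * math.log(probs[i][j] + eps)
        for i in range(num_voxels)
        for j in range(num_classes)
    ) / num_voxels

    return dice_loss + ce_loss

# Two voxels, two classes: a perfect prediction versus a maximally uncertain one.
target = [[1.0, 0.0], [0.0, 1.0]]
perfect = dice_ce_loss([[1.0, 0.0], [0.0, 1.0]], target)
uncertain = dice_ce_loss([[0.5, 0.5], [0.5, 0.5]], target)
```

The Dice term rewards region overlap per class, which helps with small structures, while the cross-entropy term supplies a per-voxel gradient signal; summing them is a common way to get both effects.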

2. Results

2.1. BTCV

Quantitative comparisons of segmentation performance on the BTCV test set. Top and bottom sections represent the benchmarks of Standard and Free Competitions respectively.

UNETR outperforms the state-of-the-art methods for both Standard and Free Competitions on the BTCV leaderboard.

Qualitative comparison of different baselines in BTCV cross-validation.

UNETR shows improved segmentation performance for abdominal organs.

2.2. MSD

Quantitative comparisons of the segmentation performance in brain tumor and spleen segmentation tasks of the MSD dataset.

For brain segmentation, UNETR outperforms the closest baseline by 1.5% on average over all semantic classes. In particular, UNETR performs considerably better in segmenting the tumor core (TC) sub-region.

Similarly, for spleen segmentation, UNETR outperforms the best competing methodology, CoTr, by at least 1.0% in terms of Dice score.

UNETR effectively captures the fine-grained details in segmentation outputs.

UNETR demonstrates better performance in capturing the fine-grained details of tumors.

2.3. Ablation Studies

Effect of the decoder architecture on segmentation performance. NUP, PUP and MLA denote Naive UpSampling, Progressive UpSampling and Multi-scale Aggregation.
  • The encoder of UNETR is still used but the decoder is replaced with 3D counterparts of Naive UPsampling (NUP), Progressive UPsampling (PUP) and MuLti-scale Aggregation (MLA) from SETR.

Yet, these decoder architectures yield sub-optimal performance. UNETR outperforms MLA, PUP and NUP by 1.4%, 2.3% and 3.2%, respectively.

Effect of patch resolution on segmentation performance.

Decreasing the patch resolution from 32 to 16 improves the performance by 1.1% and 0.8% in terms of average Dice score in spleen and brain segmentation tasks respectively.
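
The cost side of this trade-off follows from N=(H×W×D)/P³: a smaller patch gives a longer sequence. The sketch below works through the arithmetic for a hypothetical 96×96×96 single-channel volume. The volume size and the function name are assumptions; only P=16 and P=32 come from the text above.

```python
def sequence_shape(volume_size, channels, patch_size):
    """Return (N, patch_dim): sequence length and flattened patch size P^3 * C."""
    h, w, d = volume_size
    if h % patch_size or w % patch_size or d % patch_size:
        raise ValueError("volume dimensions must be divisible by the patch size")
    n = (h * w * d) // patch_size ** 3
    return n, patch_size ** 3 * channels

# Hypothetical 96x96x96 single-channel volume; only P = 16 and 32 come from the review.
for p in (32, 16):
    n, patch_dim = sequence_shape((96, 96, 96), channels=1, patch_size=p)
    print(f"P={p}: N={n}, patch_dim={patch_dim}, attention matrix entries={n * n}")
```

Halving P multiplies the sequence length by 8 and the number of entries in each attention matrix by 64, so the accuracy gain from finer patches comes with a noticeably higher compute and memory cost.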

Comparison of number of parameters, FLOPs and averaged inference time for various models in BTCV experiments.
  • UNETR is a moderate-sized model with 92.58M parameters and 41.19G FLOPs.

UNETR outperforms these CNN-based models while having a moderate model complexity. UNETR has the second lowest averaged inference time after nnUNet and is significantly faster than Transformer-based models such as SETR, TransUNet and CoTr.


