Review — TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

TransUNet, Convolutions+Transformers as Encoder

Feb 23, 2023

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation,
TransUNet, by Johns Hopkins University, University of Electronic Science and Technology of China, Stanford University, East China Normal University, and PAII Inc.
2021 arXiv v1, Over 1000 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net, Transformer, Vision Transformer, ViT

TransUNet is proposed, which combines the merits of both Transformers and U-Net.
The Transformer encodes tokenized image patches from a convolutional neural network (CNN) feature map as the input sequence for extracting global contexts.
On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization.

Outline

1. TransUNet
2. Results

1. TransUNet

1.1. Transformer as Encoder

Tokenization is first performed by reshaping the input x into a sequence of flattened 2D patches {x_p^i}, i = 1, …, N, where each patch is of size P×P and N=HW/P² is the number of image patches (i.e., the input sequence length).
The vectorized patches {x_p^i} are projected into a latent D-dimensional embedding space using a trainable linear projection. Learned position embeddings are also added:

z_0 = [x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos

where E is the patch embedding projection and E_pos is the position embedding.
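The tokenization and embedding step can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the function names `patchify` and `embed` are made up here, E is a random stand-in for the trainable projection, and E_pos is set to zero in place of learned position embeddings.

```python
import numpy as np

def patchify(x, P):
    """Split an (H, W, C) image into N = HW/P^2 flattened patches of length P*P*C."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    x = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, P * P * C)

def embed(patches, E, E_pos):
    """z_0 = [x_p^1 E; ...; x_p^N E] + E_pos."""
    return patches @ E + E_pos

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 768            # D = 768 assumes a ViT-Base-sized model
x = rng.standard_normal((H, W, C))
E = rng.standard_normal((P * P * C, D)) * 0.02  # stand-in for the trainable projection
E_pos = np.zeros((H * W // P**2, D))            # stand-in for learned position embeddings

patches = patchify(x, P)
z0 = embed(patches, E, E_pos)
print(patches.shape, z0.shape)  # (196, 768) (196, 768)
```

With H=W=224 and P=16, the sequence length is N=196, and each row of `patches` is one P×P×C patch in row-major order.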
The Transformer encoder consists of L layers of Multihead Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. Therefore the output of the l-th layer can be written as follows:

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = MLP(LN(z'_l)) + z'_l

where LN() denotes the layer normalization operator and z_L is the encoded image representation.
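To make the layer structure concrete, here is a minimal NumPy sketch of one pre-norm Transformer layer: LN, then MSA with a residual connection, then LN, then MLP with a second residual connection. It is a simplification under stated assumptions: biases and the learnable LN scale/shift are omitted, GELU uses the tanh approximation, and the parameter names (`Wqkv`, `Wo`, `W1`, `W2`) and toy sizes are hypothetical.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)      # learnable scale/shift omitted

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def msa(z, Wqkv, Wo, heads):
    N, D = z.shape
    d = D // heads
    q, k, v = np.split(z @ Wqkv, 3, axis=-1)  # each (N, D)
    # (N, D) -> (heads, N, d)
    q, k, v = (t.reshape(N, heads, d).transpose(1, 0, 2) for t in (q, k, v))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, D)      # concatenate heads
    return out @ Wo

def mlp(z, W1, W2):
    h = z @ W1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU (tanh approx.)
    return h @ W2

def encoder_layer(z, p, heads):
    z_mid = msa(layer_norm(z), p["Wqkv"], p["Wo"], heads) + z  # MSA(LN(z_{l-1})) + z_{l-1}
    return mlp(layer_norm(z_mid), p["W1"], p["W2"]) + z_mid     # MLP(LN(z'_l)) + z'_l

rng = np.random.default_rng(0)
N, D, M, heads = 4, 8, 16, 2                   # toy sizes for illustration only
params = {
    "Wqkv": rng.standard_normal((D, 3 * D)) * 0.1,
    "Wo": rng.standard_normal((D, D)) * 0.1,
    "W1": rng.standard_normal((D, M)) * 0.1,
    "W2": rng.standard_normal((M, D)) * 0.1,
}
z = rng.standard_normal((N, D))
out = encoder_layer(z, params, heads)
print(out.shape)  # (4, 8)
```

Stacking L such layers gives z_L. Because both sub-blocks are residual, a layer whose output projections are zero simply passes its input through.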

1.2. Naive Solution

For segmentation purposes, an intuitive solution is to simply upsample the encoded feature representation z_L to the full resolution for predicting the dense output.
To recover the spatial order, the size of the encoded feature should first be reshaped from HW/P² to (H/P)×(W/P).
A 1×1 convolution is used to reduce the channel size of the reshaped feature to the number of classes, and then the feature map is directly bilinearly upsampled to the full resolution H×W for predicting the final segmentation outcome.

This naive upsampling baseline is named “None” in the decoder design. This strategy is not optimal: since (H/P)×(W/P) is usually much smaller than the original resolution H×W, it inevitably results in a loss of low-level details.
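A rough NumPy sketch of this naive decoder is given below. Two things are assumptions for illustration: the upsampling is nearest-neighbour (a stand-in for the bilinear upsampling described above, to keep the code short), and the class count of 9 is a hypothetical value. The function name `naive_decoder` is also made up.

```python
import numpy as np

def naive_decoder(z_L, H, W, P, W_cls):
    """Reshape tokens to a grid, apply a 1x1 conv, and upsample by a factor of P."""
    N, D = z_L.shape
    h, w = H // P, W // P
    assert N == h * w, "sequence length must equal (H/P)*(W/P)"
    feat = z_L.reshape(h, w, D)              # recover the spatial order
    logits = feat @ W_cls                    # a 1x1 conv is a per-position linear map
    # Nearest-neighbour upsampling as a simple stand-in for bilinear upsampling.
    logits = logits.repeat(P, axis=0).repeat(P, axis=1)
    return logits.argmax(axis=-1)            # (H, W) label map

rng = np.random.default_rng(0)
H, W, P, D, num_classes = 224, 224, 16, 768, 9   # num_classes is a hypothetical value
z_L = rng.standard_normal((H * W // P**2, D))
W_cls = rng.standard_normal((D, num_classes))
seg = naive_decoder(z_L, H, W, P, W_cls)
print(seg.shape)  # (224, 224)
```

Every 16×16 block of the output shares a single prediction taken from a 14×14 grid, which illustrates why fine boundaries are hard to recover this way.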

1.3. CNN-Transformer Hybrid as Encoder

TransUNet employs a hybrid CNN-Transformer architecture as the encoder as well as a cascaded upsampler to enable precise localization.

As in the figure, a CNN is first used as a feature extractor to generate a feature map for the input. Patch embedding is applied to 1×1 patches extracted from the CNN feature map instead of from raw images.
This design is chosen because 1) it allows the intermediate high-resolution CNN feature maps to be leveraged in the decoding path; and 2) the hybrid CNN-Transformer encoder is found to perform better than simply using a pure Transformer as the encoder.
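The shape bookkeeping of the hybrid encoder can be sketched as follows. The total CNN stride of 16 and the skip strides (2, 4, 8) are assumptions chosen so that the token count matches a ViT with P=16 and the skip features sit at 1/2, 1/4, and 1/8 resolution; `hybrid_encoder_shapes` is a hypothetical helper, not part of any library.

```python
def hybrid_encoder_shapes(H, W, cnn_stride=16, skip_strides=(2, 4, 8)):
    """Feature-map sizes for a CNN-Transformer hybrid encoder (assumed strides)."""
    fh, fw = H // cnn_stride, W // cnn_stride
    return {
        "cnn_feature_map": (fh, fw),
        "num_tokens": fh * fw,  # 1x1 patches: one token per feature-map position
        "skip_maps": [(H // s, W // s) for s in skip_strides],
    }

shapes = hybrid_encoder_shapes(224, 224)
print(shapes["cnn_feature_map"], shapes["num_tokens"])  # (14, 14) 196
print(shapes["skip_maps"])  # [(112, 112), (56, 56), (28, 28)]
```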

1.4. Cascaded Upsampler (CUP)

A cascaded upsampler (CUP) is proposed, as in the figure above.
Multiple upsampling blocks are cascaded for reaching the full resolution from (H/P)×(W/P) to H×W, where each block consists of a 2× upsampling operator, a 3×3 convolution layer, and a ReLU layer successively.
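One such block can be sketched in NumPy as below. This is a toy version under assumptions: the 2× upsampling is nearest-neighbour (the actual interpolation mode is not stated here), the convolution uses zero padding, and the channel counts are arbitrary small numbers rather than the paper's.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) map (assumed interpolation mode)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv3x3(x, w, b):
    """3x3 convolution with zero padding; x: (H, W, Cin), w: (3, 3, Cin, Cout), b: (Cout,)."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Each kernel tap is a shifted view of the padded input times a (Cin, Cout) matrix.
            out += xp[dy:dy + H, dx:dx + W, :] @ w[dy, dx]
    return out + b

def cup_block(x, w, b):
    """One block: 2x upsampling -> 3x3 convolution -> ReLU."""
    return np.maximum(conv3x3(upsample2x(x), w, b), 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 8))         # toy channel counts, not the paper's
w = rng.standard_normal((3, 3, 8, 4)) * 0.1
b = np.zeros(4)
y = cup_block(x, w, b)
print(y.shape)  # (28, 28, 4)
```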

1.5. Model Details

For the pure Transformer-based encoder, ViT with 12 Transformer layers is simply adopted.
For the hybrid encoder design, ResNet-50 and ViT are combined, denoted as “R50-ViT”, throughout this paper.
All Transformer backbones (i.e. ViT) and ResNet-50 (denoted as “R-50”) were pretrained on ImageNet.
The input resolution and patch size P are set as 224×224 and 16, unless otherwise specified. Therefore, four 2× upsampling blocks are cascaded consecutively in CUP.
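The block count follows from simple arithmetic: going from H/P back to H with 2× steps takes log2(P) blocks. A small sketch (the helper name `cup_schedule` is made up for illustration):

```python
import math

def cup_schedule(H, P):
    """Number of 2x blocks and the side length after each block, from H/P up to H."""
    n_blocks = int(math.log2(P))
    assert 2 ** n_blocks == P, "P is assumed to be a power of two"
    sizes = [H // P * 2 ** k for k in range(n_blocks + 1)]
    return n_blocks, sizes

print(cup_schedule(224, 16))  # (4, [14, 28, 56, 112, 224])
```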

2. Results

2.1. CT Dataset

**Comparison on the Synapse multi-organ CT dataset (average Dice score % and average Hausdorff distance in mm, and Dice score % for each organ).**

Compared with ViT-None, ViT-CUP improves both DSC and HD.
Similarly, compared with ViT-CUP, R50-ViT-CUP brings an additional improvement.

Built upon R50-ViT-CUP, TransUNet, which is also equipped with skip-connections, achieves the best result among the different variants of Transformer-based models.
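For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch on sets of foreground pixel coordinates. It is an illustration only: distances are in pixel units (reporting mm would require the voxel spacing), and the plain maximum Hausdorff distance is used, whereas evaluation toolkits may use a percentile variant.

```python
import math

def dice(pred, gt):
    """Dice score 2|A∩B| / (|A| + |B|) for two sets of foreground pixel coordinates."""
    if not pred and not gt:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & gt) / (len(pred) + len(gt))

def hausdorff(a, b):
    """Symmetric (maximum) Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))

gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 1), (1, 1), (0, 2), (1, 2)}   # same shape, shifted one pixel to the right
print(dice(pred, gt))       # 0.5
print(hausdorff(pred, gt))  # 1.0
```

A higher DSC is better, while a lower HD is better, which is why the two columns move in opposite directions when a model improves.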

2.2. Ablation Studies

**Ablation study on the number of skip-connections in TransUNet.**

In the “1-skip” setting, the skip-connection is only added at the 1/4 resolution scale.

The best average DSC and HD are achieved by inserting skip-connections to all three intermediate upsampling steps of CUP except the output layer, i.e., at 1/2, 1/4, and 1/8 resolution scales.
The performance gain of smaller organs is more evident than that of larger organs. These results reinforce the initial intuition of integrating U-Net-like skip-connections into the Transformer design to enable learning precise low-level details.
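How a skip-connection enters one upsampling step can be sketched as follows, assuming U-Net-style channel concatenation of the upsampled decoder feature with the CNN feature map of the same resolution. The channel counts and the nearest-neighbour upsampling are assumptions for illustration only.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) map (assumed interpolation mode)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse_with_skip(decoder_feat, skip_feat):
    """Upsample the decoder feature and concatenate the encoder skip feature along channels."""
    up = upsample2x(decoder_feat)
    assert up.shape[:2] == skip_feat.shape[:2], "spatial sizes must match"
    return np.concatenate([up, skip_feat], axis=-1)

rng = np.random.default_rng(0)
decoder_feat = rng.standard_normal((14, 14, 32))  # 1/16 scale, toy channel count
skip_feat = rng.standard_normal((28, 28, 16))     # 1/8-scale CNN feature, toy channel count
fused = fuse_with_skip(decoder_feat, skip_feat)
print(fused.shape)  # (28, 28, 48)
```

The fused map would then go through the 3×3 convolution and ReLU of the block, so the high-resolution CNN detail is available when the next scale is predicted.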

**Ablation study on the influence of input resolution.**

For TransUNet, changing the input resolution from 224×224 to 512×512 results in a 6.88% improvement in average DSC, at the expense of a much larger computational cost.
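A back-of-the-envelope view of that cost, assuming the patch size stays at 16: the sequence length grows with the image area, and the self-attention score matrix has N×N entries, so the attention cost grows roughly quadratically in N. This ignores the CNN and MLP parts, which scale differently.

```python
def seq_len(resolution, P=16):
    """Number of tokens N = (resolution / P)^2 for a square input."""
    return (resolution // P) ** 2

n224, n512 = seq_len(224), seq_len(512)
attn_ratio = (n512 / n224) ** 2   # ratio of N x N attention-matrix sizes
print(n224, n512)                 # 196 1024
print(round(attn_ratio, 1))       # 27.3
```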

**Ablation study on the patch size and the sequence length.**

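A higher segmentation performance is usually obtained with a smaller patch size. Note that the sequence length is inversely proportional to the square of the patch size, so a smaller patch size means a longer sequence.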

For the “base” model, the hidden size D, number of layers, MLP size, and number of heads are set to be 768, 12, 3072, and 12, respectively, while those hyperparameters for the “large” model are 1024, 24, 4096, and 16.
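The two model sizes can be written down as small configs, using the standard ViT-Base and ViT-Large values (hidden size 768 vs. 1024, 12 vs. 24 layers). The weight count below is a rough estimate that only covers the attention and MLP weight matrices, ignoring biases, LayerNorm, and embeddings; the class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTConfig:
    hidden_size: int   # D
    num_layers: int    # L
    mlp_size: int
    num_heads: int

    def encoder_weight_count(self):
        """Approximate weight count of the Transformer layers (matrices only)."""
        attn = 4 * self.hidden_size ** 2            # Q, K, V and output projections
        mlp = 2 * self.hidden_size * self.mlp_size  # two linear layers
        return self.num_layers * (attn + mlp)

BASE = ViTConfig(hidden_size=768, num_layers=12, mlp_size=3072, num_heads=12)
LARGE = ViTConfig(hidden_size=1024, num_layers=24, mlp_size=4096, num_heads=16)

print(BASE.encoder_weight_count())   # 84934656
print(LARGE.encoder_weight_count())  # 301989888
```

Under this estimate the “large” encoder has roughly 3.6 times the weights of the “base” one, while the per-head dimension (64) is the same in both.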