TransBTSV2: Towards Better and More Efficient Volumetric Segmentation of Medical Images

TransBTSV2, Hybrid CNN + Transformer, Better Than TransBTSV1

Sik-Ho Tsang
5 min read · May 31


Examples of medical images with the corresponding semantic segmentation annotations.

TransBTSV2: Towards Better and More Efficient Volumetric Segmentation of Medical Images,
TransBTSV2, by University of Science and Technology Beijing, and University of Central Florida,
2022 arXiv v3 (Sik-Ho Tsang @ Medium)

Biomedical Image Segmentation
2015 … 2022 [UNETR] [Half-UNet] [BUSIS] [RCA-IUNet] [Swin-Unet] [DS-TransUNet] [UNeXt] [AdwU-Net] 2023 [DCSAU-Net] [RMMLP]
==== My Other Paper Readings Are Also Over Here ====

  • TransBTSV2 is proposed, which is a hybrid CNN-Transformer architecture for volumetric segmentation.
  • Furthermore, a Deformable Bottleneck Module (DBM) is introduced to capture shape-aware local details.


  1. TransBTSV2
  2. Results

1. TransBTSV2

  • Given an input medical image X of size C×H×W×D, a modified 3D CNN is utilized to efficiently generate compact feature maps capturing volumetric spatial features, and then the redesigned Transformer encoder is leveraged to model the long-distance dependencies in a global space.
  • After that, the upsampling and convolutional layers are repeatedly applied to gradually produce a high-resolution segmentation result.

1.1. CNN Encoder

  • Because the computational complexity of the Transformer is quadratic with respect to the number of tokens, it is difficult to use a Transformer to model local image context across the spatial and depth dimensions for volumetric segmentation.
  • A stack of convolutional layers with downsampling (strided convolutions with stride = 2) is employed to gradually encode the input images into low-resolution/high-level feature representations F, whose H, W and D dimensions are 1/8 of the input's (overall stride (OS) = 8).

In this way, rich local 3D context features are effectively embedded. Then, F is fed into the Transformer encoder to further learn long-range correlations with a global receptive field.
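The encoder's shape bookkeeping can be sketched as follows (a minimal illustration, assuming a 4-modality 128³ BraTS input; not the authors' code):

```python
# Sketch of the CNN encoder's output shape: three stride-2 3D convolutions
# give an overall stride (OS) of 2^3 = 8 along H, W and D.

def encoder_output_shape(c_in, h, w, d, k=128, n_downsamples=3):
    """Return the (channels, H, W, D) shape after the strided-conv stack.

    k=128 is the channel width K from the paper; each stride-2 stage halves
    every spatial/depth dimension (assuming padding keeps sizes divisible).
    """
    s = 2 ** n_downsamples  # overall stride, 2^3 = 8
    return (k, h // s, w // s, d // s)

# A 4-modality 128x128x128 volume is reduced to a 128x16x16x16 feature map F.
print(encoder_output_shape(4, 128, 128, 128))  # (128, 16, 16, 16)
```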

1.2. Feature Embedding of Transformer Encoder

  • Concretely, a 3×3×3 convolutional layer is firstly used to increase the channel dimension from K=128 to d=512.
  • As the Transformer block expects a sequence as input, the spatial and depth dimensions are collapsed into one dimension, resulting in a d×N feature f (i.e. N d-dimensional tokens).
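The flattening step can be illustrated with NumPy (a sketch only; the 16³ spatial size follows from a 128³ input at OS = 8):

```python
import numpy as np

# Collapsing the spatial and depth dimensions of F into a token sequence:
# d = 512 channels over a 16x16x16 volume yields N = 16*16*16 = 4096 tokens.
d, h, w, dep = 512, 16, 16, 16
F = np.random.rand(d, h, w, dep).astype(np.float32)

f = F.reshape(d, -1)          # d x N feature map, N = h * w * dep
tokens = f.T                  # N tokens, each d-dimensional

print(f.shape, tokens.shape)  # (512, 4096) (4096, 512)
```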

To encode the location information necessary for the segmentation task, learnable position encodings are introduced and fused with the feature map f by direct addition, creating the feature embeddings as follows:

  • where W is the feature expansion module, PE is positional embeddings, and z0 is feature embeddings.
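The embedding equation appeared as an image in the original post; a hedged reconstruction from the symbols defined in the bullet above (not verified against the paper) is:

```latex
z_0 = W(f) + \mathrm{PE}
```

where W(f) denotes the feature expansion module applied to f.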

1.3. Transformer Blocks

The Transformer encoder is composed of L redesigned Transformer blocks, each consisting of a flexibly widened multi-head self-attention (FW-MHSA) block and a feed-forward network (FFN). The output of the l-th (l ∈ [1, 2, …, L]) Transformer block can be calculated by:
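(The per-block equations were images in the original post; written in the standard pre-norm residual form, which is an assumption consistent with the TransBTS lineage, they read:)

```latex
z'_l = \mathrm{FW\text{-}MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad
z_l  = \mathrm{FFN}(\mathrm{LN}(z'_l)) + z'_l
```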

1.4. Prefer Wider to Deeper Transformer Blocks

Here, the number of Transformer blocks L is reduced from 4 to 1, and wider feature vectors are used instead: with an expansion ratio E, the hidden dimensions of q and k are expanded to dm (i.e. dm = Ed) to further increase the Transformer width, while v remains unchanged (i.e. d = 512) to keep the dimensions of the input and output consistent.

  • Therefore, a single scaled dot-product self-attention is used in the FW-MHSA block.
  • DWConv is a 3D depth-wise convolution layer, introduced to bring a local inductive bias into the modified Transformer architecture and to further control the computational complexity.

Compared with deeper and narrower architecture, the design of the wider and shallower counterpart allows for more parallel processing, easier optimization and greatly reduced latency.
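As a rough illustration, the widening can be sketched in NumPy (single head, random weights, shapes only; the 3D depth-wise convolution on v is omitted for brevity, so this is not the authors' implementation):

```python
import numpy as np

# FW-MHSA widening sketch: q and k are projected to dm = E*d while v stays
# at d, so the attention output keeps the input width d.
rng = np.random.default_rng(0)
N, d, E = 64, 512, 2          # tokens, model width, expansion ratio
dm = E * d                    # widened hidden dimension for q and k

x = rng.standard_normal((N, d))
Wq = rng.standard_normal((d, dm)) / np.sqrt(d)
Wk = rng.standard_normal((d, dm)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

q, k, v = x @ Wq, x @ Wk, x @ Wv          # q, k: (N, dm); v: (N, d)
scores = q @ k.T / np.sqrt(dm)            # (N, N) scaled dot products
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
out = attn @ v                            # (N, d): input width preserved

print(out.shape)  # (64, 512)
```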

1.5. Decoder

  • A feature restoration module is used to project the sequence data back to a standard 4D feature map.
  • Two convolutional layers are employed to reduce the channel dimension from d to K, yielding the feature map Z.

After the feature mapping, cascaded upsampling operations and convolutional layers are applied to Z to gradually recover a full resolution segmentation result R.

  • Moreover, skip-connections are employed to fuse the encoder features with the decoder counterparts by concatenation for finer segmentation masks with richer spatial details.

1.6. Deformable Bottleneck Module (DBM)

  • Due to the fixed geometric structure of basic CNN modules, CNNs are inherently limited in modeling irregular-shape deformations. To address this, the proposed Deformable Bottleneck Module is designed to further capture shape-aware features from irregular-shape lesion regions.
  • A 3D deformable convolutional layer, a 3D extension of the 2D DCN, is used.
  • (Please feel free to read DCN for more details.)
  • The proposed DBMs are plugged right into each skip-connection.
  • To minimize the computational overhead brought by the proposed DBM, two convolutional blocks (i.e. the Reduction and Restoration layers) are deployed at both ends of the DBM to reduce and then restore the channel dimensions, as shown above.
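A minimal sketch of the DBM's bottleneck channel pattern, assuming a reduction ratio r = 4 (the ratio is my assumption, and the 3D deformable convolution itself is left as a placeholder, since common libraries provide no 3D deformable conv out of the box):

```python
import numpy as np

# DBM channel bottleneck sketch: Reduction and Restoration are 1x1x1 convs,
# i.e. per-voxel channel projections, implemented here as matrix products
# over flattened voxels. Not the authors' code.
rng = np.random.default_rng(0)
C, r = 128, 4                 # skip-connection channels; r is an assumption
h = w = dep = 16
x = rng.standard_normal((C, h * w * dep))    # voxels flattened for clarity

W_red = rng.standard_normal((C // r, C)) / np.sqrt(C)       # Reduction layer
W_res = rng.standard_normal((C, C // r)) / np.sqrt(C // r)  # Restoration layer

y = W_red @ x                 # (C/r, voxels): cheap input for the deformable conv
# ... the 3D deformable convolution would operate on y here ...
out = W_res @ y               # (C, voxels): channel dimension restored

print(out.shape)  # (128, 4096)
```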

2. Results

2.1. SOTA Comparisons

BraTS 2019
BraTS 2020
Left: LiTS 2017, Right: KiTS 2019

TransBTSV2 consistently outperforms SOTA approaches on all 4 datasets (BraTS 2019, BraTS 2020, LiTS 2017 and KiTS 2019).

2.2. Visual Comparisons

Visual Comparisons

TransBTSV2 obtains better segmentation.

2.3. Ablation Studies

Ablation Studies

With all proposed components, TransBTSV2 generally obtains competitive or higher Dice scores.

2.4. Loss Curves

Loss Curves
  • TransBTSV2 converges faster than TransBTS.
  • (There are other studies shown in the paper, please feel free to read the paper directly.)


