Review — Self-Supervised Multi-Modal Hybrid Fusion Network for Brain Tumor Segmentation

T1, T2, Flair & T1CE Modalities as Inputs for Tumor Segmentation

Sik-Ho Tsang
7 min read · Jun 20, 2023
Four modalities: T1, T2, Flair, and T1CE (from left to right).

Self-Supervised Multi-Modal Hybrid Fusion Network for Brain Tumor Segmentation,
Self-Supervised Multi-Modal, by Nanjing University of Science and Technology,
2022 JBHI (Sik-Ho Tsang @ Medium)

Biomedical Image Self-Supervised Learning
2018 … 2022 [BT-Unet] [Taleb JDiagnostics’22] [Self-Supervised Swin UNETR]
==== My Other Paper Readings Are Also Over Here ====

  • A multi-modal brain tumor segmentation framework is proposed:
  1. A multi-input architecture that learns features from multi-modal data.
  2. Hybrid Attentional Fusion Block (HAFB) is proposed to learn the correlation information between multi-modal data via attention.
  3. A self-supervised learning (SSL) strategy is proposed for brain tumor segmentation tasks.


  1. Multi-Modal Model Architecture
  2. Hybrid Attentional Fusion Block (HAFB)
  3. Self-Supervised Learning (SSL) Strategy
  4. Results

1. Multi-Modal Model Architecture

Multi-Modal Model Architecture

A U-Net design with skip connections is used as the multi-modal model architecture, as shown above.

1.1. Input & Encoder

The network supports independent information extraction from multiple modalities: a specific encoder is built for each one.

  • Each encoder is made up of four stages, with an overall downsampling rate of 16×.
  • An ASPP block, as in DeepLabv3, is implemented in the output, which employs cascaded and parallel atrous convolutions.
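The ASPP output block mentioned above can be sketched in PyTorch as follows. This is a generic DeepLabv3-style ASPP with parallel atrous convolutions; the dilation rates and channel widths here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal ASPP sketch in the spirit of DeepLabv3: parallel atrous
    convolutions at several dilation rates, concatenated and projected.
    Rates and channel sizes are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch,
                      kernel_size=3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r)
            for r in rates
        ])
        # Project the concatenated branches back to out_ch channels.
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees the same input at a different receptive field.
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats)

x = torch.randn(1, 64, 16, 16)
y = ASPP(64, 32)(x)
print(y.shape)  # torch.Size([1, 32, 16, 16])
```

Padding each 3 × 3 branch by its dilation rate keeps the spatial size unchanged, so the branches can be concatenated directly.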

The feature maps from the different encoders are then fused by the proposed hybrid attentional fusion block (HAFB), described later.

  • In total, there are four HAFB modules, one for each skip connection.

1.2. Decoder

  • Symmetrically to the encoder, the decoder upsamples feature maps 16× through four upsampling stages.

The high-level semantic feature maps obtained by multiple encoders are fused and then re-scaled to the original image resolution.

1.3. Loss Functions

  • Multi-modal images are denoted by {x1, …, xn} ∼ PD({x}), where n=4 for 4 modalities.
  • The same ground-truth y is used for all modalities.
  • Correspondingly, {θ1, . . ., θn} denote the parameters of each modality-specific encoder.
  • A semantic segmentation decoder with parameters ω is used. It is assumed that each sample in the input {xi} follows a categorical distribution. The aim can be described as:
  • where p denotes the order of the optimal ℓp-norm loss.

Concretely, a weighted sum of the Dice loss and cross-entropy loss is used as the training target:

  • where M is the number of classes and N is the number of pixels.
  • And the trade-off parameters are α=1 and β=0.5.
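The weighted training target above can be sketched as follows. This is a minimal PyTorch implementation of a soft Dice loss plus cross-entropy with α=1 and β=0.5 as stated in the paper; the exact Dice formulation (smoothing, class averaging) is my assumption:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, alpha=1.0, beta=0.5, eps=1e-6):
    """Weighted sum of soft Dice loss and cross-entropy loss
    (alpha=1, beta=0.5 as in the paper). logits: (N, M, H, W) with
    M classes; target: integer class labels of shape (N, H, W)."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]) \
               .permute(0, 3, 1, 2).float()
    # Soft Dice, averaged over the M classes.
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()
    return alpha * dice + beta * ce

logits = torch.randn(2, 4, 8, 8)           # M=4 classes
target = torch.randint(0, 4, (2, 8, 8))
loss = dice_ce_loss(logits, target)
print(loss.item() > 0)  # True
```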

1.4. Backbone

2. Hybrid Attentional Fusion Block (HAFB)

2.1. Upsampled Feature Fusion

  • In U-Net, the upsampled feature map from the higher level is fused with the feature at the current level by simple concatenation:
Feature Fusion at Skip Connection (“Contact” in the figure should be “concatenation”, I think)
  • In the proposed network, the fusion scheme shown above is used at the different levels instead.

Specifically, the upsampled feature map is global average pooled (GAP) to obtain an output of dimensions C1 × 1 × 1. A 1 × 1 convolution φ then scales it to C2 × 1 × 1, which is multiplied with the skip-connection feature:
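A minimal PyTorch sketch of this skip-connection fusion, assuming the pooled-and-convolved vector acts as a channel-wise gate on the encoder skip feature (the module and variable names are mine, not the paper's):

```python
import torch
import torch.nn as nn

class UpFusion(nn.Module):
    """Sketch of the described fusion: the upsampled decoder feature is
    global-average-pooled to C1x1x1, a 1x1 convolution (phi) maps it to
    C2x1x1, and the result re-weights the encoder skip feature by
    channel-wise multiplication."""
    def __init__(self, c_up, c_skip):
        super().__init__()
        self.phi = nn.Conv2d(c_up, c_skip, kernel_size=1)

    def forward(self, up_feat, skip_feat):
        w = up_feat.mean(dim=(2, 3), keepdim=True)  # GAP -> (N, C1, 1, 1)
        w = self.phi(w)                             # 1x1 conv -> (N, C2, 1, 1)
        return skip_feat * w                        # broadcast multiply

up = torch.randn(1, 128, 8, 8)      # upsampled higher-level feature
skip = torch.randn(1, 64, 16, 16)   # current-level skip feature
out = UpFusion(128, 64)(up, skip)
print(out.shape)  # torch.Size([1, 64, 16, 16])
```

The 1 × 1 convolution matches the pooled channel count C1 to the skip feature's channel count C2, so the multiplication broadcasts over the spatial dimensions.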

2.2. Multi-Modal Feature Fusion

Hybrid Attentional Fusion Block (HAFB)

Besides, there are multiple modality-specific encoders at the current level, which means there are multiple encoder features to fuse.

  • To fuse them, one way is to use simple concatenation:
  • where {m1, …, mn}, each of size C×H×W, are the feature maps from an n-way modality-specific network.
  • However, when n is large, the concatenated feature map becomes very long, and the number of network parameters grows accordingly, which is difficult to control.

In this paper, element-wise summation, elementwise product, and element-wise maximum are used at the same time before concatenation:

  • Even if the number of feature maps n varies, the output of F always has a fixed length of three times the number of feature-map channels.
  • A batch normalization (BN) layer is applied after each operation, before the concatenation, to prevent data overflow.
  • The concatenated feature map is then passed through an attentional module.

A convolutional layer φ1 first reduces the dimensionality of the concatenated feature map to C×H×W; a second convolutional layer φ2 then restores it to 3C×H×W to improve the expressive ability. A Sigmoid function then limits the values to the range 0 to 1:

The role of the fusion weights W is to capture the important information in F, so the attended feature FA is:

Finally, the refined feature map FA is reduced to C×H×W through the convolutional layer φ3. The complete HAFB is:

  • All the above convolutional layers use 3 × 3 kernels, and ReLU activations are used.
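Putting the pieces together, the whole HAFB can be sketched as a PyTorch module. The sum/product/max branches, per-branch BN, 3C-channel concatenation, φ1/φ2 attention with Sigmoid, and φ3 projection follow the text above; any detail not stated there (e.g. exact BN placement relative to ReLU) is my assumption:

```python
import torch
import torch.nn as nn

class HAFB(nn.Module):
    """Sketch of the Hybrid Attentional Fusion Block: n modality features
    (each C x H x W) are combined by element-wise sum, product and maximum
    (each followed by BN), concatenated to 3C channels, re-weighted by an
    attention branch (phi1 reduces to C, phi2 restores to 3C, Sigmoid),
    and finally projected back to C channels by phi3."""
    def __init__(self, c):
        super().__init__()
        self.bn = nn.ModuleList([nn.BatchNorm2d(c) for _ in range(3)])
        self.phi1 = nn.Conv2d(3 * c, c, kernel_size=3, padding=1)
        self.phi2 = nn.Conv2d(c, 3 * c, kernel_size=3, padding=1)
        self.phi3 = nn.Conv2d(3 * c, c, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, feats):  # feats: list of n tensors, each (N, C, H, W)
        stacked = torch.stack(feats)                  # (n, N, C, H, W)
        s = self.bn[0](stacked.sum(dim=0))
        p = self.bn[1](stacked.prod(dim=0))
        m = self.bn[2](stacked.max(dim=0).values)
        fused = torch.cat([s, p, m], dim=1)           # fixed 3C channels
        w = torch.sigmoid(self.phi2(self.relu(self.phi1(fused))))
        attended = fused * w                          # F_A = F * W
        return self.phi3(attended)                    # back to C channels

feats = [torch.randn(1, 32, 16, 16) for _ in range(4)]  # n=4 modalities
out = HAFB(32)(feats)
print(out.shape)  # torch.Size([1, 32, 16, 16])
```

Note that the output shape is independent of n, which is exactly the fixed-length property the paper motivates.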

3. Self-Supervised Learning (SSL) Strategy

Self-Supervised Learning Strategy

Masking is used as the pretext task.

  • 20 × 20 is used as the size of the masked area, which corresponds to the average size of a tumor.
  • For a group of multi-modal images {x} = {xt1, xt2, xt1ce, xflair}, a 20 × 20 pixels region in each modality is masked to obtain a new group of images {x’} = {x’t1, x’t2, x’t1ce, x’flair}.
  • {x} and {x’} are sent to the upper and lower branches of the network. At the output of the encoders, two feature maps θ{x} and θ{x’} are obtained.

The cosine distance is used to estimate the similarity loss:

  • The total loss is:
  • (To me, SSL is usually used as pretraining without labels, and the pretrained model is then fine-tuned with labels. In this paper, I think the similarity loss (the SSL part here) and the supervised loss are used for training jointly.)
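The masking pretext task and the cosine-distance similarity loss above can be sketched as follows. Zero-filling the masked patch and the exact form 1 − cos(θ{x}, θ{x’}) are my assumptions; for the demo the "encoder features" are just the raw images:

```python
import torch
import torch.nn.functional as F

def mask_patch(x, size=20, top=None, left=None):
    """Zero out a size x size region in each modality image (N, C, H, W).
    A random location is drawn when none is given; zero-filling is an
    assumption about how the mask is applied."""
    _, _, h, w = x.shape
    top = torch.randint(0, h - size + 1, (1,)).item() if top is None else top
    left = torch.randint(0, w - size + 1, (1,)).item() if left is None else left
    x_masked = x.clone()
    x_masked[:, :, top:top + size, left:left + size] = 0.0
    return x_masked

def similarity_loss(f, f_masked):
    """Cosine-distance similarity loss between encoder features of the
    original and masked inputs: 1 - cos(theta(x), theta(x'))."""
    cos = F.cosine_similarity(f.flatten(1), f_masked.flatten(1), dim=1)
    return (1.0 - cos).mean()

x = torch.randn(2, 4, 64, 64)        # batch of 4-modality images
x_masked = mask_patch(x, size=20)
loss = similarity_loss(x, x_masked)  # features = raw images, for the demo
print(loss.item() >= 0)  # True
```

In training, the loss would be computed on the encoder outputs θ{x} and θ{x’} rather than on the images, and added to the supervised segmentation loss.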

4. Results

4.1. SOTA Comparisons

BraTS 2019

The proposed method obtains the best average Dice score and also leads in sensitivity and specificity performance.

4.2. Combinations of Different Modalities

Combinations of Different Modalities

When using all the modalities, better performance is achieved.

4.3. Qualitative Results

Qualitative Results

Compared to other networks, the proposed method is more sensitive to small areas. Further, even with missing modalities, the proposed approach does not completely fail.

4.4. Complexity


With multiple modalities, more parameters are used and higher FLOPs are incurred.

4.5. Ablation Studies

Effectiveness of SSL

The SSL (similarity) loss strengthens the generalization ability of the network and makes it more robust.

Fusion Strategies
  • Channel-wise concatenation, element-wise summation, elementwise product, element-wise maximum, and the proposed HAFB are tested.

The proposed HAFB achieves the best score for each class and outperforms channel-wise concatenation by 0.6%.

4.6. Limitations

Lack of Details

Some small, dissociated details are missing in these segmentation results.


