# Review — Self-Supervised Multi-Modal Hybrid Fusion Network for Brain Tumor Segmentation

## T1, T2, Flair & T1CE Modalities as Inputs for Tumor Segmentation

Self-Supervised Multi-Modal Hybrid Fusion Network for Brain Tumor Segmentation, by Nanjing University of Science and Technology, 2022 JBHI (Sik-Ho Tsang @ Medium)
Biomedical Image Self-Supervised Learning

2018 … 2022 [BT-Unet] [Taleb JDiagnostics’22] [Self-Supervised Swin UNETR]

==== My Other Paper Readings Are Also Over Here ====

- A **multi-modal brain tumor segmentation framework** is proposed:
- A **multi-input architecture** learns features from multi-modal data.
- A **Hybrid Attentional Fusion Block (HAFB)** is proposed to learn the correlation information between multi-modal data via attention.
- A **self-supervised learning (SSL) strategy** is proposed for brain tumor segmentation tasks.

# Outline

1. **Multi-Modal Model Architecture**
2. **Hybrid Attentional Fusion Block (HAFB)**
3. **Self-Supervised Learning (SSL) Strategy**
4. **Results**

# 1. Multi-Modal Model Architecture

A U-Net design with skip connections is used as the multi-modal model architecture, as shown above.

## 1.1. Input & Encoder

The network supports independent information extraction from multiple modalities: a specific encoder is built for each one.

- **Each encoder** is made up of **four stages**, with an overall downsampling rate of 16×.
- **An ASPP block**, as in **DeepLabv3**, is applied at the encoder **output**; it employs cascaded and parallel atrous convolutions.
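To make the ASPP idea concrete, here is a minimal sketch in the spirit of DeepLabv3's parallel atrous branches (a 1 × 1 branch plus dilated 3 × 3 branches, concatenated and projected). The dilation rates and layer layout are assumptions taken from DeepLabv3's defaults, not the paper's exact configuration.

```python
# ASPP sketch: parallel atrous 3x3 convolutions at several dilation rates,
# plus a 1x1 branch, concatenated and projected back to c_out channels.
# Rates (6, 12, 18) follow DeepLabv3 defaults and are an assumption here.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, c_in, c_out, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, 1)] +                      # 1x1 branch
            [nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r)  # atrous branches
             for r in rates])
        self.project = nn.Conv2d(c_out * (len(rates) + 1), c_out, 1)

    def forward(self, x):
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```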

The feature maps from different encoders are then fused by the proposed hybrid attentional fusion block (HAFB), which is described later.

- In total, there are **four HAFB modules**, one for each skip connection.

## 1.2. Decoder

- Symmetrically to the encoder, the decoder **upsamples feature maps 16×** with **four upsampling stages**.

The high-level semantic feature maps obtained by the multiple encoders are fused and then re-scaled to the original image resolution.

## 1.3. Loss Functions

- **Multi-modal images** are denoted by **{*x*1, …, *xn*} ∼ *PD*({*x*})**, where *n* = 4 for the 4 modalities.
- The same **ground truth** *y* is used for all modalities.
- Correspondingly, **{*θ*1, …, *θn*}** denote **the parameters of each modality-specific encoder**.
- A semantic segmentation **decoder** with **parameters** *ω* is used. It is assumed that each sample in the input {*xi*} follows a categorical distribution. The aim can be described as:

- where *lp* represents the optimal *lp*-norm loss.

Concretely, **a weighted sum of the Dice loss and cross-entropy loss** is used as the training target:

- where *M* is the number of classes and *N* is the number of pixels.
- The trade-off parameters are *α* = 1 and *β* = 0.5.
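The weighted Dice + cross-entropy target above can be sketched as follows. This is a minimal PyTorch rendering under assumed tensor shapes, not the authors' code; the helper names are hypothetical.

```python
# Sketch of the training target L = α·L_CE + β·L_Dice with α = 1, β = 0.5.
# logits: (B, M, H, W) raw scores, target: (B, H, W) integer class labels.
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over the M classes."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)  # sum over the batch and the N pixels
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()

def segmentation_loss(logits, target, alpha=1.0, beta=0.5):
    # Weighted sum of cross-entropy and Dice, with the paper's α = 1, β = 0.5
    return alpha * F.cross_entropy(logits, target) + beta * dice_loss(logits, target)
```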

## 1.4. Backbone

# 2. Hybrid Attentional Fusion Block (HAFB)

## 2.1. Upsampled Feature Fusion

- In U-Net, **simple concatenation** is used for fusion: the upsampled higher-level feature is concatenated with the feature at the current level.
- In the proposed network, the scheme above is used for fusion at different levels.

Specifically, the upsampled feature map is **global average pooled (GAP)** to obtain an output of dimensions *C*1 × 1 × 1. Using **a 1 × 1 convolution *φ***, the parameters are scaled to *C*2 × 1 × 1, and are then **multiplied** with the skip connection:
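The GAP-and-gate step above can be sketched as a small module. The channel sizes `c_up` (*C*1) and `c_skip` (*C*2) and the module name are assumptions for illustration; only the GAP → 1 × 1 convolution → multiply chain comes from the paper.

```python
# Sketch of the upsampled-feature gating: GAP the upsampled map to C1 x 1 x 1,
# map it to C2 x 1 x 1 with a 1x1 convolution φ, then multiply the skip feature.
import torch
import torch.nn as nn

class UpsampledGate(nn.Module):
    def __init__(self, c_up, c_skip):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)     # (B, C1, H, W) -> (B, C1, 1, 1)
        self.phi = nn.Conv2d(c_up, c_skip, 1)  # 1x1 conv: C1 -> C2 channels

    def forward(self, upsampled, skip):
        w = self.phi(self.gap(upsampled))      # (B, C2, 1, 1) channel weights
        return skip * w                        # broadcast multiply with the skip
```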

## 2.2. Multi-Modal Feature Fusion

Besides, there are multiple modality-specific encoders at the current level, which means there are **multi-modal encoder features** to fuse.

- One way to fuse them is simple concatenation:

- where **{*m*1, …, *mn*}** of size *C* × *H* × *W* are the feature maps from an *n*-way modality-specific network.
- However, when *n* is large, this concatenated feature map becomes too long, and the number of network parameters also increases and is difficult to control.

In this paper, **element-wise summation, element-wise product, and element-wise maximum** are used at the same time before concatenation:

- Even if the number of feature maps *n* is uncertain, the output *F* can always be adaptively maintained at a fixed length, which is **three times the number of feature-map channels** (3*C*).
- A **batch normalization (BN)** layer is applied after each operation, before the concatenation, to prevent data overflow.
- The concatenated feature map is then passed through **an attentional module**.
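The fixed-length fusion step can be sketched as below: one BN per operation, then concatenation, so the output channel count is always 3*C* regardless of *n*. Shapes and the module name are assumptions.

```python
# Sketch of the fixed-length fusion: element-wise sum, product, and maximum
# over the n modality feature maps, each batch-normalized, then concatenated.
import torch
import torch.nn as nn

class TripleFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One BN layer per operation, applied before concatenation
        self.bn_sum = nn.BatchNorm2d(channels)
        self.bn_prod = nn.BatchNorm2d(channels)
        self.bn_max = nn.BatchNorm2d(channels)

    def forward(self, feats):                  # feats: list of n (B, C, H, W) maps
        stacked = torch.stack(feats, dim=0)    # (n, B, C, H, W)
        s = self.bn_sum(stacked.sum(dim=0))
        p = self.bn_prod(stacked.prod(dim=0))
        m = self.bn_max(stacked.max(dim=0).values)
        return torch.cat([s, p, m], dim=1)     # (B, 3C, H, W): fixed length
```

Because the three reductions collapse the modality axis, the output width no longer depends on how many modalities *n* are available.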

A **convolutional layer *φ*1** is first used to reduce the dimensionality of the 3*C* × *H* × *W* feature map to *C* × *H* × *W*; it is then restored to the size of 3*C* × *H* × *W* through a **second convolutional layer *φ*2** to improve the expressive ability. A **Sigmoid** function is then used to limit the range from 0 to 1:

The role of the **fusion weights** *WF* is to capture the important information in *F*, so the attended feature *FA* is:

Finally, the dimensions of the refined feature map *FA* are reduced to *C* × *H* × *W* through the **convolutional layer *φ*3**. The complete HAFB is:

- All the above convolutional layers are of size 3 × 3, and ReLU is used.
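Putting the attentional refinement together, a minimal sketch follows the description above: *φ*1 squeezes 3*C* → *C*, *φ*2 restores *C* → 3*C*, a Sigmoid yields *WF*, *FA* = *WF* ⊙ *F*, and *φ*3 projects *FA* down to *C* channels. Where exactly ReLU is inserted is an assumption; everything else follows the text.

```python
# Sketch of HAFB's attentional module: φ1 (3C -> C), φ2 (C -> 3C), Sigmoid
# for weights W_F, elementwise attend F_A = W_F * F, then φ3 (3C -> C).
# All convolutions are 3x3 as stated in the review; ReLU placement is assumed.
import torch
import torch.nn as nn

class HAFBAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.phi1 = nn.Sequential(nn.Conv2d(3 * c, c, 3, padding=1), nn.ReLU())
        self.phi2 = nn.Conv2d(c, 3 * c, 3, padding=1)
        self.phi3 = nn.Sequential(nn.Conv2d(3 * c, c, 3, padding=1), nn.ReLU())

    def forward(self, fused):                           # fused F: (B, 3C, H, W)
        w = torch.sigmoid(self.phi2(self.phi1(fused)))  # fusion weights W_F in (0, 1)
        attended = w * fused                            # F_A = W_F ⊙ F
        return self.phi3(attended)                      # refined map: (B, C, H, W)
```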

# 3. Self-Supervised Learning (SSL) Strategy

**Masking** is used as the **pretext task**.

- **20 × 20** is used as the size of the masked area, which corresponds to the average size of a tumor.
- For a group of **multi-modal images** {*x*} = {*xt*1, *xt*2, *xt*1*ce*, *xflair*}, a 20 × 20 pixel region in each modality is **masked** to obtain a new group of images **{*x*’} = {*x*’*t*1, *x*’*t*2, *x*’*t*1*ce*, *x*’*flair*}**.
- {*x*} and {*x*’} are sent to the **upper and the lower branches** of the network. At the output of the encoders, **two feature maps** *θ*({*x*}) and *θ*({*x*’}) are obtained.
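The masking pretext task can be sketched as below. The paper only specifies the 20 × 20 size; the random placement and the zero fill value are assumptions.

```python
# Sketch of the masking pretext task: zero out a random 20x20 region in each
# modality image to build {x'} from {x}. Placement and fill value are assumed.
import torch

def mask_modalities(images, size=20, generator=None):
    """images: dict of modality name -> (B, 1, H, W) tensor; returns masked copies."""
    masked = {}
    for name, img in images.items():
        _, _, h, w = img.shape
        top = torch.randint(0, h - size + 1, (1,), generator=generator).item()
        left = torch.randint(0, w - size + 1, (1,), generator=generator).item()
        out = img.clone()                                  # keep the original intact
        out[:, :, top:top + size, left:left + size] = 0.0  # mask a 20x20 region
        masked[name] = out
    return masked
```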

The **cosine distance** is used to estimate the **similarity loss**:

- The **total loss** is:
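A sketch of the cosine similarity loss between *θ*({*x*}) and *θ*({*x*’}), combined with the supervised loss. The trade-off weight `lam` is hypothetical; the paper's exact total-loss weighting may differ.

```python
# Sketch: cosine distance between encoder features for the original and the
# masked inputs, added to the supervised segmentation loss.
import torch
import torch.nn.functional as F

def similarity_loss(feat, feat_masked):
    """Cosine distance between flattened feature maps θ({x}) and θ({x'})."""
    a = feat.flatten(1)           # (B, C*H*W)
    b = feat_masked.flatten(1)
    return (1.0 - F.cosine_similarity(a, b, dim=1)).mean()

def total_loss(seg_loss, feat, feat_masked, lam=1.0):
    # lam is a hypothetical trade-off weight, not taken from the paper
    return seg_loss + lam * similarity_loss(feat, feat_masked)
```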

- (To me, SSL is usually used as pretraining without labels, after which the pretrained model is fine-tuned with labels. In this paper, I think the similarity loss (the SSL part here) and the supervised loss are used jointly for training.)

# 4. Results

## 4.1. SOTA Comparisons

The proposed method obtains the **best average Dice score** and also leads in sensitivity and specificity performance.

## 4.2. Combinations of Different Modalities

When using **all the modalities**, **better performance** is achieved.

## 4.3. Qualitative Results

Compared to other networks, **the proposed method is more sensitive to small areas**. Further, even with missing modalities, the proposed approach does not completely fail.

## 4.4. Complexity

With multiple modalities, **more parameters** are used and **higher FLOPs** are obtained.

## 4.5. Ablation Studies

**The SSL, or the similarity loss,** strengthens the generalization ability of the network and makes it **more robust**.

- Channel-wise concatenation, element-wise summation, element-wise product, element-wise maximum, and the proposed HAFB are tested.

The proposed **HAFB** achieves **the best score** for each class and outperforms channel-wise concatenation by 0.6%.

## 4.6. Limitations

**Some dissociated details are missing** in these results.