Review — Non-local U-Nets for Biomedical Image Segmentation

U-Net + Transformer Block for Non-Local Attention

Sik-Ho Tsang
6 min read · Dec 10, 2022

Non-local U-Nets for Biomedical Image Segmentation, Non-local U-Net, by Texas A&M University and University of North Carolina at Chapel Hill
2020 AAAI, Over 110 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, Transformer, U-Net

  • Non-local U-Net, equipped with flexible global aggregation blocks, is proposed for biomedical image segmentation.

Outline

  1. Non-Local U-Net: Residual Block Variants
  2. Non-Local U-Net: Global Aggregation Block
  3. Experimental Results
  4. Ablation Studies

1. Non-Local U-Net: Residual Block Variants

1.1. Framework

U-Net framework employed by the proposed non-local U-Nets.
  • The basic U-Net framework is given above.
  • The input first goes through an encoding input block, which extracts low-level features.
  • Two down-sampling blocks are used to reduce the spatial sizes and obtain high-level features. The number of channels is doubled after each down-sampling block.
  • A bottom block then aggregates global information and produces the output of the encoder.
  • Correspondingly, the decoder uses two up-sampling blocks to recover the spatial sizes for the segmentation output. The number of feature maps is halved after an up-sampling operation.

Skip connections copy feature maps from the encoder to the decoder using summation instead of concatenation. There are two advantages. First, summation does not increase the number of feature maps, thus reducing the number of trainable parameters in the following layers. Second, skip connections with summation can be viewed as long-range residual connections.
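A rough illustration of the two skip-connection styles (a hypothetical sketch, not the authors' code): summation keeps the channel count unchanged, while concatenation doubles it.

```python
import torch

# Hypothetical encoder/decoder feature maps: (batch, channels, D, H, W)
enc_feat = torch.randn(1, 64, 16, 16, 16)   # copied from the encoder
dec_feat = torch.randn(1, 64, 16, 16, 16)   # up-sampled decoder features

# Summation skip connection (used by non-local U-Nets):
# channel count stays at 64, so the following layer needs no extra parameters,
# and the connection behaves like a long-range residual path.
fused_sum = enc_feat + dec_feat                      # shape: (1, 64, 16, 16, 16)

# Concatenation skip connection (classic U-Net):
# channel count doubles to 128, so the next convolution needs
# roughly twice as many input weights.
fused_cat = torch.cat([enc_feat, dec_feat], dim=1)   # shape: (1, 128, 16, 16, 16)
```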

1.2. Proposed Residual Blocks

Residual blocks employed by the proposed non-local U-Nets.
  • (a) Regular Residual Block: A skip connection is used around two consecutive convolutional layers, with batch normalization and the ReLU6 activation applied before each convolution. This block is used as the input block. The output block is this block followed by a 1×1×1 convolution with a stride of 1.
  • (b) Down-Sampling Residual Block: A 1×1×1 convolution with a stride of 2 is used to replace the identity residual connection.
  • (c) Bottom Block: Basically, a residual connection is applied on the proposed global aggregation block (which is described in the next section).
  • (d) Up-Sampling Residual Block: The identity residual connection is replaced by a 3×3×3 deconvolution with a stride of 2 and the other branch is the up-sampling global aggregation block.
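A minimal PyTorch sketch of blocks (a) and (b), assuming 3D convolutions with pre-activation (batch normalization + ReLU6 before each convolution) as described above; the layer arrangement and names are illustrative assumptions, not the authors' released code.

```python
import torch.nn as nn

class RegularResBlock(nn.Module):
    """(a) Two pre-activated 3x3x3 convolutions with an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(channels), nn.ReLU6(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU6(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class DownResBlock(nn.Module):
    """(b) Stride-2 down-sampling; the skip is a 1x1x1 convolution with stride 2."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1, stride=2)
        self.body = nn.Sequential(
            nn.BatchNorm3d(in_ch), nn.ReLU6(inplace=True),
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU6(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.skip(x) + self.body(x)
```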

2. Non-Local U-Net: Global Aggregation Block

Proposed global aggregation block.

To achieve global information fusion through a block, each position of the output feature maps should depend on all positions of the input feature maps. In fact, a fully-connected layer has this global property. However, it is prone to over-fitting and does not work well in practice.

In this paper, the self-attention mechanism of the Transformer is used instead. The same idea underlies non-local neural networks for video classification.

  • Let X represent the input to the global aggregation block and Y represent the output. For simplicity, Conv_1N denotes a 1×1×1 convolution with a stride of 1 and N output channels.
  • Left part of figure: The first step of the proposed block is to generate the query (Q), key (K) and value (V) matrices: Q = Unfold(QueryTransformCK(X)), K = Unfold(Conv_1CK(X)), V = Unfold(Conv_1CV(X)),
  • where Unfold() unfolds a D×H×W×C tensor into a (D×H×W)×C matrix.
  • Right part of figure: In the second step, the attention mechanism is applied on Q, K and V, as in the Transformer: A = Softmax(QKᵀ/√CK)V. The output for each query vector is a weighted sum of all value vectors, where the weights are normalized through Softmax.
  • Dropout can be applied on A to avoid over-fitting.
  • The final step of the block computes Y by Y = Fold(A),
  • where Fold() is the reverse operation of Unfold() and CO is a hyper-parameter representing the dimension of the outputs. Here, CK = CV = CO.

For the global aggregation block, QueryTransformCK() is Conv_1CK. For the up-sampling global aggregation block, QueryTransformCK() is a 3×3×3 deconvolution with a stride of 2.
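Putting the pieces together, below is a rough PyTorch sketch of the global aggregation block under the notation above (Unfold/Fold as reshapes, Conv_1N as a 1×1×1 convolution, QueryTransform as either Conv_1CK or a stride-2 deconvolution). It is an illustrative reconstruction, not the authors' released code.

```python
import torch
import torch.nn as nn

class GlobalAggregationBlock(nn.Module):
    """Self-attention over all spatial positions of a 3D feature map."""
    def __init__(self, in_ch, c_k, c_o, dropout=0.0, upsample=False):
        super().__init__()
        # QueryTransform: Conv_1CK for the bottom block, or a 3x3x3
        # deconvolution with stride 2 for the up-sampling block.
        if upsample:
            self.query = nn.ConvTranspose3d(in_ch, c_k, kernel_size=3, stride=2,
                                            padding=1, output_padding=1)
        else:
            self.query = nn.Conv3d(in_ch, c_k, kernel_size=1)
        self.key = nn.Conv3d(in_ch, c_k, kernel_size=1)    # Conv_1CK
        self.value = nn.Conv3d(in_ch, c_o, kernel_size=1)  # Conv_1CV (CV = CO)
        self.drop = nn.Dropout(dropout)
        self.c_k = c_k

    def forward(self, x):
        # x: (N, C, D, H, W)
        q, k, v = self.query(x), self.key(x), self.value(x)
        n, _, d, h, w = q.shape                 # query grid may be up-sampled
        # Unfold: flatten the spatial dimensions into a single axis
        q = q.flatten(2).transpose(1, 2)        # (N, D'H'W', CK)
        k = k.flatten(2).transpose(1, 2)        # (N, DHW,    CK)
        v = v.flatten(2).transpose(1, 2)        # (N, DHW,    CO)
        # Each output position is a Softmax-weighted sum of all value vectors
        weights = torch.softmax(q @ k.transpose(1, 2) / self.c_k ** 0.5, dim=-1)
        a = self.drop(weights @ v)              # dropout on A to avoid over-fitting
        # Fold: reshape back into a 3D feature map with CO channels
        return a.transpose(1, 2).reshape(n, -1, d, h, w)
```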

3. Experimental Results

3.1. Comparison with CC-3D-FCN

Comparison of segmentation performance between our proposed model and the baseline model in terms of DR.
Comparison of segmentation performance between our proposed model and the baseline model in terms of 3D-MHD.
  • CC-3D-FCN is used as baseline. CC-3D-FCN is a 3D fully convolutional network (3D-FCN) with convolution and concatenate (CC) skip connections.
  • The Dice ratio (DR) and the 3D modified Hausdorff distance (3D-MHD) are used as the evaluation metrics.
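For reference, the Dice ratio between a binary prediction P and ground truth G is 2|P∩G| / (|P| + |G|); a minimal NumPy sketch (a generic formula, not the paper's evaluation code):

```python
import numpy as np

def dice_ratio(pred, target, eps=1e-7):
    """DR = 2|P ∩ G| / (|P| + |G|) for binary segmentation masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)
```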

The non-local U-Nets achieve significant improvements over the baseline model.

Comparison of training processes and validation results.

The proposed model converges faster to a lower training loss. In addition, the better validation results indicate that the proposed model does not suffer from over-fitting.

Comparison of segmentation performance on the 13 testing subjects of iSeg-2017.

In the iSeg-2017 challenge, according to the leaderboard, the proposed model achieves one of the top performances.

3.2. Model & Time Complexity

Comparison of the number of parameters.
Comparison of inference time.

The proposed model reduces the number of parameters by 28% compared with CC-3D-FCN while achieving better performance.

3.3. Visualizations

Visualization of the segmentation results (The second, third and fourth columns show the binary segmentation maps for CSF, GM and WM, respectively).

The proposed model is capable of catching more details than the baseline model.

4. Ablation Studies

Ablation study by comparing segmentation performance between different models in terms of DR.
Ablation study by comparing segmentation performance between different models in terms of 3D-MHD.
  • Model1: a 3D U-Net without short-range residual connections.
  • Model2: Model1 with short-range residual connections.
  • Model3: replaces the first up-sampling block in Model2 with the block in (d).
  • Model4: replaces both up-sampling blocks in Model2 with the block in (d).
  • Model5: replaces the bottom block in Model2 with the block in (c).

The results demonstrate how different global aggregation blocks in non-local U-Nets improve the performance.

Left & Middle: Different overlapping step sizes during inference. Right: Different patch sizes.
  • (Left) & (Middle) Different overlapping step sizes: The overlapping step sizes are set to 4, 8, 16 and 32, with the patch size fixed at 32³.
  • (Middle): For these step sizes, 11880, 1920, 387 and 80 patches, respectively, need to be processed during inference.
  • (Left): Obviously, 8 and 16 are good choices that achieve accurate and fast segmentation results.
  • (Right): Experiments are conducted with five different patch sizes: 16³, 24³, 32³, 40³, 48³. 32³ obtains the best performance and is selected as the default setting of the proposed model.
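As a sketch of what overlapping-patch inference looks like: the volume is covered with 32³ patches placed every `step` voxels, patch predictions are accumulated, and overlapping regions are averaged. The helper below is a simplified illustration with a hypothetical model interface, not the authors' code.

```python
import torch

def sliding_window_inference(model, volume, patch=32, step=8, n_classes=4):
    """volume: (C, D, H, W) tensor; returns averaged per-voxel class probabilities."""
    _, D, H, W = volume.shape
    probs = torch.zeros(n_classes, D, H, W)
    counts = torch.zeros(1, D, H, W)

    def starts(size):
        # Patch start positions every `step` voxels, clamped so the last patch fits.
        return sorted({min(s, size - patch) for s in range(0, size, step)})

    with torch.no_grad():
        for z in starts(D):
            for y in starts(H):
                for x in starts(W):
                    crop = volume[:, z:z+patch, y:y+patch, x:x+patch].unsqueeze(0)
                    out = model(crop).softmax(dim=1)[0]   # (n_classes, 32, 32, 32)
                    probs[:, z:z+patch, y:y+patch, x:x+patch] += out
                    counts[:, z:z+patch, y:y+patch, x:x+patch] += 1
    return probs / counts  # overlapping predictions are averaged
```

A smaller step size means more patches (e.g., 11880 at step 4 vs. 80 at step 32 in the figure), trading inference time for smoother, more accurate predictions.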

Reference

[2020 AAAI] [Non-local U-Net]
Non-local U-Nets for Biomedical Image Segmentation

