Brief Review — DCSAU-Net: Deeper and more Compact Split-Attention U-Net

DCSAU-Net, U-Net With PFC Strategy and CSA Block

Mar 30, 2023

DCSAU-Net: A Deeper and More Compact Split-Attention U-Net for Medical Image Segmentation,
DCSAU-Net, by University of Lincoln, and Zhejiang Gongshang University,
2023 Elsevier J. Computers in Biology and Medicine (Sik-Ho Tsang @ Medium)
Biomedical Image Segmentation
2015 … 2021 [Expanded U-Net] [3-D RU-Net] [nnU-Net] [TransUNet] [CoTr] [TransBTS] [Swin-Unet] [Swin UNETR] [RCU-Net] [IBA-U-Net] [PRDNet] [Up-Net] [SK-Unet] 2022 [UNETR] [Half-UNet] [BUSIS] [RCA-IUNet]

A Deeper and more Compact Split-Attention U-shape Network (DCSAU-Net) is proposed, which efficiently utilises low-level and high-level semantic information based on two modules: primary feature conservation and compact split-attention block.

Outline

DCSAU-Net
Results

1. DCSAU-Net

1.1. Overall Architecture

**DCSAU-Net with PFC strategy and CSA block.**

The encoder of DCSAU-Net first uses PFC strategy to extract low-level semantic information from the input images.
The CSA block applies multi-path feature groups with a different number of convolutions and the attention mechanism.

Each block is followed by a 2×2 max pooling with stride 2 to perform downsampling.
The decoder subnetwork starts with an upsampling operator and recovers the original size of the input image step by step.
Skip connections concatenate the upsampled feature maps with the feature maps from the corresponding encoder layer, which mixes low-level and high-level semantic information to generate a precise mask.
Finally, a 1×1 convolution followed by a sigmoid or softmax layer outputs the binary or multi-class segmentation mask.
Dice loss is used as the loss function.
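To make the loss concrete, here is a minimal NumPy sketch of a soft Dice loss for a binary mask. The function name `soft_dice_loss` and the smoothing constant `eps` are illustrative assumptions; the exact formulation and reduction used in the paper may differ.

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*|P∩G| / (|P| + |G|), computed on probabilities.

    `eps` is an assumed smoothing term that avoids division by zero.
    """
    probs = np.asarray(probs, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = np.sum(probs * target)
    denom = np.sum(probs) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

# A perfect prediction gives a loss near 0; an inverted one gives a loss near 1.
mask = np.array([[0, 1], [1, 1]])
print(soft_dice_loss(mask, mask))
print(soft_dice_loss(1 - mask, mask))
```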

1.2. PFC (Primary Feature Conservation) Strategy

**Comparing our PFC strategy with U-Net, Stem block [44] and ResUNet++ [27] designs used to extract the low-level semantic information from the input images.**

(a) U-Net: uses two 3×3 convolutions at the beginning for low-level feature extraction.
(b) Stem Block [44]: uses three 3×3 convolutions to obtain the same receptive field as a 7×7 convolution while reducing the number of parameters.
(c) ResUNet++ [27]: uses three 3×3 convolutions with a skip connection to mitigate the potential impact of vanishing gradients.
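The receptive-field claim for the Stem block in (b) can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions, equal input and output channel counts, and no bias terms; the helper names and the channel count are hypothetical.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_weights(kernel_size, channels):
    """Weights in one k×k convolution with `channels` in and out, no bias."""
    return kernel_size * kernel_size * channels * channels

C = 64  # illustrative channel count, not taken from the paper
print(stacked_receptive_field([3, 3, 3]))  # 7
print(3 * conv_weights(3, C))              # 110592
print(conv_weights(7, C))                  # 200704
```

Under these assumptions, three 3×3 layers cover a 7×7 window with 27·C² weights instead of 49·C².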

(d) PFC: A new primary feature conservation (PFC) strategy is proposed. Its main refinement is a depthwise separable convolution, as in MobileNetV1, consisting of a 7×7 depthwise convolution followed by a 1×1 pointwise convolution.
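To see why a depthwise separable layer stays cheap even with a 7×7 kernel, the weight counts can be compared directly. This is a back-of-the-envelope sketch assuming no bias terms and illustrative channel counts; it does not reproduce the full PFC block.

```python
def standard_conv_weights(k, c_in, c_out):
    """Weights in a regular k×k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    """Weights in a k×k depthwise conv followed by a 1×1 pointwise conv (no bias)."""
    depthwise = k * k * c_in   # one k×k filter per input channel
    pointwise = c_in * c_out   # 1×1 convolution that mixes channels
    return depthwise + pointwise

c_in, c_out = 64, 64  # illustrative values, not taken from the paper
for k in (3, 5, 7):
    print(k, standard_conv_weights(k, c_in, c_out),
          depthwise_separable_weights(k, c_in, c_out))
```

With these numbers, moving the depthwise kernel from 3×3 to 7×7 adds only (49 − 9)·64 = 2560 weights, so the larger receptive field costs little in parameters.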

1.3. CSA (Compact Split-Attention) Block

ResNeSt uses large channel-split groups for feature extraction. In this paper, only 2 groups (N=2) are used to reduce the number of parameters.

Both groups involve one 1×1 convolution followed by one 3×3 convolution.
To improve the representation across channels, the output feature maps of the other group (𝐹2) are summed with the result of the first group (𝐹1) and passed through another 3×3 convolution, which receives semantic information from both split groups and expands the receptive field of the network. The summation of 𝐹1 and 𝐹2 (the summation in the middle) is:

The channel-wise statistics generated by global average pooling (GAP) collect global spatial information:

Channel-wise soft attention is then used to aggregate a weighted fusion, represented by the cardinal group representation, in which a split-weighted combination captures crucial information in the feature maps. The 𝑐th channel of the feature maps is calculated as:

where 𝑎𝑖 is a (soft) assignment weight given by:

Here G_i^c indicates the weight of the global spatial information 𝑆 for the 𝑐th channel and is computed using two 1×1 convolutions with BatchNorm and ReLU activation.
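The four equations referred to above are images in the original post and are missing from this text. As a hedged reconstruction from the surrounding definitions, following the ResNeSt split-attention formulation, they would read as below. The fused map symbol \hat{F}, the feature-map size H×W, and F_{i,c} for the 𝑐th channel of F_i are notation assumed here.

$$
\hat{F} = \sum_{i=1}^{N} F_i = F_1 + F_2 \qquad (N = 2)
$$

$$
S_c = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} \hat{F}_c(h, w)
$$

$$
V_c = \sum_{i=1}^{N} a_i(c)\, F_{i,c}
$$

$$
a_i(c) = \frac{\exp\left(G_i^c(S)\right)}{\sum_{j=1}^{N} \exp\left(G_j^c(S)\right)}
$$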
Finally, the full CSA block follows a standard residual design (as in ResNet), in which the output 𝑌 is computed with a skip connection: 𝑌 = 𝑉 + 𝑋.
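The attention and fusion steps can be sketched in NumPy for N=2 groups. This is an illustration under stated assumptions, not the authors' implementation: the convolutions that produce F1 and F2 are omitted, BatchNorm is left out, the two 1×1 convolutions of G are written as dense matrices `w1` and `w2` with hypothetical shapes, and X is assumed to have the same shape as V.

```python
import numpy as np

def split_attention(f1, f2, w1, w2):
    """Fuse two feature groups of shape (C, H, W) with channel-wise soft attention.

    w1: (C_mid, C) and w2: (2*C, C_mid) stand in for the two 1×1 convolutions
    (assumed shapes; BatchNorm omitted for brevity).
    """
    c = f1.shape[0]
    fused = f1 + f2                               # sum of the two groups
    s = fused.mean(axis=(1, 2))                   # global average pooling -> (C,)
    hidden = np.maximum(w1 @ s, 0.0)              # first 1×1 conv + ReLU
    logits = (w2 @ hidden).reshape(2, c)          # one logit per group and channel
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=0, keepdims=True)             # softmax across the 2 groups
    v = a[0][:, None, None] * f1 + a[1][:, None, None] * f2
    return v, a

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
f1, f2 = rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W))
w1, w2 = rng.normal(size=(4, C)), rng.normal(size=(2 * C, 4))
x = rng.normal(size=(C, H, W))                    # block input for the skip connection
v, a = split_attention(f1, f2, w1, w2)
y = v + x                                         # Y = V + X
print(y.shape, a.sum(axis=0))
```

Because the weights for each channel sum to 1 across the two groups, every value in V is a convex combination of the corresponding values in F1 and F2.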

2. Results

2.1. SOTA Comparisons

**Details of the medical segmentation datasets used in our experiments.**

Five datasets are used for training and evaluation.
The Dice score (DSC) is measured, alongside mIoU, recall, and accuracy.
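For reference, DSC and IoU on a single pair of binary masks can be computed as below. This is a generic sketch with a hypothetical function name; how the paper averages over images or classes to obtain mIoU is not covered in this review.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (DSC, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Two empty masks are treated as a perfect match (an assumed convention).
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

For a single pair of masks the two are tied by DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU.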

CVC-ClinicDB: In Table 2, DCSAU-Net achieves a DSC of 0.916 and a mIoU of 0.861, which outperforms DoubleU-Net by 2.0% in DSC and 2.5% in mIoU. In particular, the proposed model provides a significant improvement over the two recent Transformer-based architectures: the mIoU of DCSAU-Net is 6.2% and 10.7% higher than TransUNet and LeViT-UNet, respectively.
SegPC-2021: In Table 3, compared with other SOTA models, DCSAU-Net displays the best performance in all defined metrics. Specifically, the proposed method produces a mIoU of 0.8048, a rise of 3.6% over UNet++, and improves DSC by 2.8% over the DoubleU-Net architecture.
2018 Data Science Bowl: In Table 4, DCSAU-Net achieves a DSC of 0.914, which is 1.9% higher than TransUNet, and a mIoU of 0.850, which is 2.5% higher than UNet 3+.

ISIC-2018: In Table 6, DCSAU-Net has an increase of 2.4% over LeViT-UNet in this metric, and 1.8% over UNet 3+ in DSC. The proposed model achieves a recall of 0.922 and an accuracy of 0.960, which is better than the other baseline methods. A high recall score is also more favorable in clinical applications.
BraTS 2021: In Table 7, DCSAU-Net achieves a DSC of 0.788 and a mIoU of 0.703, which are 1.7% and 2.1% higher than ResUNet++, respectively.

2.2. Ablation Study

**Ablation study of the DCSAU-Net architecture.**

Although U-Net has a shorter inference time than DCSAU-Net, the proposed approach uses far fewer parameters for the same number of output feature channels and still has an acceptable inference time, which makes it more suitable for deployment on machines with limited memory.

**Different kernel sizes in the PFC block.**

The impact of depthwise convolution with different kernel sizes is studied. The 7×7 kernel obtains the best performance.

2.3. Visualizations

**Qualitative comparison results between DCSAU-Net and other SOTA models on challenging images of five different medical segmentation datasets.**

From the qualitative results, the segmentation masks generated by the proposed model capture more accurate foreground information from low-quality images, such as those with incomplete staining or obscurity.