Reading: UNet 3+ — A Full-Scale Connected UNet (Medical Image Segmentation)

Outperforms UNet++, Attention UNet, UNet, PSPNet, DeepLabV2, DeepLabV3 and DeepLabV3+

Sik-Ho Tsang
5 min read · Jul 7, 2020

In this story, UNet 3+, by Zhejiang University, Sir Run Run Shaw Hospital, Ritsumeikan University, and Zhejiang Lab, is briefly presented. UNet++ uses nested and dense skip connections, but it does not explore sufficient information from full scales. In UNet 3+, full-scale skip connections and full-scale deep supervision are used:

  • Full-scale skip connections: incorporate low-level details with high-level semantics from feature maps at different scales.
  • Full-scale deep supervision: learns hierarchical representations from the full-scale aggregated feature maps.
  • A hybrid loss function and a classification-guided module (CGM) are further proposed.

This is a paper in 2020 ICASSP. (Sik-Ho Tsang @ Medium)

Outline

  1. Full-Scale Skip Connection
  2. Full-Scale Deep Supervision
  3. Experimental Results

1. Full-Scale Skip Connection

Left: UNet, Middle: UNet++, Right: UNet 3+
  • Both UNet with plain connections and UNet++ with nested and dense connections fall short of exploring sufficient information from full scales, failing to explicitly learn the position and boundary of an organ.
  • Each decoder layer in UNet 3+ incorporates both smaller- and same-scale feature maps from the encoder and larger-scale feature maps from the decoder, capturing fine-grained details and coarse-grained semantics at full scales.
Example of Full-Scale Skip Connection
  • To construct the feature map of 𝑋3De: similar to UNet, the feature map from the same-scale encoder layer 𝑋3En is received directly.
  • In contrast to UNet, a set of inter encoder-decoder skip connections delivers the low-level detailed information from the smaller-scale encoder layers 𝑋1En and 𝑋2En, by applying non-overlapping max pooling operations.
  • A chain of intra-decoder skip connections transmits the high-level semantic information from the larger-scale decoder layers 𝑋4De and 𝑋5De, by utilizing bilinear interpolation (see the sketch after this list).
  • Thanks to the channel reduction, UNet 3+ has fewer parameters than UNet and UNet++. (There are mathematical derivations in the paper; if interested, please feel free to read it.)
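To make the full-scale aggregation concrete, here is a minimal PyTorch sketch of how 𝑋3De could be built. This is my own reconstruction, not the authors' code: it assumes the standard UNet encoder widths (64 to 1024 channels) and the paper's reduction of every incoming map to 64 channels, so that each decoder stage carries 5 × 64 = 320 channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleDecoderBlock(nn.Module):
    """Sketch: builds the decoder map X3De from all five scales.

    Each of the five incoming feature maps is reduced to 64 channels by a
    3x3 conv, brought to the X3 resolution, concatenated (5 * 64 = 320
    channels), then fused by a 3x3 conv + BN + ReLU.
    """
    def __init__(self, ch=64):
        super().__init__()
        self.conv_e1 = nn.Conv2d(64,   ch, 3, padding=1)  # X1En: non-overlapping maxpool /4
        self.conv_e2 = nn.Conv2d(128,  ch, 3, padding=1)  # X2En: non-overlapping maxpool /2
        self.conv_e3 = nn.Conv2d(256,  ch, 3, padding=1)  # X3En: same scale, passed directly
        self.conv_d4 = nn.Conv2d(320,  ch, 3, padding=1)  # X4De: bilinear upsample x2
        self.conv_d5 = nn.Conv2d(1024, ch, 3, padding=1)  # X5De (= X5En): bilinear upsample x4
        self.fuse = nn.Sequential(
            nn.Conv2d(5 * ch, 5 * ch, 3, padding=1),
            nn.BatchNorm2d(5 * ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x1e, x2e, x3e, x4d, x5d):
        size = x3e.shape[2:]                               # target spatial resolution
        feats = [
            self.conv_e1(F.max_pool2d(x1e, 4)),           # low-level detail from encoder
            self.conv_e2(F.max_pool2d(x2e, 2)),
            self.conv_e3(x3e),
            self.conv_d4(F.interpolate(x4d, size=size, mode='bilinear', align_corners=True)),
            self.conv_d5(F.interpolate(x5d, size=size, mode='bilinear', align_corners=True)),
        ]
        return self.fuse(torch.cat(feats, dim=1))          # X3De: 320 channels
```

The same block pattern applies at every decoder stage; only the pooling/upsampling factors and input channel widths change.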

2. Full-Scale Deep Supervision

Full-scale Deep Supervision with Classification-Guided Module (CGM).

2.1. Deep Supervision

  • UNet 3+ yields a side output from each decoder stage (Sup1 to Sup5), which is supervised by the ground truth.
  • To realize deep supervision, the last layer of each decoder stage is fed into a plain 3 × 3 convolution layer, followed by bilinear up-sampling and a sigmoid function (a minimal sketch follows).
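A side-output head is simple enough to sketch directly. The following is an assumed implementation (the channel widths and up-sampling factors are mine, not taken from released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutput(nn.Module):
    """One deep-supervision head (Sup_i): 3x3 conv -> bilinear upsample -> sigmoid."""
    def __init__(self, in_ch, up_factor):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, 3, padding=1)  # 1-channel mask logits
        self.up_factor = up_factor                     # e.g. 16 for X5En, 8 for X4De, ...

    def forward(self, x):
        x = self.conv(x)
        x = F.interpolate(x, scale_factor=self.up_factor,
                          mode='bilinear', align_corners=True)
        return torch.sigmoid(x)                        # probability map, supervised by GT
```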

2.2. Loss Function

  • Multi-Scale Structural SIMilarity index (MS-SSIM) loss is used to assign higher weights to the fuzzy boundary.
  • Focal loss, which originated in RetinaNet, is used to deal with the class-imbalance problem.
  • Standard IoU loss is used.
  • Thus, a hybrid loss is developed for segmentation in a three-level hierarchy (pixel-, patch- and map-level), which is able to capture both large-scale and fine structures with clear boundaries:

ℓ_seg = ℓ_fl + ℓ_ms-ssim + ℓ_iou
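Here is a minimal PyTorch sketch of this hybrid loss, assuming sigmoid probability maps and the third-party pytorch_msssim package for the MS-SSIM term; hyperparameters such as γ are my choices, not the paper's:

```python
import torch
from pytorch_msssim import ms_ssim  # assumed third-party dependency

def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    """Pixel-level term: down-weights easy pixels (focal loss from RetinaNet)."""
    pred = pred.clamp(eps, 1 - eps)
    pt = torch.where(target > 0.5, pred, 1 - pred)  # probability of the true class
    return (-(1 - pt) ** gamma * pt.log()).mean()

def iou_loss(pred, target, eps=1e-7):
    """Map-level term: 1 - soft IoU over the whole prediction map."""
    inter = (pred * target).sum(dim=(2, 3))
    union = (pred + target - pred * target).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def hybrid_seg_loss(pred, target):
    """l_seg = l_fl + l_ms-ssim + l_iou; pred/target are (N, 1, H, W) in [0, 1].

    Note: MS-SSIM needs inputs larger than ~160 px per side with the default
    window size, which the paper's 320x320 crops satisfy.
    """
    l_fl = focal_loss(pred, target)
    l_msssim = 1 - ms_ssim(pred, target, data_range=1.0)  # patch-level term
    l_iou = iou_loss(pred, target)
    return l_fl + l_msssim + l_iou
```

In full-scale deep supervision, this loss would be applied to each side output Sup1 to Sup5 and the per-stage losses summed.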

2.3. Classification-Guided Module (CGM)

  • False positives can occur in non-organ images.
  • This may be caused by noisy information from the background remaining in the shallower layers, leading to the phenomenon of over-segmentation.
  • To solve this problem, an extra classification task is added to predict whether or not the input image contains the organ.
  • As shown in the figure above, after passing through a series of operations including dropout, convolution, max pooling and sigmoid, a 2-dimensional tensor is produced from the deepest-level 𝑋5En, each entry of which represents the probability of the image containing / not containing organs.
  • With the help of the argmax function, the 2-dimensional tensor is converted into a single output in {0, 1}, which denotes with/without organs.
  • Subsequently, this single classification output is multiplied with each side segmentation output (see the sketch after this list).
  • A binary cross-entropy loss function is used to train the CGM.
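The CGM is also easy to sketch. The following reconstruction is an assumption (layer choices such as the 1 × 1 convolution and global pooling are mine); note that argmax is non-differentiable, so the classifier learns only through its own binary cross-entropy term:

```python
import torch
import torch.nn as nn

class ClassificationGuidedModule(nn.Module):
    """Sketch of the CGM: dropout -> 1x1 conv -> global max pool -> sigmoid."""
    def __init__(self, in_ch=1024):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Dropout(0.5),
            nn.Conv2d(in_ch, 2, kernel_size=1),  # 2 classes: organ absent / present
            nn.AdaptiveMaxPool2d(1),
        )

    def forward(self, x5e, side_outputs):
        probs = torch.sigmoid(self.cls(x5e)).flatten(1)  # (N, 2) class probabilities
        # argmax -> single {0, 1} decision per image (0: no organ, 1: organ)
        gate = probs.argmax(dim=1, keepdim=True).float().view(-1, 1, 1, 1)
        # multiply the classification decision into every side segmentation output
        gated = [s * gate for s in side_outputs]
        return gated, probs  # train probs with binary cross-entropy vs. the image label
```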

3. Experimental Results

3.1. Datasets

  • The dataset for liver segmentation is obtained from the ISBI LiTS 2017 Challenge. It contains 131 contrast-enhanced 3D abdominal CT scans, of which 103 and 28 volumes are used for training and testing, respectively.
  • The spleen dataset, obtained from the hospital with ethics approval, contains 40 CT volumes for training and 9 for testing.
  • Images are cropped to 320×320.

3.2. Comparison with UNet and UNet++

Dice on Liver and Spleen Datasets
  • VGGNet and ResNet backbones are tested.
  • UNet 3+ without deep supervision already surpasses UNet and UNet++, with average improvements of 2.7 and 1.6 points across the two backbones on the two datasets.
  • UNet 3+ combined with full-scale deep supervision improves by a further 0.4 point.
Purple areas: true positive (TP); Yellow areas: false negative (FN); Green areas: false positive (FP).
  • UNet 3+ not only accurately localizes organs but also produces coherent boundaries, even for small objects.

3.3. Comparison with the State of the Art

Dice on Liver and Spleen Datasets
  • All results come directly from single-model tests without relying on any post-processing tools.
  • The proposed hybrid loss function greatly improves the performance by taking pixel-, patch- and map-level optimization into consideration.
  • Moreover, taking advantage of the classification-guided module (CGM), UNet 3+ avoids over-segmentation in complex backgrounds.
  • Finally, UNet 3+ outperforms Attention UNet, PSPNet, DeepLabV2, DeepLabV3 and DeepLabv3+.

It has been a long time since I last read a paper about biomedical image segmentation.

This is the 1st story this month!!!
