Reading: UNet 3+ — A Full-Scale Connected UNet (Medical Image Segmentation)

Outperforms UNet++, Attention UNet, UNet, PSPNet, DeepLabV2, DeepLabV3 and DeepLabV3+

Sik-Ho Tsang
Jul 7, 2020

In this story, UNet 3+, by Zhejiang University, Sir Run Run Shaw Hospital, Ritsumeikan University, and Zhejiang Lab, is briefly presented. UNet++ uses nested and dense skip connections, but it does not explore sufficient information from full scales. In UNet 3+, full-scale skip connections and deep supervision are used:

  • Full-scale skip connections: incorporate low-level details with high-level semantics from feature maps in different scales.
  • Full-scale deep supervision: learns hierarchical representations from the full-scale aggregated feature maps.
  • A hybrid loss function and a classification-guided module (CGM) are further proposed.

This is a paper in 2020 ICASSP. (Sik-Ho Tsang @ Medium)


  1. Full-Scale Skip Connection
  2. Full-scale Deep Supervision
  3. Experimental Results

1. Full-Scale Skip Connection

Left: UNet, Middle: UNet++, Right: UNet 3+
  • Both UNet with plain connections and UNet++ with nested and dense connections fall short of exploring sufficient information from full scales, and so fail to explicitly learn the position and boundary of an organ.
  • Each decoder layer in UNet 3+ incorporates both smaller- and same-scale feature maps from the encoder and larger-scale feature maps from the decoder, capturing fine-grained details and coarse-grained semantics in full scales.
Example of Full-Scale Skip Connection
  • To construct the feature map of 𝑋3De, similar to the UNet, the feature map from the same-scale encoder layer 𝑋3En is directly received in the decoder.
  • In contrast to the UNet, a set of inter encoder-decoder skip connections delivers the low-level detailed information from the smaller-scale encoder layers 𝑋1En and 𝑋2En, by applying a non-overlapping max pooling operation.
  • A chain of intra decoder skip connections transmits the high-level semantic information from the larger-scale decoder layers 𝑋4De and 𝑋5De, by utilizing bilinear interpolation.
  • Owing to the channel reduction, UNet 3+ has fewer parameters than UNet and UNet++. (There are mathematical proofs here, if interested, please feel free to read the paper.)
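To make the wiring concrete, here is a minimal sketch that lists which feature maps feed a given decoder stage and how each one is resized. The function name `full_scale_sources`, the five-stage depth, and the 64 channels per branch are assumptions for illustration, not details stated in this story.

```python
def full_scale_sources(i, depth=5, branch_channels=64):
    """List the inputs aggregated by decoder stage i (1 = finest scale).

    Returns (sources, concatenated_channels), where each source is a
    (feature_map_name, resize_operation, scale_factor) tuple.
    branch_channels=64 is an assumed per-branch width after channel reduction.
    """
    if not 1 <= i < depth:
        raise ValueError("aggregating decoder stages are 1..depth-1")
    sources = []
    for k in range(1, depth + 1):
        if k < i:
            # Smaller-scale (higher-resolution) encoder map: shrink it with
            # non-overlapping max pooling.
            sources.append((f"X{k}_En", "maxpool", 2 ** (i - k)))
        elif k == i:
            # Same-scale encoder map: passed through directly, as in UNet.
            sources.append((f"X{k}_En", "identity", 1))
        else:
            # Larger-scale (lower-resolution) decoder map: enlarge it with
            # bilinear interpolation.
            sources.append((f"X{k}_De", "bilinear_up", 2 ** (k - i)))
    return sources, branch_channels * len(sources)


sources, channels = full_scale_sources(3)
for name, op, factor in sources:
    print(name, op, factor)
print("concatenated channels:", channels)
```

For stage 3 this lists 𝑋1En and 𝑋2En (pooled by 4 and 2), 𝑋3En (unchanged), and 𝑋4De and 𝑋5De (upsampled by 2 and 4), which matches the example above.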

2. Full-scale Deep Supervision

Full-scale Deep Supervision with Classification-Guided Module (CGM).

2.1. Deep Supervision

  • UNet 3+ yields a side output from each decoder stage (Sup1 to Sup5), which is supervised by the ground truth.
  • To realize deep supervision, the last layer of each decoder stage is fed into a plain 3 × 3 convolution layer followed by a bilinear up-sampling and a sigmoid function.
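As a rough sketch of what the up-sampling step has to do, the snippet below computes how much each side output must be enlarged to match the ground truth. It assumes that each stage halves the spatial resolution and that the input is 320×320; the helper name `side_output_plan` is hypothetical.

```python
import math


def sigmoid(x):
    """Squash a logit to a probability, as applied to each side output."""
    return 1.0 / (1.0 + math.exp(-x))


def side_output_plan(input_size=320, depth=5):
    """Return (name, stage_size, upsample_factor) for Sup1..Sup{depth}.

    Assumes every stage halves the resolution of the previous one.
    """
    plan = []
    for i in range(1, depth + 1):
        stage_size = input_size // 2 ** (i - 1)
        # Factor needed so the side output matches the ground-truth size.
        up_factor = input_size // stage_size
        plan.append((f"Sup{i}", stage_size, up_factor))
    return plan


for name, size, factor in side_output_plan():
    print(name, size, factor)
```

Under these assumptions the deepest side output (Sup5) is the smallest map and needs the largest up-sampling factor, while Sup1 is already at full resolution.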

2.2. Loss Function

  • Multi-Scale Structural SIMilarity index (MS-SSIM) loss is used to assign higher weights to the fuzzy boundary (patch level).
  • Focal loss, which originated in RetinaNet, is used to deal with the class imbalance problem (pixel level).
  • Standard IoU loss is used (map level).
  • Thus, a hybrid loss is developed for segmentation in a three-level hierarchy (pixel-, patch- and map-level), which is able to capture both large-scale and fine structures with clear boundaries. It is the sum of the three terms: ℓseg = ℓfl + ℓms-ssim + ℓiou.
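Two of the three terms are simple enough to sketch in plain Python. The snippet below shows a binary focal loss and a soft IoU loss on flattened probability maps. The MS-SSIM term is left out because it needs multi-scale Gaussian filtering. The defaults `gamma=2.0` and `alpha=0.25` are common choices from the focal loss literature, not values stated in this story, and the function names are illustrative.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over flattened pixels (pixel-level term)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        p_t = p if t == 1 else 1.0 - p  # probability of the true class
        alpha_t = alpha if t == 1 else 1.0 - alpha
        # Easy pixels (p_t near 1) are down-weighted by (1 - p_t) ** gamma.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def soft_iou_loss(probs, targets, eps=1e-7):
    """1 - soft IoU over the whole map (map-level term)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    return 1.0 - (inter + eps) / (union + eps)


pred = [0.9, 0.2, 0.8, 0.1]
mask = [1, 0, 1, 0]
# Partial hybrid loss: the MS-SSIM term would be added here as well.
print(focal_loss(pred, mask) + soft_iou_loss(pred, mask))
```

With `gamma=0` the focal term reduces to an alpha-weighted cross entropy, which is a quick way to sanity-check an implementation.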

2.3. Classification-Guided Module (CGM)

  • False positives can appear in images that contain no organ.
  • This may be caused by noisy background information remaining in the shallower layers, leading to the phenomenon of over-segmentation.
  • To solve this problem, an extra classification task is added to predict whether the input image contains an organ or not.
  • As shown in the figure above, after passing a series of operations including dropout, convolution, maxpooling and sigmoid, a 2-dimensional tensor is produced from the deepest-level 𝑋5En, whose two entries represent the probabilities of with/without organs.
  • With the help of the argmax function, the 2-dimensional tensor is converted into a single output of {0,1}, which denotes with/without organs.
  • Subsequently, the single classification output is multiplied with the side segmentation output.
  • Binary cross entropy loss function is used to train the CGM.
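The gating step can be sketched in a few lines. This is a minimal illustration, assuming index 0 of the 2-dimensional tensor means "without organ" and index 1 means "with organ"; that ordering and the name `apply_cgm` are assumptions, not details given in this story.

```python
def apply_cgm(class_probs, side_output):
    """Gate a side segmentation map with the classification result.

    class_probs: [p_without_organ, p_with_organ] (assumed ordering).
    side_output: 2D list of per-pixel probabilities.
    """
    # argmax turns the two probabilities into a single 0/1 decision.
    has_organ = max(range(len(class_probs)), key=lambda c: class_probs[c])
    # Multiplying by 0 blanks the whole map; multiplying by 1 keeps it.
    return [[has_organ * v for v in row] for row in side_output]


side = [[0.7, 0.2], [0.1, 0.9]]
print(apply_cgm([0.9, 0.1], side))  # classified as "without organ"
print(apply_cgm([0.2, 0.8], side))  # classified as "with organ"
```

When the classifier decides there is no organ, every pixel of the side output is zeroed, which is how false positives in non-organ images are suppressed.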

3. Experimental Results

3.1. Datasets

  • The dataset for liver segmentation is obtained from the ISBI LiTS 2017 Challenge. It contains 131 contrast-enhanced 3D abdominal CT scans, of which 103 and 28 volumes are used for training and testing, respectively.
  • The spleen dataset, obtained from the hospital with ethics approval, contains 40 and 9 CT volumes for training and testing, respectively.
  • Images are cropped to 320×320.

3.2. Comparison with UNet and UNet++

Dice on Liver and Spleen Datasets
  • VGGNet and ResNet backbones are tested.
  • Even without deep supervision, UNet 3+ outperforms UNet and UNet++, obtaining average improvements of 2.7 and 1.6 points between the two backbones on the two datasets.
  • Combining UNet 3+ with full-scale deep supervision brings a further improvement of 0.4 points.
Purple areas: true positive (TP); Yellow areas: false negative (FN); Green areas: false positive (FP).
  • UNet 3+ not only accurately localizes organs but also produces coherent boundaries, even for small objects.
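For reference, the Dice score reported in these comparisons can be computed from the same TP, FP and FN counts that the figure colors show. The sketch below uses the standard definition Dice = 2TP / (2TP + FP + FN) on flattened binary masks; the function name and the convention of returning 1.0 for two empty masks are illustrative choices.

```python
def dice_from_masks(pred, truth):
    """Dice coefficient for two flattened binary masks (values 0 or 1)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    # Both masks empty: treat as a perfect match (a convention, not a rule).
    return 1.0 if denom == 0 else 2 * tp / denom


pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(dice_from_masks(pred, truth))  # one TP, one FP, one FN
```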

3.3. Comparison with the State of the Art

Dice on Liver and Spleen Datasets
  • All results are directly from single-model test without relying on any post-processing tools.
  • The proposed hybrid loss function greatly improves the performance by taking pixel-, patch- and map-level optimization into consideration.
  • Moreover, taking advantage of the classification-guided module (CGM), UNet 3+ avoids over-segmentation in complex backgrounds.
  • Finally, UNet 3+ outperforms Attention UNet, PSPNet, DeepLabV2, DeepLabV3 and DeepLabV3+.

It has been a long time since I last read a paper about biomedical image segmentation.

This is the 1st story this month!


