Review — SAUNet: Shape Attentive U-Net for Interpretable Medical Image Segmentation

Shape Attentive U-Net (SAUNet), Using Gated Convolutional Layer (GCL) from Gated-SCNN

Sik-Ho Tsang
5 min read · Jan 10, 2023

SAUNet: Shape Attentive U-Net for Interpretable Medical Image Segmentation,
SAUNet, by Peter Munk Cardiac Center, University of Toronto, Vector Institute, and University of Waterloo,
2020 MICCAI, Over 70 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net

  • Shape is generally a more meaningful feature than image texture alone.
  • A secondary shape stream is proposed that captures rich shape-dependent information in parallel with the regular texture stream.


  1. Shape Attentive U-Net (SAUNet)
  2. Results

1. Shape Attentive U-Net (SAUNet)

The proposed Shape Attentive U-Net (SAUNet).

1.1. Overall Architecture

  • The proposed model is composed of two main streams: the shape stream, which processes boundary information, and the texture stream, which is a U-Net whose encoder is replaced with dense blocks from DenseNet.
  • The shape stream is composed of gated convolutional layers and residual layers.
  • The gated convolutional layers are used to fuse texture and shape information while the residual layers are used to fine-tune the shape features.
  • The Canny edges from the original image are used as a supervision signal to train the shape stream.

1.2. Gated Convolutional Layer (GCL) from Gated-SCNN

Shape Stream
  • The Gated Convolutional Layer (GCL) originates from Gated-SCNN.
  • Let C1×1(x) denote the normalized 1×1 convolution on x, and let R(x) denote the residual block from ResNet, which is composed of two normalized 3×3 convolutions with a skip connection.
  • The gated convolutional layer computes an attention map of boundaries, αl, using information from both the shape stream and the texture stream:
        αl = σ(C1×1(Sl || C1×1(Tt)))
  • where Tt and Sl are the texture stream and shape stream feature maps respectively, and σ and || denote sigmoid and channel-wise concatenation respectively.
  • Then, the output of the gated convolutional layer, Ŝl, is the input shape stream feature map Sl multiplied element-wise (⊙) with αl:
        Ŝl = Sl ⊙ αl
  • The feature map of the next layer of the shape stream is computed as:
        S(l+1) = R(Ŝl)
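
The gating step can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the normalization inside C1×1 and the residual block R are left out, both feature maps are assumed to already share the same spatial size, and the names `conv1x1`, `gated_conv_layer`, `w_tex` and `w_gate` are made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) map, i.e. a per-pixel linear map
    over channels. weight has shape (C_out, C_in). Normalization omitted."""
    return np.einsum("oc,chw->ohw", weight, x)

def gated_conv_layer(shape_feat, texture_feat, w_tex, w_gate):
    """shape_feat: (Cs, H, W); texture_feat: (Ct, H, W), same H and W assumed."""
    tex = conv1x1(texture_feat, w_tex)                    # squeeze texture to 1 channel
    stacked = np.concatenate([shape_feat, tex], axis=0)   # channel-wise concatenation
    alpha = sigmoid(conv1x1(stacked, w_gate))             # attention map, (1, H, W)
    gated = shape_feat * alpha                            # element-wise gating, broadcast over channels
    return gated, alpha

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 8, 8))     # hypothetical shape stream features
T = rng.normal(size=(16, 8, 8))    # hypothetical texture stream features
w_tex = rng.normal(size=(1, 16))   # random stand-ins for learned weights
w_gate = rng.normal(size=(1, 5))   # 4 shape channels + 1 texture channel
S_hat, alpha = gated_conv_layer(S, T, w_tex, w_gate)
print(S_hat.shape, alpha.shape)
```

In a full model, the gated output would then pass through the residual block to give the next shape stream feature map.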

Ledge is the binary cross entropy loss between the ground-truth class boundaries and the class boundaries predicted by the shape stream. With this loss, one objective of the model is to learn the shapes of the classes correctly.
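
As a rough illustration of this edge loss, the sketch below derives a binary boundary map from a label mask by marking pixels whose 4-neighbours carry a different label, then scores a predicted edge map with binary cross entropy. The neighbour-difference rule and the helper names `boundaries_from_mask` and `bce_loss` are assumptions for illustration, not the paper's preprocessing.

```python
import numpy as np

def boundaries_from_mask(mask):
    """Mark a pixel as boundary if any 4-neighbour has a different label."""
    edge = np.zeros(mask.shape, dtype=bool)
    diff_v = mask[1:, :] != mask[:-1, :]   # label change between vertical neighbours
    diff_h = mask[:, 1:] != mask[:, :-1]   # label change between horizontal neighbours
    edge[1:, :] |= diff_v
    edge[:-1, :] |= diff_v
    edge[:, 1:] |= diff_h
    edge[:, :-1] |= diff_h
    return edge.astype(np.float64)

def bce_loss(pred, target, eps=1e-7):
    """Binary cross entropy averaged over all pixels; pred holds probabilities."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

mask = np.zeros((6, 6), dtype=int)
mask[2:4, 2:4] = 1                          # a toy 2x2 foreground object
edges = boundaries_from_mask(mask)
good = bce_loss(edges, edges)               # prediction matches the boundaries
bad = bce_loss(1.0 - edges, edges)          # prediction is exactly inverted
print(good < bad)
```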

1.3. Dual Attention Decoder Block

Dual Attention Decoder Block
  • The proposed dual attention decoder block comprises two new components, applied after the standard normalized 3×3 convolution on the concatenated feature maps.

1.3.1. Spatial Attention Path

  • First, the input feature maps go through a normalized 1×1 convolution that reduces the number of channels to C/2, followed by another 1×1 convolution that reduces the number of channels to 1.
  • A sigmoid is then applied to obtain F′s. F′s is then stacked channel-wise C times to obtain Fs.
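
A minimal NumPy sketch of this path, assuming channels-first arrays and leaving out normalization and any intermediate activation. The function and weight names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution as a per-pixel linear map over channels."""
    return np.einsum("oc,chw->ohw", weight, x)

def spatial_attention(feat, w_reduce, w_single):
    """feat: (C, H, W). Returns a (C, H, W) map whose channels are identical."""
    c = feat.shape[0]
    reduced = conv1x1(feat, w_reduce)         # C -> C/2 channels
    single = conv1x1(reduced, w_single)       # C/2 -> 1 channel
    fs_single = sigmoid(single)               # single-channel attention in [0, 1]
    return np.repeat(fs_single, c, axis=0)    # stack channel-wise C times

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 5, 5))             # hypothetical decoder features, C = 8
w_reduce = rng.normal(size=(4, 8))            # random stand-ins for learned weights
w_single = rng.normal(size=(1, 4))
fs = spatial_attention(feat, w_reduce, w_single)
print(fs.shape)
```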

1.3.2. Channel-wise Attention Path

  • The channel-wise attention path consists of a squeeze-and-excitation module, as in SENet, that produces a scale coefficient in [0, 1] for each channel from the skip connection.
  • Each channel from the skip connection feature map is then scaled by its respective coefficient to obtain Fc.
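
The squeeze-and-excitation idea can be sketched as global average pooling followed by a small two-layer bottleneck. The reduction ratio of 4, the ReLU in the middle, and all names below are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C/r, C); w2: (C, C/r)."""
    squeezed = feat.mean(axis=(1, 2))           # squeeze: one value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)     # excitation bottleneck with ReLU
    scale = sigmoid(w2 @ hidden)                # one coefficient in [0, 1] per channel
    return feat * scale[:, None, None], scale   # rescale every channel

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 5, 5))               # hypothetical skip connection features
w1 = rng.normal(size=(2, 8))                    # random stand-ins, reduction ratio 4
w2 = rng.normal(size=(8, 2))
fc, scale = channel_attention(feat, w1, w2)
print(fc.shape, scale.shape)
```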

1.3.3. Fusion

  • Then, Fs and Fc are fused as follows:
        F = Fc ⊙ (Fs + 1)
  • where ⊙ denotes element-wise multiplication, and the +1 lets the attention amplify features rather than zero them out.
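
A tiny numeric example shows the effect of the +1, using made-up values for the two attention outputs.

```python
import numpy as np

def fuse(fc, fs):
    """Scale the channel-attended features by (spatial attention + 1)."""
    return fc * (fs + 1.0)

fc = np.array([[[2.0, 2.0]]])   # channel-attended features, shape (1, 1, 2)
fs = np.array([[[0.0, 1.0]]])   # spatial attention: 0 at one pixel, 1 at the other
out = fuse(fc, fs)
plain = fc * fs                 # the same fusion without the +1
print(out)    # attention 0 keeps the feature at 2.0, attention 1 doubles it to 4.0
print(plain)  # without the +1, the first feature would be zeroed out
```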

1.4. Dual-Task Loss Function

  • For image segmentation, the cross entropy loss LCE is calculated as the average cross entropy over all pixels.
  • Dice loss LDice is another common loss function for image segmentation, as it measures the overlap and similarity between two sets.
  • The total loss is the weighted sum of 3 losses:
        Ltotal = λ1·LCE + λ2·LDice + λ3·Ledge
  • where λ1=λ2=λ3=1 is found to work well.
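
The sketch below shows one common way to write these losses in NumPy. It is not necessarily the paper's exact formulation: in particular, the Dice denominator here uses plain sums, and the smoothing constant `eps` and the function names are assumptions.

```python
import numpy as np

def cross_entropy_loss(probs, onehot, eps=1e-7):
    """probs, onehot: (K, H, W). Mean over pixels of -sum_k y_k * log(p_k)."""
    log_p = np.log(np.clip(probs, eps, 1.0))
    return float(-np.mean(np.sum(onehot * log_p, axis=0)))

def dice_loss(probs, onehot, eps=1e-7):
    """1 - Dice overlap between soft predictions and one-hot targets."""
    inter = np.sum(probs * onehot)
    denom = np.sum(probs) + np.sum(onehot)
    return float(1.0 - 2.0 * inter / (denom + eps))

def total_loss(l_ce, l_dice, l_edge, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses; all weights default to 1."""
    l1, l2, l3 = lambdas
    return l1 * l_ce + l2 * l_dice + l3 * l_edge

labels = np.array([[0, 1], [1, 0]])
onehot = np.stack([labels == 0, labels == 1]).astype(np.float64)   # (2, 2, 2)
uniform = np.full_like(onehot, 0.5)                                # maximally unsure prediction

ce_perfect = cross_entropy_loss(onehot, onehot)
ce_uniform = cross_entropy_loss(uniform, onehot)
dice_perfect = dice_loss(onehot, onehot)
dice_uniform = dice_loss(uniform, onehot)
print(ce_uniform, dice_uniform)
```

With these definitions, a perfect prediction drives both segmentation losses to (approximately) zero, while the uniform prediction gives a cross entropy of log 2 and a Dice loss of about 0.5.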

2. Results

2.1. SUN09

Test set Dice scores for SUN09.

SAUNet outperforms other SOTA approaches.

2.2. AC17

AC17 test set results.

Again, SAUNet outperforms other SOTA approaches. Notably, the performance in the right ventricle class is significantly better than previous works, and this is perhaps due to the help of the shape stream learning the irregular shape of the right ventricle.

2.3. Ablation Study

The Dice scores in percentage of our model with and without the proposed shape stream evaluated on AC17 validation set split.
mIoU scores in percentage of training on SUN09 and testing on our AC17 validation set split.

The shape stream substantially improves segmentation performance.
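
For reference, the two metrics reported in these tables can be computed from label maps roughly as follows. This is a generic sketch with hypothetical function names and a toy 2×2 example, not the evaluation code behind the reported numbers.

```python
import numpy as np

def dice_score(pred, target, cls):
    """Dice overlap for one class between two integer label maps."""
    p, t = pred == cls, target == cls
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0                       # class absent in both maps
    return float(2.0 * np.logical_and(p, t).sum() / denom)

def mean_iou(pred, target, classes):
    """Mean intersection-over-union over the classes present in either map."""
    ious = []
    for c in classes:
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])        # one background pixel mislabeled as class 1
dice_1 = dice_score(pred, target, cls=1)
miou = mean_iou(pred, target, classes=[0, 1])
print(100.0 * dice_1, 100.0 * miou)      # reported as percentages
```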

2.4. Visualizations

Top: original MRI image. Middle: model without shape stream prediction. Bottom: model with shape stream prediction.

With shape stream prediction, the segmentation is more accurate.

By using Canny edges as an additional supervision signal and adding a shape stream that lets the network focus more on shape, SAUNet obtains better results. It is a kind of multi-task learning: one task learns the segmentation shape, the other learns the segmentation mask.


