Brief Review — RefineU-Net: Segmentation with Attention

Consists of 3 Modules: EM, GRM & LRM

Sik-Ho Tsang
Feb 20, 2023

RefineU-Net: Improved U-Net with Progressive Global Feedbacks and Residual Attention Guided Local Refinement for Medical Image Segmentation,
RefineU-Net, by A*STAR,
2020 J. Pattern Recognition Letters, Over 20 Citations (Sik-Ho Tsang @ Medium)
Medical Image Analysis, Medical Imaging, Image Segmentation, U-Net

  • RefineU-Net is proposed, which consists of three modules: encoding module (EM), global refinement module (GRM) and local refinement module (LRM).
  • EM uses an ImageNet-pretrained VGG-16 as its backbone.
  • GRM progressively upsamples the top side output of EM and fuses the resulting upsampled features with the side outputs of EM at each resolution level.
  • LRM uses a residual attention gate (RAG) to generate discriminative attentive features, which are concatenated with the decoded features in the expansive path of U-Net.


  1. RefineU-Net
  2. Results

1. RefineU-Net

Overall structure of the proposed RefineU-Net.

1.1. Encoding Module (EM) (Blue)

  • An ImageNet-pretrained VGG-16 is used as the backbone of EM.

1.2. Global Refinement Module (GRM) (Yellow)

  • GRM upsamples the feature by a factor of 2 using a 4-by-4 transposed convolution layer, then fuses the resulting upsampled feature with the side output from the previous adjacent block in EM using an L2 normalization fusion approach:
  • where G_l represents the fused feature corresponding to the l-th side output of EM, and f_l represents the transposed convolution layer followed by the ReLU activation function. g_l normalizes an input x as:
  • where γ_l is a learnable scalar and ||.||_2 denotes the L2 norm of a vector.
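
The two formulas referred to above can be written out from the definitions in the bullets alone. This is a hedged reconstruction, not a copy of the paper's equations: E_l (the l-th side output of EM) and ⊕ (the fusion operator, assumed here to be channel-wise concatenation) are names introduced for illustration.

```latex
G_l = g_l(E_l) \oplus g_l\bigl(f_l(G_{l+1})\bigr),
\qquad
g_l(x) = \gamma_l \, \frac{x}{\lVert x \rVert_2}
```

Under this reading, both inputs are rescaled to a comparable magnitude before fusion, so neither the side output nor the upsampled feature dominates purely because of its scale.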

GRM generates the multi-level fused outputs G_l, which effectively summarize the global context and semantic information.
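
To make the L2 normalization fusion concrete, here is a minimal pure-Python sketch on a single pixel's channel vector. It assumes that the norm is taken across channels, that fusion means concatenation, and that γ is a fixed number (in the network it is learned); the function names are hypothetical and not taken from the paper.

```python
import math

def l2_normalize(x, gamma=1.0, eps=1e-12):
    """Rescale the channel vector x to unit L2 norm, then multiply by gamma."""
    norm = math.sqrt(sum(v * v for v in x))
    return [gamma * v / (norm + eps) for v in x]

def l2_norm_fuse(side, upsampled, gamma=1.0):
    """Normalize both inputs with the same gamma, then concatenate along channels.

    Concatenation is an assumption; the post only says the features are fused.
    """
    return l2_normalize(side, gamma) + l2_normalize(upsampled, gamma)

# Two features pointing the same way but differing 10x in magnitude.
side = [3.0, 4.0]          # L2 norm 5
upsampled = [30.0, 40.0]   # L2 norm 50
fused = l2_norm_fuse(side, upsampled)
print(fused)  # both halves end up close to [0.6, 0.8]
```

After normalization the two halves have the same scale, which is the point of normalizing before fusing features taken from different depths.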

1.3. Local Refinement Module (LRM) (Green)

The illustration of the proposed Residual Attention Gate (RAG).
  • G_l and D_{l+1} (l = 1, 2, 3, 4) first go through two 1-by-1 convolution layers.
  • Second, the decoded features are bilinearly upsampled to the same resolution as the output features of GRM.
  • Then, the resulting features from the two input paths are fused by L2 normalization using the function g_l.
  • A 3-by-3 convolution layer then brings the fused feature to the same resolution and depth as G_l.
  • The output then passes through an element-wise sigmoid to generate an attention map. G_l is multiplied element-wise by the attention map to give the residual attentive signal, which is added back to G_l to produce the final attentive feature. (This is the concept from Residual Attention Network.)
  • The final attentive feature from RAG is concatenated with the decoded features.
  • At the end of LRM, a 1-by-1 convolution layer and a sigmoid are appended to produce the binary segmentation map.

The local refinement module (LRM) modifies the original decoding path of U-Net by incorporating an attention mechanism to enhance local refinement.
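
The residual step of RAG can be sketched in a few lines of pure Python. This is a simplified illustration, assuming flat lists stand in for feature maps and that z is the pre-sigmoid output of the 3-by-3 convolution; the function names are hypothetical. A plain multiplicative gate is included only for contrast.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def residual_attention(g, z):
    """Residual gating: G + G * A, where A = sigmoid(z) element-wise."""
    attention = [sigmoid(v) for v in z]
    return [gv + gv * a for gv, a in zip(g, attention)]

def plain_gate(g, z):
    """Plain multiplicative gating, G * A, shown for comparison only."""
    return [gv * sigmoid(v) for gv, v in zip(g, z)]

g = [2.0, -1.0, 0.5]    # stand-in for G_l
z = [0.0, 10.0, -10.0]  # stand-in for the pre-sigmoid fused feature
residual = residual_attention(g, z)
plain = plain_gate(g, z)
print(residual)
print(plain)
```

Where the attention is close to 0 (the last element), the plain gate drives the feature toward 0, while the residual form keeps it close to its original value. Where the attention is close to 1, the residual form roughly doubles the feature.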

1.4. Loss Function

  • Two losses are used: the binary cross-entropy loss L_bce and the intersection-over-union loss L_iou.
  • The total loss L is:
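
As a minimal sketch of how the two terms can be combined, the code below assumes the total loss is the unweighted sum L_bce + L_iou and uses a common soft (differentiable) IoU formulation. Both choices are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels; pred values are probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)

def soft_iou_loss(pred, target, eps=1e-7):
    """1 - soft IoU, with intersection = sum(p*t) and union = sum(p + t - p*t)."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)

def total_loss(pred, target):
    """Assumed combination: plain sum of the two terms."""
    return bce_loss(pred, target) + soft_iou_loss(pred, target)

target = [1, 0, 1, 0]
good = [0.9, 0.1, 0.9, 0.1]  # mostly correct prediction
bad = [0.1, 0.9, 0.9, 0.1]   # first two pixels wrong
print(total_loss(good, target), total_loss(bad, target))
```

The cross-entropy term scores each pixel independently, while the IoU term scores the overlap of the whole predicted region, so the two penalize different kinds of error.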

2. Results

2.1. Datasets

An illustration of the example images and the ROI masks.
  • Two polyp segmentation datasets from the MICCAI 2015 polyp detection challenge [22,23] and two skin lesion segmentation datasets from ISBI2016 and ISBI2017 are used.

2.2. Ablation Study

The performance statistics (in %) for the ablation study on GRM and LRM.

The variant using both GRM and LRM performs best.

The performance statistics (in %) for the ablation study on the residual architecture of RAG.

RAG performs better than the Attention Gate (AG) used in Attention U-Net.

2.3. SOTA Comparisons

The PR curves plotted for the comparison with the benchmarking U-Net methods tested on (a) ETIS-LaribPolypDB; (b) CVC-ColonDB; (c) ISBI2016; (d) ISBI2017.
Qualitative segmentation results of the compared methods.
The performance statistics (in %) for the comparison with the benchmarking U-Net methods.
  • RefineU-Net produces the segmentation results closest to the ground truth.

To summarize, both quantitatively and qualitatively, the proposed RefineU-Net generally outperforms the competing state-of-the-art U-Net methods.

2.4. Visualizations

An illustration of the attentive regions learned by LRM.

The residual attention gate (RAG) contributes to the improvement of segmentation performance through local refinement.

The illustrations of multi-level attention maps: (a) the original test image; (b)–(f): the attention maps for the multi-level RAGs guided by G_i (i = 1, 2, …, 5); (g) the ground truth mask.

RefineU-Net progressively refines the ROI segmentation by alternately paying attention to the predicted ROI and the background regions.




