Brief Review — Breast Tumor Segmentation in Ultrasound Images Using Contextual-Information-Aware Deep Adversarial Learning Framework

cGAN+AC+CAW: Conditional GAN (cGAN), Atrous Convolution (AC), and Channel Attention with Weighting (CAW) Are Used

Sik-Ho Tsang
6 min read · Jan 29, 2023
Examples of Breast Ultrasound (BUS) Images

Breast Tumor Segmentation in Ultrasound Images Using Contextual-Information-Aware Deep Adversarial Learning Framework,
cGAN+AC+CAW, by Universitat Rovira i Virgili, Khalifa University of Science and Technology, Aswan University
2020 J. ESWA, Over 30 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation

  • A U-Net-like model is enhanced by using atrous convolution (AC) to capture spatial and scale context, and by using channel attention along with channel weighting (CAW) mechanisms to promote tumor-relevant features.
  • Additionally, conditional GAN (cGAN) is used for adversarial learning.
  • This paper appears to be the journal version of the 2019 arXiv paper cGAN+AC+CAW, but without the random forest used for the classification problem.


  1. cGAN+AC+CAW
  2. Results


1.1. Overall Architecture

The architecture of the proposed model, which consists of generator (G) and discriminator (D) networks.
  • The model consists of a generator network that extracts breast tumor relevant features, and a discriminator network that predicts if a label mask is a real or fake segmentation of the input BUS image.

1.2. Generator

  • The plain encoder-decoder structure is modified by inserting an AC block, as in DeepLab, between Conv3 and Conv4, and a CAW block between Conv7 and Dconv1. (Detailed diagrams of the AC and CAW blocks are shown later.)
  • Each layer in the encoder section is followed by batch normalization (BN) (except for Conv1) and Leaky ReLU with slope 0.2. The decoder section is a sequence of transposed-convolutional layers, each followed by batch normalization, Dropout with rate 0.5 (only in Dconv1, Dconv2 and Dconv3), and ReLU.
  • The filters of the convolutional and deconvolutional layers are defined by a kernel of 4 × 4, and a stride of 2.
  • Skip connections, as in U-Net, are employed between the corresponding layers in the encoder and decoder sections.
  • After the last decoding layer (Dconv7), the tanh activation function is used to generate a binary mask of the breast tumor.
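The effect of stacking stride-2 layers on the feature-map size can be checked with the standard convolution output-size formula. This is only a minimal sketch: the 128 × 128 input size and the padding of 1 are assumptions that this review does not state, chosen because they shrink the map to the 1 × 1 size that the CAW block is described as receiving after the last encoder layer.

```python
from typing import List


def conv_out_size(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one spatial axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size: int, num_layers: int = 7) -> List[int]:
    """Spatial size after each of the stride-2 encoder layers (Conv1..Conv7)."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes


# Assumed 128 x 128 input (hypothetical; not given in the review).
print(encoder_sizes(128))
```

Under these assumptions each encoder layer halves the spatial size, so seven layers are exactly enough to collapse a 128 × 128 input to a single 512-channel vector.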

1.3. Atrous Convolution (AC)

The architecture of the AC block with different rates of AC (r = 1, 6 and 9).
  • The first three convolutional layers have a kernel size of 3 × 3 and rates of 1, 6, and 9, respectively. The fourth convolutional layer has a kernel size of 1 × 1 followed by a global average pooling (GAP).
  • An up-sampling layer is employed after each branch and then all features are concatenated.

AC increases the receptive field, and thus accommodates the variable sizes and shapes of breast tumors. The AC block can also mitigate the loss of small tumor-relevant features caused by the consecutive downsampling layers.
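To make the receptive-field argument concrete, the following sketch computes the effective kernel size of a 3 × 3 atrous convolution for the three rates, and implements a single-channel atrous convolution in NumPy. It is an illustration only, not the authors' implementation; the function names and the zero "same" padding are assumptions.

```python
import numpy as np


def effective_kernel_size(kernel: int, rate: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel + (kernel - 1) * (rate - 1)


def atrous_conv2d(image: np.ndarray, kernel: np.ndarray, rate: int) -> np.ndarray:
    """Single-channel atrous convolution (cross-correlation) with zero 'same' padding.

    Assumes a square kernel with odd side length.
    """
    k = kernel.shape[0]
    pad = rate * (k - 1) // 2
    padded = np.pad(image, pad)
    height, width = image.shape
    out = np.zeros((height, width), dtype=float)
    for i in range(k):
        for j in range(k):
            # Each kernel tap reads the input shifted by a multiple of the rate.
            out += kernel[i, j] * padded[i * rate:i * rate + height,
                                         j * rate:j * rate + width]
    return out


for r in (1, 6, 9):
    print(f"rate {r}: effective kernel size {effective_kernel_size(3, r)}")

# A single bright pixel is spread to 9 positions spaced `rate` pixels apart.
impulse = np.zeros((32, 32))
impulse[16, 16] = 1.0
response = atrous_conv2d(impulse, np.ones((3, 3)), rate=6)
print(int(np.count_nonzero(response)))
```

With rates 1, 6 and 9, the same nine weights cover windows of 3, 13 and 19 pixels per side, which is how the block sees tumors of different sizes without adding parameters.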

1.4. CAW Block

The architecture of the CAW block.
  • The CAW block is an aggregation of a channel attention from DANet with a channel weighting module from SENet.
  • It has two branches: the channel attention process (top branch) and the channel weighting process (bottom branch). Since the CAW block is placed after the last encoder layer, the processed activation map has spatial dimensions (H × W) of 1 × 1: indeed, it is a vector of C = 512 scalars. Hence, the method works only on the channel feature space.
  • In brief, the channel attention part is similar to the idea in DANet or the Transformer, while the channel weighting part is similar to the idea in SENet. (Please feel free to read those stories.)

The CAW block dynamically promotes breast-tumor-relevant features and enriches the representational power of the highest-level features of the generator network.
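The two branches can be sketched in NumPy on a 512-channel vector. This is a rough illustration under several assumptions: the channel attention follows the formula in the DANet paper, the channel weighting follows the squeeze-and-excitation pattern of SENet, the random matrices `w1` and `w2` stand in for learned weights, the reduction ratio of 16 is hypothetical, and the two branch outputs are merged by element-wise sum, which this review does not specify.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """DANet-style channel attention on a feature map flattened to (C, N)."""
    energy = x @ x.T                      # (C, C) channel-to-channel affinities
    attention = softmax(energy, axis=-1)  # each row sums to 1
    return beta * (attention @ x) + x     # residual connection


def channel_weighting(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """SENet-style gating: squeeze, bottleneck, sigmoid, then rescale channels."""
    squeezed = x.mean(axis=1)                     # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)       # FC + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # FC + sigmoid, values in (0, 1)
    return x * gate[:, None]


rng = np.random.default_rng(0)
C, N, reduction = 512, 1, 16          # N = H * W = 1 after the last encoder layer
x = rng.standard_normal((C, N))
w1 = rng.standard_normal((C // reduction, C)) * 0.05   # stand-in for learned weights
w2 = rng.standard_normal((C, C // reduction)) * 0.05   # stand-in for learned weights

# Assumed aggregation: element-wise sum of the two branches.
out = channel_attention(x) + channel_weighting(x, w1, w2)
print(out.shape)
```

Because the spatial size is 1 × 1, both branches reduce to operations between channels only, which matches the remark that the method works only on the channel feature space.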

1.5. Discriminator

  • It comprises five convolutional layers with 4 × 4 kernels and a stride of 2, except for Conv4 and Conv5, where the stride is 1.
  • Batch normalization is used after Conv2 to Conv4. Leaky ReLU with slope 0.2 is the non-linear activation function used after Conv1 to Conv4, while the sigmoid function is used after Conv5.
  • The input of the discriminator is the concatenation of the BUS image and a binary mask.
  • The output of the discriminator is a 10 × 10 matrix with values ranging from 0.0 (completely fake) to 1.0 (real).

1.6. Loss Function

  • The loss function of the generator G comprises three terms: adversarial loss (binary cross entropy loss), L1-norm to boost the learning process, and SSIM loss to improve the shape of the boundaries of segmented masks:
  • where z is a random variable.
  • The loss function of the discriminator D is:
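The two losses described above can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's exact formulation: the weighting factors `lam` and `alpha` are hypothetical placeholders, the SSIM term is taken as 1 − SSIM, and SSIM is computed from whole-image statistics instead of the usual sliding window. `d_real` and `d_fake` stand for the 10 × 10 discriminator outputs on real and generated masks.

```python
import numpy as np

EPS = 1e-7


def bce(pred: np.ndarray, target: np.ndarray) -> float:
    """Binary cross entropy averaged over all discriminator output cells."""
    pred = np.clip(pred, EPS, 1.0 - EPS)
    return float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))


def l1_loss(mask_true: np.ndarray, mask_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(mask_true - mask_pred)))


def ssim_global(a: np.ndarray, b: np.ndarray, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """SSIM from whole-image statistics (a simplification); inputs in [0, 1]."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))


def generator_loss(d_fake, mask_true, mask_pred, lam=1.0, alpha=1.0) -> float:
    adversarial = bce(d_fake, np.ones_like(d_fake))  # G wants D to say "real"
    return (adversarial
            + lam * l1_loss(mask_true, mask_pred)
            + alpha * (1.0 - ssim_global(mask_true, mask_pred)))


def discriminator_loss(d_real, d_fake) -> float:
    # D should output 1 for (image, real mask) and 0 for (image, generated mask).
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))


mask = np.zeros((64, 64))
mask[20:40, 20:40] = 1.0
d_real = np.full((10, 10), 0.9)
d_fake = np.full((10, 10), 0.1)
print(generator_loss(d_fake, mask, mask), discriminator_loss(d_real, d_fake))
```

When the predicted mask equals the ground truth, the L1 and SSIM terms vanish and only the adversarial term remains, which shows how the three terms play separate roles.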

2. Results

2.1. Ablation Study

Analyzing different configurations of the proposed method with dataset A and dataset B.

Combining all components obtains the best results.

The performance of the proposed model with different combinations of loss functions.

The proposed combined loss obtains the highest Dice and IoU.

2.2. SOTA Comparisons

Comparison between the proposed model and six state-of-the-art methods in terms of accuracy, Dice, IoU, sensitivity and specificity, using datasets A and B.
  • Accuracy here should be the per-pixel accuracy rather than image-level classification accuracy.
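For reference, the five metrics can all be derived from per-pixel counts of true/false positives and negatives. The sketch below uses the standard definitions; the function name and the toy masks are illustrative assumptions, and it assumes both masks contain at least one tumor and one background pixel so that no denominator is zero.

```python
import numpy as np


def segmentation_metrics(pred, target) -> dict:
    """Per-pixel metrics for a pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))     # tumor predicted as tumor
    tn = int(np.sum(~pred & ~target))   # background predicted as background
    fp = int(np.sum(pred & ~target))    # background predicted as tumor
    fn = int(np.sum(~pred & target))    # tumor predicted as background
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy 4 x 4 example: the prediction covers the tumor plus two extra pixels.
target = np.zeros((4, 4), dtype=bool)
target[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True
metrics = segmentation_metrics(pred, target)
print(metrics)
```

Accuracy counts the (usually dominant) background pixels, so it tends to be high even for mediocre masks, whereas Dice and IoU only reward overlap on the tumor region.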

Using dataset A, the proposed method yielded Dice and IoU scores of 93.76% and 88.82%, respectively, outperforming the second-best segmentation method.

Using dataset B, the proposed method yielded Dice and IoU scores of 86.82% and 80.37%, respectively, which are 4% and 5% higher than the Dice and IoU scores of the second-best segmentation method.

2.3. Qualitative Results

Segmentation results of five models with the Dataset A.
Segmentation results of five models with the Dataset B.
  • Red: false negatives. Green: false positives.
  • U-Net provided proper segmentation, but it has a less accurate boundary around the tumor region.
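An error map with this colour coding can be built from a predicted mask and its ground truth as sketched below. The function name is hypothetical, and drawing true positives in white on a black background is an assumption; only the red and green coding comes from the figures.

```python
import numpy as np


def error_overlay(pred, target) -> np.ndarray:
    """RGB map: red = false negative, green = false positive, white = true positive."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)   # background stays black
    rgb[pred & target] = (255, 255, 255)   # true positives (assumed white)
    rgb[~pred & target] = (255, 0, 0)      # missed tumor pixels
    rgb[pred & ~target] = (0, 255, 0)      # background marked as tumor
    return rgb


# Toy example: the prediction is shifted one pixel to the right of the tumor.
target = np.zeros((4, 4), dtype=bool)
target[1:3, 0:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True
overlay = error_overlay(pred, target)
print(overlay.shape)
```

A shifted or poorly bounded prediction then shows up as a red band on one side of the tumor and a green band on the other.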

In turn, the proposed model yielded the best segmentation, with the most TP pixels and the fewest FP and FN pixels among the five tested methods. Further visualizations of segmentation results of the proposed method can be found at

Examples of incorrect tumor segmentation and localization results.

In some challenging images, as above, the tissues surrounding the tumors have poor contrast as well as ambiguous boundaries.

2.4. Inference Time

  • The execution (inference) time of each segmentation model is compared: FCN, U-Net, SegNet, ERFNet, DCGAN, and the proposed model achieve 35.15, 20.33, 17.71, 78.78, 18.27, and 19.62 frames per second (FPS), respectively.
  • Although FCN achieves 35.15 FPS, its IoU and Dice values with both BUS image datasets are much lower than those of the proposed model.
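FPS figures are sometimes easier to compare as per-image latency. The short sketch below converts the values quoted above, assuming they are listed in the same order as the model names.

```python
# FPS values as quoted above, assumed to be in the same order as the model names.
fps = {
    "FCN": 35.15,
    "U-Net": 20.33,
    "SegNet": 17.71,
    "ERFNet": 78.78,
    "DCGAN": 18.27,
    "Proposed": 19.62,
}


def latency_ms(frames_per_second: float) -> float:
    """Average time per image in milliseconds."""
    return 1000.0 / frames_per_second


for name, value in fps.items():
    print(f"{name}: {latency_ms(value):.1f} ms per image")
```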


