Review — DANet: Dual Attention Network for Scene Segmentation

Semantic Segmentation Using Transformer Self Attention

Sik-Ho Tsang
Nov 24, 2022

Dual Attention Network for Scene Segmentation,
DANet, by Chinese Academy of Sciences, JD.com, and University of Chinese Academy of Sciences,
2019 CVPR, Over 3300 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer

  • Dual Attention Network (DANet) is proposed to adaptively integrate local features with their global dependencies.
  • Specifically, two types of attention modules are appended on top of Dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively.

Outline

  1. Dual Attention Network (DANet)
  2. Position & Channel Attention Modules (PAM & CAM)
  3. Experimental Results

1. Dual Attention Network (DANet)

An overview of the Dual Attention Network (DANet)
  • A pretrained residual network with the Dilated strategy (DeepLab or DilatedNet) is employed as the backbone.
  • The downsampling operations are removed and dilated convolutions are employed in the last two ResNet blocks, enlarging the final feature map to 1/8 of the input image size.
  • A convolution layer (the grey block after ResNet) is first applied to reduce the feature dimension.
  • Then, the features are fed into two parallel attention modules, the spatial (position) attention module and the channel attention module (more details in the next section), which capture long-range contextual information. Taking the spatial attention module as an example:
  1. First, a spatial attention matrix is generated, which models the spatial relationship between any two pixels of the features.
  2. Next, a matrix multiplication is performed between the attention matrix and the original features.
  3. Finally, an element-wise sum is performed between the multiplied result and the original features to obtain the final representations reflecting long-range contexts.
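
To get a feel for the sizes involved, here is a minimal sketch of the arithmetic. The 768×768 input size and the helper name `attention_sizes` are assumptions chosen for illustration; they are not taken from this review. It shows how large the spatial attention matrix becomes when the feature map is kept at 1/8 of the input instead of the usual 1/32 of an unmodified ResNet.

```python
def attention_sizes(height, width, output_stride=8):
    """Return (feature height, feature width, pixel count N, entries in the N x N attention matrix)."""
    h, w = height // output_stride, width // output_stride
    n = h * w
    return h, w, n, n * n

# Hypothetical 768 x 768 input, dilated backbone (features at 1/8 of the input).
print(attention_sizes(768, 768, output_stride=8))   # (96, 96, 9216, 84934656)

# Same input with the original downsampling kept (features at 1/32 of the input).
print(attention_sizes(768, 768, output_stride=32))  # (24, 24, 576, 331776)
```

The attention matrix grows with the square of the number of pixels, so keeping the 1/8 resolution gives finer features at a noticeably higher memory cost.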

2. Position & Channel Attention Modules (PAM & CAM)

2.1. Position Attention Module (PAM)

Position Attention Module
  • The process is similar to the self-attention layer in the Transformer.
  • Given a local feature A of size C×H×W, it is first fed into convolution layers to generate two new feature maps B and C, each of size C×H×W. Then, they are reshaped to C×N, where N=H×W is the number of pixels.
  • After that, a matrix multiplication is performed between the transpose of C and B, and a softmax layer is applied to calculate the spatial attention map S of size N×N, whose entry s_ji measures the impact of position i on position j.
  • Meanwhile, feature A is fed into a convolution layer to generate a new feature map D of size C×H×W, which is reshaped to C×N. Then, a matrix multiplication is performed between D and the transpose of S, and the result is reshaped to C×H×W.
  • Finally, the result is multiplied by a scale parameter α, and an element-wise sum is performed with the features A to obtain the final output E of size C×H×W.
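
Written out, the steps above correspond to the following pair of equations. This is a reconstruction from the description, where B_i, C_j, D_i and A_j denote the feature vectors at positions i and j (columns of the reshaped C×N matrices); the index convention is an assumption and is best checked against the paper.

```latex
s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)},
\qquad
E_j = \alpha \sum_{i=1}^{N} s_{ji} D_i + A_j
```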

PAM selectively aggregates the feature at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances.
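
The PAM steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the convolutions are assumed to be 1×1 and are stood in for by plain C×C matrices (`W_b`, `W_c`, `W_d`, hypothetical placeholders without bias), and a single feature map is processed instead of a batch.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(A, W_b, W_c, W_d, alpha):
    """PAM forward pass on one C x H x W feature map.

    W_b, W_c, W_d are C x C matrices standing in for 1x1 convolutions
    (an assumption made for this sketch).
    """
    C, H, W = A.shape
    A_flat = A.reshape(C, H * W)        # C x N
    B = W_b @ A_flat                    # C x N
    C_feat = W_c @ A_flat               # C x N (named C_feat to avoid clashing with the channel count C)
    D = W_d @ A_flat                    # C x N
    S = softmax(C_feat.T @ B, axis=1)   # N x N, row j holds the weights of all positions for position j
    E = alpha * (D @ S.T) + A_flat      # weighted sum over all positions plus the residual A
    return E.reshape(C, H, W), S

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))      # C=4, H=3, W=5, so N=15
W_b, W_c, W_d = (rng.standard_normal((4, 4)) for _ in range(3))
E, S = position_attention(A, W_b, W_c, W_d, alpha=0.5)
print(E.shape, S.shape)                 # (4, 3, 5) (15, 15)
```

With `alpha=0` the module reduces to the identity mapping, so the scale parameter controls how much global context is mixed into the original features.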

2.2. Channel Attention Module (CAM)

Channel Attention Module
  • The process is similar to PAM.
  • Unlike PAM, however, the original feature A is directly reshaped to C×N and multiplied by its own transpose, and a softmax is then applied to generate the channel attention map X of size C×C.
  • Then, X is applied to the reshaped A by matrix multiplication, the result is reshaped to C×H×W and scaled by a scale parameter β, and an element-wise sum is performed with A to obtain the final output E.
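
In equation form, the channel attention steps above can be reconstructed as follows. Here A_i and A_j denote individual channel maps (rows of the reshaped C×N matrix); as with PAM, the index convention is an assumption based on the description and is best checked against the paper.

```latex
x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)},
\qquad
E_j = \beta \sum_{i=1}^{C} x_{ji} A_i + A_j
```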

CAM selectively emphasizes interdependent channel maps by integrating associated features among all channel maps.
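
A matching NumPy sketch for CAM is shown below. It follows the description in this review literally (softmax directly over the channel-by-channel product); any additional normalization tricks used in released code are not modeled here, and the function name `channel_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(A, beta):
    """CAM forward pass on one C x H x W feature map (no convolutions involved)."""
    C, H, W = A.shape
    A_flat = A.reshape(C, H * W)               # C x N
    X = softmax(A_flat @ A_flat.T, axis=1)     # C x C, row j holds the weights of all channels for channel j
    E = beta * (X @ A_flat) + A_flat           # weighted sum over all channels plus the residual A
    return E.reshape(C, H, W), X

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))             # C=4, H=3, W=5
E, X = channel_attention(A, beta=0.5)
print(E.shape, X.shape)                        # (4, 3, 5) (4, 4)
```

Note that the attention map here is only C×C, so CAM is much cheaper in memory than the N×N map of PAM.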

3. Experimental Results

3.1. Ablation Study

Ablation study on Cityscapes val set.
  • Compared with the baseline FCN (ResNet-50), employing PAM yields 75.74% Mean IoU, a 5.71% improvement.
  • Meanwhile, employing CAM individually outperforms the baseline by 4.25%.

When integrating the two attention modules together, the performance further improves to 76.34%.
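
The ResNet-50 numbers quoted above imply a few values that are not stated directly. The short calculation below derives them from the figures in this review only (variable names are just for illustration).

```python
pam_miou = 75.74    # PAM only, Mean IoU (%)
pam_gain = 5.71     # PAM improvement over the baseline
cam_gain = 4.25     # CAM improvement over the baseline
dual_miou = 76.34   # PAM + CAM, Mean IoU (%)

baseline = round(pam_miou - pam_gain, 2)    # implied baseline Mean IoU
cam_miou = round(baseline + cam_gain, 2)    # implied CAM-only Mean IoU
dual_gain = round(dual_miou - baseline, 2)  # implied gain of the full model

print(baseline, cam_miou, dual_gain)        # 70.03 74.28 6.31
```

So the two modules together add 6.31 points over the baseline, less than the sum of their individual gains (9.96), which suggests the context they capture partly overlaps.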

When a deeper pre-trained network (ResNet-101) is used with two attention modules, the segmentation performance is significantly improved over the baseline model by 5.03%.

Ablation study on PASCAL VOC 2012 val set.

The proposed attention modules improve performance significantly: DANet-50 exceeds the baseline by 3.3%. With the deeper ResNet-101 backbone, the model further achieves a Mean IoU of 80.4%.

Performance comparison between different strategies on Cityscapes val set.
  1. DA: Data augmentation with random scaling.
  2. Multi-Grid: A hierarchy of grids of different sizes (4, 8, 16) is employed in the last ResNet block.
  3. MS: The segmentation probability maps are averaged from 8 image scales {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.2} for inference.
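
The MS strategy can be sketched as follows. This is an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the trained network, and nearest-neighbour resizing is used to keep the example dependency-free, whereas real pipelines typically use bilinear interpolation.

```python
import numpy as np

SCALES = (0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.2)

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an array shaped (..., H, W)."""
    h, w = x.shape[-2:]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[..., rows[:, None], cols[None, :]]

def multi_scale_predict(image, predict_probs, scales=SCALES):
    """Average class-probability maps predicted at several image scales.

    image: array of shape (3, H, W).
    predict_probs: callable mapping (3, h, w) -> (K, h, w) class probabilities.
    """
    _, H, W = image.shape
    total = None
    for s in scales:
        h, w = max(1, round(H * s)), max(1, round(W * s))
        probs = predict_probs(resize_nearest(image, h, w))
        probs = resize_nearest(probs, H, W)      # back to the original resolution
        total = probs if total is None else total + probs
    return total / len(scales)

def toy_model(img):
    """Hypothetical two-class 'network': probabilities derived from mean brightness."""
    p = 1.0 / (1.0 + np.exp(-img.mean(axis=0)))
    return np.stack([p, 1.0 - p])

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 8, 12))
avg = multi_scale_predict(image, toy_model)
print(avg.shape)                 # (2, 8, 12)
labels = avg.argmax(axis=0)      # final per-pixel prediction
```

Averaging probabilities rather than hard labels lets confident predictions at one scale outweigh uncertain ones at another.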

Finally, combined with the other strategies, segmentation map fusion further improves the performance to 81.50%, which outperforms the well-known DeepLabv3 (79.30% on the Cityscapes val set) by 2.20%.

3.2. SOTA Comparisons

Per-class results on Cityscapes testing set.

DANet outperforms existing approaches by a clear margin. In particular, DANet outperforms PSANet by a large margin with the same ResNet-101 backbone. Moreover, it also surpasses DenseASPP [25].

Segmentation results on PASCAL VOC 2012 testing set.

DANet is on par with PSPNet and EncNet with the same backbone.

Segmentation results on PASCAL Context testing set.

The baseline (Dilated FCN-50) yields Mean IoU 44.3%. DANet-50 boosts the performance to 50.1%.

  • Furthermore, with a deeper pretrained network, ResNet-101, DANet achieves a Mean IoU of 52.6%, outperforming previous methods by a large margin.
Segmentation results on COCO Stuff testing set.

The above table again shows that DANet could capture long-range contextual information more effectively and learn better feature representation in scene segmentation.

3.3. Visualization & Analysis

Visualization results of PAM on Cityscapes val set.
  • With PAM, some details and object boundaries are clearer, such as the ‘pole’ in the first row and the ‘sidewalk’ in the second row.
  • Selective fusion over local features enhances the discrimination of details.
Visualization results of CAM on Cityscapes val set.
  • With CAM, some misclassified categories are now correctly classified, such as the ‘bus’ in the first and third rows.
  • The selective integration among channel maps helps to capture context information. Semantic consistency is clearly improved.
Visualization results of attention modules on Cityscapes val set.
  • The response to specific semantics becomes noticeable after enhancement by the channel attention module.
  • For example, the 11th channel map responds to the ‘car’ class in all three examples, and the 4th channel map responds to the ‘vegetation’ class, which benefits the segmentation of these two scene categories.

Reference

[2019 CVPR] [DANet]
Dual Attention Network for Scene Segmentation

