Review — DANet: Dual Attention Network for Scene Segmentation

Semantic Segmentation Using Transformer-Style Self-Attention

  • Dual Attention Network (DANet) is proposed to adaptively integrate local features with their global dependencies.
  • Specifically, two types of attention modules are appended on top of a dilated FCN; they model semantic interdependencies in the spatial and channel dimensions, respectively.

Outline

  1. Dual Attention Network (DANet)
  2. Position & Channel Attention Modules (PAM & CAM)
  3. Experimental Results

1. Dual Attention Network (DANet)

An overview of the Dual Attention Network (DANet)
  • A pretrained residual network with the dilated strategy (as in DeepLab or DilatedNet) is employed as the backbone.
  • The downsampling operations are removed and dilated convolutions are employed in the last two ResNet blocks, so the final feature map is enlarged to 1/8 of the input image size.
  • A convolution layer (grey block after the ResNet) is first applied to reduce the feature dimension.
  • Then, the features are fed into two parallel attention modules, namely the position attention module and the channel attention module (more details in the next section), which capture long-range contextual information.
  1. The first step is to generate a spatial attention matrix which models the spatial relationship between any two pixels of the features.
  2. Next, a matrix multiplication is performed between the attention matrix and the original features.
  3. Third, an element-wise sum operation is performed between the resulting matrix and the original features to obtain the final representations reflecting long-range contexts (a sketch of the full head follows this list).
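
To make the data flow concrete, here is a rough PyTorch sketch of the dual-attention head described above (the PAM and CAM modules themselves are sketched in the next section). The layer widths and the exact placement of the dimension-reduction and fusion convolutions are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class DANetHead(nn.Module):
    """Rough sketch of the dual-attention head: reduce channels, run the
    position and channel attention modules in parallel, fuse by element-wise
    sum, then predict per-pixel classes."""
    def __init__(self, in_channels, num_classes, pam, cam):
        super().__init__()
        mid = in_channels // 4  # dimension-reduction conv (grey block after ResNet); width is an assumption
        self.reduce_p = nn.Sequential(
            nn.Conv2d(in_channels, mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.reduce_c = nn.Sequential(
            nn.Conv2d(in_channels, mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.pam, self.cam = pam, cam  # attention modules (sketched in the next section)
        self.classify = nn.Conv2d(mid, num_classes, 1)

    def forward(self, feat):                 # feat: backbone features at 1/8 resolution
        p = self.pam(self.reduce_p(feat))    # spatial long-range context
        c = self.cam(self.reduce_c(feat))    # channel long-range context
        return self.classify(p + c)          # sum-fusion, then per-pixel prediction
```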

2. Position & Channel Attention Modules (PAM & CAM)

2.1. Position Attention Module (PAM)

Position Attention Module
  • The process is similar to the self-attention layer in the Transformer.
  • Given a local feature A of size C×H×W, it is first fed into convolution layers to generate two new feature maps B and C, each of size C×H×W. Then, they are reshaped to C×N, where N=H×W is the number of pixels.
  • After that, a matrix multiplication is performed between the transpose of C and B, and a softmax layer is applied to calculate the spatial attention map S of size N×N: s_ji = exp(B_i · C_j) / Σ_i exp(B_i · C_j), where s_ji measures the i-th position's impact on the j-th position.
  • Meanwhile, feature A is fed into a convolution layer to generate a new feature map D of size C×H×W, which is reshaped to C×N. Then, a matrix multiplication is performed between D and the transpose of S, and the result is reshaped to C×H×W.
  • Finally, it is multiplied by a scale parameter α and an element-wise sum operation is performed with the features A to obtain the final output E of size C×H×W: E_j = α Σ_i (s_ji · D_i) + A_j, where α is initialized to 0 and gradually learns to assign more weight (a sketch follows this list).
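
Below is a minimal PyTorch sketch of PAM following the description above. The B, C and D branches are kept at C channels to match the text (implementations often reduce the B/C branches, e.g. to C/8), and the zero-initialized α follows the paper.

```python
import torch
import torch.nn as nn

class PositionAttentionModule(nn.Module):
    """Sketch of PAM: self-attention over an N×N spatial affinity map (N = H*W)."""
    def __init__(self, channels):
        super().__init__()
        # B, C, D branches; kept at C channels to match the text above.
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))  # scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                                        # a: (batch, C, H, W)
        batch, c, h, w = a.size()
        n = h * w
        b = self.conv_b(a).view(batch, c, n).permute(0, 2, 1)    # (batch, N, C)
        c_feat = self.conv_c(a).view(batch, c, n)                # (batch, C, N)
        s = self.softmax(torch.bmm(b, c_feat))                   # attention map S: (batch, N, N)
        d = self.conv_d(a).view(batch, c, n)                     # (batch, C, N)
        out = torch.bmm(d, s.permute(0, 2, 1)).view(batch, c, h, w)
        return self.alpha * out + a                              # E = alpha * out + A
```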

2.2. Channel Attention Module (CAM)

Channel Attention Module
  • The process is similar to PAM.
  • Unlike PAM, the original feature A is directly reshaped to C×N and multiplied by its transpose, and a softmax is applied to generate the channel attention map X of size C×C: x_ji = exp(A_i · A_j) / Σ_i exp(A_i · A_j).
  • Then, a matrix multiplication is performed between X and the reshaped A, the result is scaled by a parameter β, and an element-wise sum with A is performed to obtain the final output E: E_j = β Σ_i (x_ji · A_i) + A_j, where β is initialized to 0 (a sketch follows this list).
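
A matching PyTorch sketch of CAM, again following the equations above (plain softmax over the C×C affinity, zero-initialized β); this is an illustrative sketch rather than the official implementation.

```python
import torch
import torch.nn as nn

class ChannelAttentionModule(nn.Module):
    """Sketch of CAM: self-attention over a C×C channel affinity map,
    computed directly from the input features (no extra convolutions)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                         # a: (batch, C, H, W)
        batch, c, h, w = a.size()
        a_flat = a.view(batch, c, -1)             # reshape A to (batch, C, N)
        x = self.softmax(torch.bmm(a_flat, a_flat.permute(0, 2, 1)))  # X: (batch, C, C)
        out = torch.bmm(x, a_flat).view(batch, c, h, w)               # weighted channel maps
        return self.beta * out + a                                    # E = beta * out + A
```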

3. Experimental Results

3.1. Ablation Study

Ablation study on Cityscapes val set.
  • Compared with the baseline FCN (ResNet-50), employing PAM yields a result of 75.74% in Mean IoU, which brings 5.71% improvement.
  • Meanwhile, employing CAM individually outperforms the baseline by 4.25%.
Ablation study on PASCAL VOC 2012 val set.
Performance comparison between different strategies on Cityscapes val set.
  1. DA: Data augmentation with random scaling.
  2. Multi-Grid: A hierarchy of grids of different sizes (4, 8, 16) is employed in the last ResNet block.
  3. MS: The segmentation probability maps are averaged from 8 image scales {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.2} for inference (a sketch of this averaging follows the list).
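
As an illustration of the MS strategy, here is a minimal sketch of scale-averaged inference; it assumes a hypothetical `model` that returns per-class logits, and it omits details such as flip augmentation that are not specified above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_inference(model, image,
                          scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.2)):
    """Average class probability maps over several image scales (MS)."""
    _, _, h, w = image.shape
    prob_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s,
                               mode='bilinear', align_corners=False)
        logits = model(scaled)                                   # per-class logits
        logits = F.interpolate(logits, size=(h, w),
                               mode='bilinear', align_corners=False)
        prob_sum = prob_sum + F.softmax(logits, dim=1)           # accumulate probabilities
    return (prob_sum / len(scales)).argmax(dim=1)                # final label map
```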

3.2. SOTA Comparisons

Per-class results on Cityscapes testing set.
Segmentation results on PASCAL VOC 2012 testing set.
Segmentation results on PASCAL Context testing set.
  • Furthermore, with the deeper pretrained network ResNet-101, DANet achieves 52.6% Mean IoU, outperforming previous methods by a large margin.
Segmentation results on COCO Stuff testing set.

3.3. Visualization & Analysis

Visualization results of PAM on Cityscapes val set.
  • With PAM, some details and object boundaries are clearer, such as the ‘pole’ in the first row and the ‘sidewalk’ in the second row.
  • Selective fusion over local features enhances the discrimination of details.
Visualization results of CAM on Cityscapes val set.
  • With CAM, some misclassified categories are now correctly classified, such as the ‘bus’ in the first and third rows.
  • The selective integration among channel maps helps to capture context information, and semantic consistency is clearly improved.
Visualization results of attention modules on Cityscapes val set.
  • The responses of specific semantic classes become more noticeable after enhancement by the channel attention module.
  • For example, the 11th channel map responds to the ‘car’ class in all three examples, and the 4th channel map corresponds to the ‘vegetation’ class, which benefits the segmentation of these two categories.

Reference

[2019 CVPR] [DANet]
Dual Attention Network for Scene Segmentation

