Review — DANet: Dual Attention Network for Scene Segmentation
Semantic Segmentation Using Transformer-Style Self-Attention
Dual Attention Network for Scene Segmentation,
DANet, by Chinese Academy of Sciences, JD.com, and University of Chinese Academy of Sciences,
2019 CVPR, Over 3300 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer
Outline
- Dual Attention Network (DANet)
- Position & Channel Attention Modules (PAM & CAM)
- Experimental Results
1. Dual Attention Network (DANet)
- A pretrained residual network with the dilated strategy (as in DeepLab or DilatedNet) is employed as the backbone.
- The downsampling operations are removed and dilated convolutions are employed in the last two ResNet blocks, thus enlarging the final feature map to 1/8 the size of the input image.
- A convolution layer (the grey block after ResNet in the figure) is first applied to reduce the channel dimension of the features.
- Then, the features are fed into two parallel attention modules, namely the position attention module and the channel attention module (more details in the next section), which can capture long-range contextual information.
- The first step is to generate a spatial attention matrix which models the spatial relationship between any two pixels of the features.
- Next, a matrix multiplication is performed between the attention matrix and the original features.
- Third, an element-wise sum is performed between the resulting matrix and the original features to obtain the final representations reflecting long-range contexts.
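Putting these steps together, below is a minimal PyTorch-style sketch of the head that sits on top of the backbone (the DANetHead name, the layer sizes, and the constructor arguments are assumptions for illustration; pam and cam stand for the two attention modules described in Section 2):

```python
import torch.nn as nn

class DANetHead(nn.Module):
    """Sketch: channel reduction, two parallel attention branches, element-wise sum fusion."""
    def __init__(self, in_channels, num_classes, pam, cam):
        super().__init__()
        inter = in_channels // 4
        # dimension-reduction convs (the grey blocks after the ResNet backbone)
        self.reduce_p = nn.Sequential(
            nn.Conv2d(in_channels, inter, 3, padding=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True))
        self.reduce_c = nn.Sequential(
            nn.Conv2d(in_channels, inter, 3, padding=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True))
        self.pam, self.cam = pam, cam          # position / channel attention modules
        self.classify = nn.Conv2d(inter, num_classes, 1)

    def forward(self, x):                      # x: 1/8-resolution backbone features
        p = self.pam(self.reduce_p(x))         # position-attention branch
        c = self.cam(self.reduce_c(x))         # channel-attention branch
        return self.classify(p + c)            # sum fusion, then per-pixel classification
```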
2. Position & Channel Attention Modules (PAM & CAM)
2.1. Position Attention Module (PAM)
- The process is similar to the self-attention mechanism in the Transformer.
- Given a local feature A of size C×H×W, it is first fed into convolution layers to generate two new feature maps B and C, each of size C×H×W. Then, they are reshaped to C×N, where N=H×W is the number of pixels.
- After that, a matrix multiplication is performed between the transpose of C and B, and a softmax layer is applied to calculate the spatial attention map S of size N×N (see the formulas after this list).
- Meanwhile, feature A is fed into a convolution layer to generate a new feature map D of size C×H×W, which is reshaped to C×N. Then, a matrix multiplication is performed between D and the transpose of S, and the result is reshaped to C×H×W.
- Finally, it is multiplied by a scale parameter α and an element-wise sum operation is performed with the feature A to obtain the final output E of size C×H×W.
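Paraphrasing the paper's two equations, where s_ji measures the i-th position's impact on the j-th position:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}, \qquad E_j = \alpha \sum_{i=1}^{N} \left(s_{ji} D_i\right) + A_j$$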
PAM selectively aggregates the feature at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances.
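As a concrete illustration, here is a minimal PyTorch sketch of PAM following the description above (class and variable names are mine; as in common implementations, B and C use a reduced channel dimension, here C/8):

```python
import torch
import torch.nn as nn

class PositionAttentionModule(nn.Module):
    """Sketch of PAM: spatial self-attention over all H*W positions."""
    def __init__(self, in_channels):
        super().__init__()
        self.query_conv = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # -> B
        self.key_conv   = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # -> C
        self.value_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)       # -> D
        self.alpha = nn.Parameter(torch.zeros(1))   # scale parameter, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, A):                            # A: n x C x H x W
        n, c, h, w = A.size()
        N = h * w
        B = self.query_conv(A).view(n, -1, N).permute(0, 2, 1)   # n x N x C'
        C = self.key_conv(A).view(n, -1, N)                      # n x C' x N
        S = self.softmax(torch.bmm(B, C))                        # n x N x N attention map
        D = self.value_conv(A).view(n, -1, N)                    # n x C x N
        out = torch.bmm(D, S.permute(0, 2, 1)).view(n, c, h, w)  # weighted sum over positions
        return self.alpha * out + A                              # E = alpha * (S-weighted D) + A
```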
2.2. Channel Attention Module (CAM)
- The process is similar to PAM.
- But different from PAM, the original feature A is directly reshaped to C×N and multiplied by its transposed version; a softmax is then applied to generate the channel attention map X of size C×C.
- Then, the result is scaled by a scale parameter β and an element-wise sum operation is performed with A to obtain the final output E (see the formulas below).
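Paraphrasing the corresponding equations, where x_ji measures the i-th channel's impact on the j-th channel:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}, \qquad E_j = \beta \sum_{i=1}^{C} \left(x_{ji} A_i\right) + A_j$$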
CAM selectively emphasizes interdependent channel maps by integrating associated features among all channel maps.
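Again for illustration, a minimal PyTorch sketch of CAM could look like this (names are mine, and the sketch follows the equations literally; released implementations may differ in small numerical details):

```python
import torch
import torch.nn as nn

class ChannelAttentionModule(nn.Module):
    """Sketch of CAM: self-attention across the C channel maps."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))     # scale parameter, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, A):                            # A: n x C x H x W
        n, c, h, w = A.size()
        A_flat = A.view(n, c, -1)                                      # n x C x N
        X = self.softmax(torch.bmm(A_flat, A_flat.permute(0, 2, 1)))   # n x C x C attention map
        out = torch.bmm(X, A_flat).view(n, c, h, w)                    # weighted sum over channels
        return self.beta * out + A                                     # E = beta * (X-weighted A) + A
```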
3. Experimental Results
3.1. Ablation Study
- Compared with the baseline FCN (ResNet-50), employing PAM yields 75.74% Mean IoU, a 5.71% improvement.
- Meanwhile, employing CAM individually outperforms the baseline by 4.25%.
When integrating the two attention modules together, the performance further improves to 76.34%.
When a deeper pre-trained network (ResNet-101) is used with two attention modules, the segmentation performance is significantly improved over the baseline model by 5.03%.
The proposed attention modules improve performance significantly: DANet-50 exceeds the baseline by 3.3%. When adopting the deeper ResNet-101, the model further achieves a Mean IoU of 80.4%.
- DA: Data augmentation with random scaling.
- Multi-Grid: A hierarchy of grids of different sizes (4, 8, 16) is employed in the last ResNet block.
- MS: The segmentation probability maps are averaged over 8 image scales {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.2} at inference time (a small sketch of this averaging is given below).
Finally, with these strategies, the multi-scale segmentation map fusion further improves the performance to 81.50%, which outperforms the well-known DeepLabv3 (79.30% on the Cityscapes val set) by 2.20%.
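For illustration, a minimal sketch of such multi-scale averaging might look like the following (the multi_scale_inference helper is hypothetical, not from the authors' code):

```python
import torch.nn.functional as F

def multi_scale_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.2)):
    """Average class-probability maps predicted at several input scales."""
    n, c, h, w = image.shape
    prob_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        logits = model(scaled)                                     # n x num_classes x h' x w'
        logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
        prob_sum = prob_sum + F.softmax(logits, dim=1)             # accumulate probabilities
    return prob_sum / len(scales)                                  # averaged probability map
```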
3.2. SOTA Comparisons
DANet outperforms existing approaches with a dominant advantage. In particular, DANet outperforms PSANet by a large margin with the same ResNet-101 backbone. Moreover, it also surpasses DenseASPP [25].
DANet is on par with PSPNet and EncNet with the same backbone.
The baseline (Dilated FCN-50) yields Mean IoU 44.3%. DANet-50 boosts the performance to 50.1%.
- Furthermore, with the deeper pretrained ResNet-101, DANet achieves a Mean IoU of 52.6%, which outperforms previous methods by a large margin.
The above table again shows that DANet captures long-range contextual information more effectively and learns better feature representations for scene segmentation.
3.3. Visualization & Analysis
- With PAM, some details and object boundaries are clearer, such as the ‘pole’ in the first row and the ‘sidewalk’ in the second row.
- Selective fusion over local features enhances the discrimination of details.
- With CAM, some misclassified categories are now correctly classified, such as the ‘bus’ in the first and third rows.
- The selective integration among channel maps helps to capture context information. Semantic consistency is noticeably improved.
- The responses of specific semantic classes become more noticeable after the channel attention module is applied.
- For example, the 11th channel map responds to the ‘car’ class in all three examples, and the 4th channel map corresponds to the ‘vegetation’ class, which benefits the segmentation of these two scene categories.
Reference
[2019 CVPR] [DANet] Dual Attention Network for Scene Segmentation