Review — DANet: Dual Attention Network for Scene Segmentation
Semantic Segmentation Using Transformer Self Attention
--
Dual Attention Network for Scene Segmentation,
DANet, by Chinese Academy of Sciences, JD.com, and University of Chinese Academy of Sciences,
2019 CVPR, Over 3300 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer
Outline
- Dual Attention Network (DANet)
- Position & Channel Attention Modules (PAM & CAM)
- Experimental Results
1. Dual Attention Network (DANet)
- A pretrained residual network with the dilated strategy (as in DeepLab or DilatedNet) is employed as the backbone.
- The downsampling operations are removed and dilated convolutions are employed in the last two ResNet blocks, enlarging the final feature map to 1/8 of the input image size.
- A convolution layer (the grey block after ResNet) is first applied to reduce the feature dimension.
- Then, the features are fed into two parallel attention modules, namely the position attention module and the channel attention module (more details in the next section), which capture long-range contextual information; the overall two-branch head is sketched after this list.
- The first step is to generate a spatial attention matrix which models the spatial relationship between any two pixels of the features.
- Next, a matrix multiplication is performed between the attention matrix and the original features.
- Third, an element-wise sum operation is performed between the resulting matrix and the original features to obtain the final representations reflecting long-range contexts.
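To make the wiring concrete, below is a minimal PyTorch sketch of this two-branch head. It assumes `PAM` and `CAM` modules as sketched in Section 2 (any `nn.Module` with matching shapes works for a shape check); the channel widths, the 3×3 convolutions, and the dilated-backbone hint in the comments are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DANetHead(nn.Module):
    """Sketch of the DANet head: dimension-reduction convs, parallel position/channel
    attention branches, and element-wise sum fusion before pixel-wise prediction."""
    def __init__(self, in_channels, num_classes, pam, cam):
        super().__init__()
        inter = in_channels // 4
        # dimension-reduction convolutions (the grey blocks after the backbone)
        self.reduce_p = nn.Sequential(
            nn.Conv2d(in_channels, inter, 3, padding=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True))
        self.reduce_c = nn.Sequential(
            nn.Conv2d(in_channels, inter, 3, padding=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True))
        self.pam, self.cam = pam, cam        # attention modules from Section 2
        # per-branch convolutions after attention, then fusion by element-wise sum
        self.conv_p = nn.Conv2d(inter, inter, 3, padding=1, bias=False)
        self.conv_c = nn.Conv2d(inter, inter, 3, padding=1, bias=False)
        self.classifier = nn.Conv2d(inter, num_classes, 1)

    def forward(self, x):                    # x: backbone features at 1/8 resolution
        p = self.conv_p(self.pam(self.reduce_p(x)))
        c = self.conv_c(self.cam(self.reduce_c(x)))
        return self.classifier(p + c)        # sum fusion, then per-pixel class logits

# Illustrative wiring: a dilated ResNet backbone keeping 1/8 resolution can be built with
# torchvision.models.resnet50(replace_stride_with_dilation=[False, True, True]);
# head = DANetHead(2048, num_classes=19, pam=nn.Identity(), cam=nn.Identity())  # shape check
```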
2. Position & Channel Attention Modules (PAM & CAM)
2.1. Position Attention Module (PAM)
- The process is similar to the self-attention layer in the Transformer.
- Given a local feature A of size C×H×W, it is first fed into convolution layers to generate two new feature maps B and C, each of size C×H×W. Then, they are reshaped to C×N, where N=H×W is the number of pixels.
- After that, a matrix multiplication is performed between the transpose of C and B, and a softmax layer is applied to calculate the spatial attention map S of size N×N:

s_ji = exp(B_i · C_j) / Σ_{i=1..N} exp(B_i · C_j),

where s_ji measures the i-th position's impact on the j-th position.
- Meanwhile, feature A is fed into a convolution layer to generate a new feature map D of size C×H×W, which is reshaped to C×N. Then, a matrix multiplication is performed between D and the transpose of S, and the result is reshaped to C×H×W.
- Finally, it is multiplied by a scale parameter α and an element-wise sum operation is performed with the features A to obtain the final output E of size C×H×W:

E_j = α · Σ_{i=1..N} (s_ji · D_i) + A_j,

where α is initialized as 0 and gradually learns to assign more weight.
PAM selectively aggregates the feature at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances.
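For reference, here is a minimal PyTorch sketch of PAM in the notation above. The 1×1 convolutions producing B, C, and D and the zero-initialized scale α follow the description; exact layer choices (e.g., whether B and C use reduced channel counts) vary across implementations and are only illustrative here.

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position attention module (sketch): S is the N x N spatial attention map,
    and the output is E = alpha * (D @ S^T) + A, reshaped back to C x H x W."""
    def __init__(self, channels):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, 1)   # produces B
        self.conv_c = nn.Conv2d(channels, channels, 1)   # produces C
        self.conv_d = nn.Conv2d(channels, channels, 1)   # produces D
        self.alpha = nn.Parameter(torch.zeros(1))        # scale α, initialized at 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                                # a (= A): (batch, C, H, W)
        n, c, h, w = a.shape
        b = self.conv_b(a).view(n, c, h * w)             # B reshaped to (n, C, N)
        k = self.conv_c(a).view(n, c, h * w)             # C reshaped to (n, C, N)
        d = self.conv_d(a).view(n, c, h * w)             # D reshaped to (n, C, N)
        energy = torch.bmm(k.permute(0, 2, 1), b)        # transpose(C) x B -> (n, N, N)
        s = self.softmax(energy)                         # spatial attention map S
        out = torch.bmm(d, s.permute(0, 2, 1))           # Σ_i s_ji · D_i at each position j
        return self.alpha * out.view(n, c, h, w) + a     # E = α·out + A
```

Since S is N×N, PAM is memory-hungry at high resolution, which is one reason the attention head operates on the 1/8-resolution backbone features.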
2.2. Channel Attention Module (CAM)
- The process is similar to PAM.
- But different from PAM, the original feature A is directly reshaped to C×N and multiplied by its own transpose, and a softmax is applied to generate the channel attention map X of size C×C:

x_ji = exp(A_i · A_j) / Σ_{i=1..C} exp(A_i · A_j)

- Then, X is applied to the reshaped A via matrix multiplication, the result is reshaped to C×H×W and scaled by a parameter β, and an element-wise sum with A gives the final output E:

E_j = β · Σ_{i=1..C} (x_ji · A_i) + A_j,

where β is initialized as 0 and gradually learned.
CAM selectively emphasizes interdependent channel maps by integrating associated features among all channel maps.
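A matching sketch of CAM, again only illustrative: no convolution projections are applied before computing the C×C attention map, and β is likewise zero-initialized.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Channel attention module (sketch): the C x C map X is computed directly
    from A (no conv projections); output E = beta * (X applied to A) + A."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))          # scale β, initialized at 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                                 # a (= A): (batch, C, H, W)
        n, c, h, w = a.shape
        a_flat = a.view(n, c, h * w)                      # A reshaped to (n, C, N)
        energy = torch.bmm(a_flat, a_flat.permute(0, 2, 1))  # A x transpose(A) -> (n, C, C)
        x = self.softmax(energy)                          # channel attention map X
        out = torch.bmm(x, a_flat).view(n, c, h, w)       # Σ_i x_ji · A_i for each channel j
        return self.beta * out + a                        # E = β·out + A
```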
3. Experimental Results
3.1. Ablation Study
- Compared with the baseline FCN (ResNet-50), employing PAM yields a result of 75.74% in Mean IoU, which brings a 5.71% improvement.
- Meanwhile, employing CAM individually outperforms the baseline by 4.25%.
When integrating the two attention modules together, the performance further improves to 76.34%.
When a deeper pre-trained network (ResNet-101) is used with two attention modules, the segmentation performance is significantly improved over the baseline model by 5.03%.
The proposed attention modules improve performance significantly, where DANet-50 exceeds the baseline by 3.3%. When adopting the deeper ResNet-101, the model further achieves a Mean IoU of 80.4%.
- DA: Data augmentation with random scaling.
- Multi-Grid: a hierarchy of grids of different sizes (4, 8, 16) is employed in the last ResNet block.
- MS: The segmentation probability maps are averaged from 8 image scales {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.2} for inference.
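As a rough illustration of the MS strategy, the sketch below averages softmax probability maps over the listed scales. Here `model` is any hypothetical segmentation network returning per-pixel class logits; the authors may additionally use flipping or sliding-window inference, which this sketch omits.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_inference(model, image,
                          scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.2)):
    """Average segmentation probability maps over several input scales ('MS').
    `image` is a (batch, 3, H, W) tensor; `model` returns (batch, classes, h, w) logits."""
    _, _, h, w = image.shape
    avg_prob = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        logits = model(scaled)
        # bring predictions back to the original resolution before averaging
        logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
        avg_prob = avg_prob + torch.softmax(logits, dim=1)
    return avg_prob / len(scales)
```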
Finally, with these strategies, multi-scale segmentation map fusion further improves the performance to 81.50%, which outperforms the well-known DeepLabv3 (79.30% on the Cityscapes val set) by 2.20%.
3.2. SOTA Comparisons
DANet outperforms existing approaches by a clear margin. In particular, DANet outperforms PSANet by a large margin with the same backbone ResNet-101. Moreover, it also surpasses DenseASPP [25].
DANet is on par with PSPNet and EncNet with the same backbone.
The baseline (Dilated FCN-50) yields a Mean IoU of 44.3%, and DANet-50 boosts the performance to 50.1%.
- Furthermore, with the deeper pretrained ResNet-101, DANet achieves a Mean IoU of 52.6%, which outperforms previous methods by a large margin.
The above table again shows that DANet could capture long-range contextual information more effectively and learn better feature representation in scene segmentation.
3.3. Visualization & Analysis
- With PAM, some details and object boundaries are clearer, such as the ‘pole’ in the first row and the ‘sidewalk’ in the second row.
- Selective fusion over local features enhances the discrimination of details.
- With CAM, some misclassified categories are now correctly classified, such as the ‘bus’ in the first and third rows.
- The selective integration among channel maps helps to capture context information, and the semantic consistency is clearly improved.
- The responses of specific semantics become more noticeable after enhancement by the channel attention module.
- For example, the 11th channel map responds to the ‘car’ class in all three examples, and the 4th channel map responds to the ‘vegetation’ class, which benefits the segmentation of these two scene categories.
Reference
[2019 CVPR] [DANet]
Dual Attention Network for Scene Segmentation