Brief Review — PSANet: Point-wise Spatial Attention Network for Scene Parsing
PSANet: Point-wise Spatial Attention Network for Scene Parsing,
PSANet, by The Chinese University of Hong Kong, SenseTime Research, Nanyang Technological University, and Tencent Youtu Lab,
2018 ECCV, Over 700 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Image Segmentation, Attention Network
- Point-wise Spatial Attention Network (PSANet) is proposed to relax the local neighborhood constraint. Each position on the feature map is connected to all the other ones through a self-adaptively learned attention mask.
- Moreover, bi-directional information propagation for scene parsing is enabled: information at other positions can be collected to help the prediction of the current position, and vice versa, information at the current position can be distributed to assist the prediction of other positions.
Outline
- PSA Module
- PSANet
- Results
1. PSA Module
- The PSA module takes a spatial feature map X of spatial size H×W as input. Through the two branches as illustrated, pixel-wise global attention maps are generated for each position in feature map X through several convolutional layers.
- The module aggregates the input feature map based on the attention maps, following the equations below, to generate new feature representations with the long-range contextual information incorporated, i.e., Z^c from the ‘collect’ branch and Z^d from the ‘distribute’ branch:

z_i^c = \sum_{\forall j \in \Omega} a_{i,j}^c \cdot x_j, \qquad z_i^d = \sum_{\forall j \in \Omega} a_{i,j}^d \cdot x_j

- where a_{i,j}^c and a_{i,j}^d denote the predicted attention values in the point-wise attention maps A^c and A^d from the ‘collect’ and ‘distribute’ branches, respectively, and \Omega denotes the set of all positions on the feature map.
In the implementation, the attention maps are obtained by convolution followed by feature map reshaping; the aggregation itself then reduces to a matrix multiplication, as sketched below.
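To make the aggregation concrete, here is a minimal PyTorch sketch of the two equations above, assuming the attention maps have already been generated and flattened; the tensor names and sizes are illustrative, not the authors' implementation.

```python
import torch

# Illustrative sizes: batch N, channels C, spatial H x W.
N, C, H, W = 2, 64, 8, 8
x = torch.randn(N, C, H, W)                      # input feature map X

# Placeholder attention maps: row i of Ac holds a^c_{i,j} over all j,
# and likewise for Ad (in the real module they come from conv layers).
Ac = torch.rand(N, H * W, H * W)
Ad = torch.rand(N, H * W, H * W)

x_flat = x.view(N, C, H * W)                     # column j is x_j
Zc = torch.bmm(x_flat, Ac.transpose(1, 2))       # z^c_i = sum_j a^c_{i,j} * x_j
Zd = torch.bmm(x_flat, Ad.transpose(1, 2))       # z^d_i = sum_j a^d_{i,j} * x_j
Zc, Zd = Zc.view(N, C, H, W), Zd.view(N, C, H, W)
```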
- The new representations Z^c and Z^d are then concatenated, and a convolutional layer, with batch normalization and activation layers, is applied for dimension reduction and feature fusion (a module sketch is given at the end of this section).
- Then, the new global contextual feature is concatenated with the local representation feature X, followed by one or several convolutional layers with batch normalization and activation layers to generate the final feature map for the following subnetworks.
- It can be flexibly attached to any feature maps in the network.
- The two branches, with the same network structure, represent different information propagation directions.
- Briefly speaking, in the ‘collect’ branch, at each position i (in the k-th row and l-th column), we predict how the current position is related to all other positions, based on the feature at position i.
- On the other hand, in the ‘distribute’ branch, information at the current position is distributed to the other positions: at each position, the branch predicts how important the information at the current position is to the other positions.
- By doing so, these two attention maps encode the context dependency between different position pairs in a complementary (bi-directional) way, leading to improved information propagation and enhanced utilization of long-range context.
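Putting these pieces together, below is a hypothetical, simplified PSA-style module in PyTorch. The over-completed (2H−1)×(2W−1) maps and the reshaping step of the real module are skipped for brevity: each branch here directly predicts H×W attention logits per position for a fixed feature size, so this is a sketch of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SimplePSAModule(nn.Module):
    """Hypothetical, simplified PSA-style module (illustration only).

    Each branch directly predicts H*W attention logits per position for a
    fixed feature size, instead of the paper's over-completed
    (2H-1)x(2W-1) maps plus reshaping.
    """
    def __init__(self, in_ch, mid_ch, feat_size):
        super().__init__()
        self.h, self.w = feat_size
        hw = self.h * self.w
        self.reduce_c = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.reduce_d = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.attn_c = nn.Conv2d(mid_ch, hw, 1)    # 'collect' attention logits
        self.attn_d = nn.Conv2d(mid_ch, hw, 1)    # 'distribute' attention logits
        self.fuse = nn.Sequential(                # dimension reduction + fusion
            nn.Conv2d(2 * mid_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        n, _, h, w = x.shape                      # h, w must equal feat_size
        xc, xd = self.reduce_c(x), self.reduce_d(x)
        # After view, dim 1 indexes the target position j and dim 2 the
        # query position i, so a[b, j, i] plays the role of a_{i,j}.
        ac = self.attn_c(xc).view(n, h * w, h * w)
        ad = self.attn_d(xd).view(n, h * w, h * w)
        zc = torch.bmm(xc.view(n, -1, h * w), ac).view(n, -1, h, w)
        zd = torch.bmm(xd.view(n, -1, h * w), ad).view(n, -1, h, w)
        z = self.fuse(torch.cat([zc, zd], dim=1)) # fused global context
        return torch.cat([x, z], dim=1)           # local + global features
```

The one or several convolutional layers applied after the concatenation, as described above, would then follow this module to produce the final feature map.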
2. PSANet
- ResNet is used as the FCN backbone.
- The PSA module follows stage-5 in ResNet, which is the final stage of the FCN backbone. Features in stage-5 are semantically stronger.
- Aggregating them together leads to a more comprehensive representation of long-range context. Moreover, the spatial size of the feature map at stage-5 is smaller. This can reduce computation overhead and memory consumption.
- The deep supervision technique is also used: an auxiliary loss branch is applied apart from the main loss, as illustrated in the figure above (a minimal sketch follows).
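For concreteness, here is a minimal sketch of such a training loss, assuming hypothetical `backbone`, `psa_head`, `main_cls`, and `aux_cls` modules and that the auxiliary branch sits on stage-4; the 0.4 auxiliary weight follows common PSPNet-style practice and is an assumption, not a value quoted from the paper.

```python
import torch.nn.functional as F

def training_loss(backbone, psa_head, main_cls, aux_cls, images, labels):
    # `backbone` is assumed to return stage-4 and stage-5 feature maps.
    feat4, feat5 = backbone(images)
    main_out = main_cls(psa_head(feat5))   # PSA module after stage-5
    aux_out = aux_cls(feat4)               # auxiliary branch (deep supervision)
    size = labels.shape[-2:]
    main_out = F.interpolate(main_out, size=size, mode='bilinear',
                             align_corners=False)
    aux_out = F.interpolate(aux_out, size=size, mode='bilinear',
                            align_corners=False)
    # The 0.4 auxiliary weight is an assumption (PSPNet-style practice),
    # not a value quoted from the PSANet paper.
    return (F.cross_entropy(main_out, labels)
            + 0.4 * F.cross_entropy(aux_out, labels))
```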
3. Results
3.1. Objective Comparison
- Two network backbones, i.e., ResNet with 50 and 101 layers, are tested.
- The baseline network is a ResNet-based FCN with the dilated convolution strategy (as in DeepLab or DilatedNet), where the dilations are set to 2 and 4 at stage-4 and stage-5, respectively.
- Taking ResNet-50 with information flow in ‘collect’ mode (denoted as ‘+COLLECT’) as an example, it exceeds the baseline by 4.04/1.73 in terms of mean IoU (%) / pixel accuracy (%).
- With the bi-directional information flow model (denoted as ‘+COLLECT+DISTRIBUTE’), the performance further increases to 41.92/80.17.
The PSA module is a better choice for capturing long-range contextual information, outperforming alternatives such as the pyramid pooling module in PSPNet, the Non-Local module, ASPP in DeepLabv2, and global pooling in ParseNet.
ADE20K (Left): With the same network backbone, PSANet obtains higher performance than RefineNet and PSPNet.
VOC (Right): PSANet achieves top performance.
VOC (Left): The PSA module boosts performance, exceeding the baseline by a large margin.
Cityscapes (Right): The improvement brought by PSA module based on the baseline method is shown.
- On Cityscapes, training with only fine data and training with coarse+fine data are both tried; PSANet achieves the best performance under both settings.
3.2. Subjective Comparison & Analysis
PSANet greatly improves the segmentation quality: more accurate and detailed predictions are generated compared to the model without the PSA module.
The attention mask effectively focuses on related regions for better performance.
- For example, in the first row, the mask for the red point, which is located on the beach, assigns larger weights to the sea and beach regions, which is beneficial to the prediction at the red point.
- Meanwhile, the attention mask for the blue point in the sky assigns higher weights to other sky regions.
- A similar trend is also spotted in other images.
Reference
[2018 ECCV] [PSANet]
PSANet: Point-wise Spatial Attention Network for Scene Parsing