Brief Review — PSANet: Point-wise Spatial Attention Network for Scene Parsing

Outperforms PSPNet, Non-Local, DeepLabv2, ParseNet, etc.

Sik-Ho Tsang
5 min read · Nov 18, 2022

PSANet: Point-wise Spatial Attention Network for Scene Parsing, PSANet, by The Chinese University of Hong Kong, SenseTime Research, Nanyang Technological University, and Tencent Youtu Lab
2018 ECCV, Over 700 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Image Segmentation, Attention Network

  • Point-wise Spatial Attention Network (PSANet) is proposed to relax the local neighborhood constraint: each position on the feature map is connected to all the other ones through a self-adaptively learned attention mask.
  • Moreover, bi-directional information propagation for scene parsing is enabled: information at other positions can be collected to help the prediction of the current position, and, vice versa, information at the current position can be distributed to assist the prediction of other positions.

Outline

  1. PSA Module
  2. PSANet
  3. Results

1. PSA Module

  • The PSA module takes a spatial feature map X of spatial size H×W as input. Through the two branches as illustrated, pixel-wise global attention maps are generated for each position in feature map X through several convolutional layers.
  • The module aggregates the input feature map based on the attention maps, following the equations below, to generate new feature representations with the long-range contextual information incorporated, i.e., Zc from the ‘collect’ branch and Zd from the ‘distribute’ branch:

z^c_i = (1/N) · Σ_{∀j∈Ω} a^c_{i,j} · x_j,    z^d_i = (1/N) · Σ_{∀j∈Ω} a^d_{i,j} · x_j

  • where a^c_{i,j} and a^d_{i,j} denote the predicted attention values in the point-wise attention maps A^c and A^d from the ‘collect’ and ‘distribute’ branches, respectively, x_j is the feature at position j, Ω is the set of all positions, and N = H×W.

In practice, these attention maps are produced by several convolutional layers followed by feature map reshaping.
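
As a rough illustration of the aggregation equations above (not the authors' implementation; the tensor layouts and the helper name psa_aggregate are my assumptions), in PyTorch this amounts to a batched weighted sum over all positions:

```python
import torch

def psa_aggregate(x, attn_c, attn_d):
    """Point-wise spatial attention aggregation (illustrative sketch).

    x:      (B, C, H, W)   input feature map X
    attn_c: (B, H*W, H, W) pairwise weights a^c_{i,j}; [:, j, k, l] weights
            the feature at position j when predicting position i = (k, l)
    attn_d: (B, H*W, H, W) pairwise weights a^d_{i,j}, same layout
    """
    B, C, H, W = x.shape
    x_flat = x.view(B, C, H * W)  # flatten spatial positions j
    # z^c_i = (1/N) * sum_j a^c_{i,j} * x_j  (and likewise for z^d_i)
    z_c = torch.einsum('bjkl,bcj->bckl', attn_c, x_flat) / (H * W)
    z_d = torch.einsum('bjkl,bcj->bckl', attn_d, x_flat) / (H * W)
    return z_c, z_d
```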

  • The new representations Zc and Zd are then concatenated, and a convolutional layer with batch normalization and activation layers is applied for dimension reduction and feature fusion.
  • Then, the new global contextual feature is concatenated with the local representation feature X, followed by one or several convolutional layers with batch normalization and activation layers to generate the final feature map for the following subnetworks (a rough sketch follows this list).
  • It can be flexibly attached to any feature map in the network.
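
A minimal sketch of this fusion step, assuming PyTorch; the channel sizes c_in, c_attn, and c_out are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PSAFusion(nn.Module):
    """Fuse Zc/Zd with the local feature X (sketch; channel sizes are
    placeholders, not the paper's exact configuration)."""

    def __init__(self, c_in=2048, c_attn=512, c_out=2048):
        super().__init__()
        # dimension reduction + feature fusion of the concatenated [Zc, Zd]
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * c_attn, c_attn, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_attn),
            nn.ReLU(inplace=True),
        )
        # final convolution after concatenating with the local feature X
        self.proj = nn.Sequential(
            nn.Conv2d(c_in + c_attn, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, z_c, z_d):
        z = self.reduce(torch.cat([z_c, z_d], dim=1))  # global contextual feature
        return self.proj(torch.cat([x, z], dim=1))     # fused final feature map
```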
Point-wise Spatial Attention
  • The two branches, with the same network structure, represent different information propagation directions.
Each position both ‘collects’ and ‘distributes’ information for more comprehensive information propagation.
  • Briefly speaking, in the ‘collect’ branch, at each position i (at the kth row and lth column), the network predicts how the current position is related to all other positions, based on the feature at position i; information at the other positions is then collected accordingly.
  • In the ‘distribute’ branch, on the other hand, information at the current position is distributed to the other positions: at each position, the network predicts how important the information at the current position is to the other positions (see the cropping sketch after this list).
  • By doing so, these two attention maps encode the context dependency between different position pairs in a complementary (bi-directional) way, leading to improved information propagation and enhanced utilization of long-range context.
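
In the paper, each branch predicts, at every position, an over-complete map of size (2H−1)×(2W−1), from which the valid H×W attention map for that position is cropped. A loop-based sketch of this cropping step (the indexing convention and the helper name crop_attention are my assumptions):

```python
import torch

def crop_attention(h, H, W):
    """Crop per-position H x W attention maps from over-complete maps.

    h: (B, (2H-1)*(2W-1), H, W) over-complete maps predicted by convolutions;
       the channel dimension enumerates relative displacements, centered at
       (H-1, W-1).
    returns: (B, H*W, H, W), where [:, j, k, l] is the weight that the map
             predicted at position (k, l) assigns to position j.
    """
    B = h.size(0)
    h = h.view(B, 2 * H - 1, 2 * W - 1, H, W)
    attn = h.new_empty(B, H * W, H, W)
    for k in range(H):          # loop-based for clarity, not efficiency
        for l in range(W):
            # valid displacements j - (k, l) occupy rows [H-1-k, 2H-1-k)
            # and columns [W-1-l, 2W-1-l) of the over-complete map
            attn[:, :, k, l] = h[:, H - 1 - k:2 * H - 1 - k,
                                 W - 1 - l:2 * W - 1 - l, k, l].reshape(B, H * W)
    return attn
```

For the ‘collect’ branch, this cropped tensor can serve directly as a^c_{i,j}: the map predicted at position i weights the source positions j. For the ‘distribute’ branch, the map predicted at position j describes how important j is to every target i, so the pairwise weights correspond to the transposed arrangement of the cropped tensor.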

2. PSANet

Network structure of the ResNet-FCN backbone with the PSA module incorporated. Deep supervision is also adopted for better performance.
  • ResNet is used as the FCN backbone.
  • The PSA module follows stage-5 in ResNet, which is the final stage of the FCN backbone. Features in stage-5 are semantically stronger.
  • Aggregating them together leads to a more comprehensive representation of long-range context. Moreover, the spatial size of the feature map at stage-5 is smaller. This can reduce computation overhead and memory consumption.
  • The deep supervision technique is also used: an auxiliary loss branch is applied apart from the main loss, as illustrated in the figure above (a minimal loss sketch follows this list).
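
A minimal sketch of the combined training loss, assuming PyTorch; the auxiliary weight of 0.4 follows the common practice popularized by PSPNet and is an assumption here, not a value stated in this review:

```python
import torch.nn.functional as F

def psanet_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Main segmentation loss plus a weighted auxiliary loss (deep supervision).

    aux_weight=0.4 follows the common practice from PSPNet -- an assumption
    here, not a value stated in this review.
    """
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)  # main branch
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)    # auxiliary branch
    return main_loss + aux_weight * aux_loss
```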

3. Results

3.1. Objective Comparison

Validation set of ADE20K dataset. SS: Single-Scale Testing, MS: Multi-Scale Testing.
  • Two network backbones, i.e., ResNet with 50 and 101 layers, are tested.
  • The baseline network is a ResNet-based FCN with the dilated convolution module (as in DeepLab or DilatedNet), with dilations set to 2 and 4 at stages 4 and 5.
  • Taking ResNet-50 with the information flow in ‘collect’ mode (denoted as ‘+COLLECT’) as an example, it exceeds the baseline by 4.04/1.73 (mIoU/pixel accuracy, in %).
  • With the bi-directional information flow model (denoted as ‘+COLLECT+DISTRIBUTE’), the performance further increases to 41.92/80.17.

The PSA module is a better choice for capturing long-range contextual information, outperforming alternatives such as the pyramid pooling in PSPNet, Non-Local blocks, ASPP in DeepLabv2, and global pooling in ParseNet.

Left: ADE20K validation set. Right: VOC 2012 test set.

ADE20K (Left): With the same network backbone, PSANet achieves higher performance than RefineNet and PSPNet.

VOC (Right): PSANet achieves top performance.

Improvements introduced by PSA module. (a) val set of VOC 2012. (b) fine val set of Cityscapes.

VOC (Left): The PSA module boosts performance greatly, exceeding the baseline by a large margin.

Cityscapes (Right): The PSA module likewise brings a clear improvement over the baseline method.

Cityscapes test set: (a) training with only fine data; (b) training with coarse+fine data.
  • Training with only fine data and training with coarse+fine data are also tried.

PSANet achieves the best performance under both settings.

3.2. Subjective Comparison & Analysis

Visual improvement on validation set of ADE20K.

PSANet much improves the segmentation quality: more accurate and detailed predictions are generated compared to the baseline without the PSA module.

Visualization of learned masks by PSANet

The attention mask effectively focuses on related regions for better performance.

  • For example, in the first row, the mask for the red point, which is located on the beach, assigns larger weights to the sea and beach regions, which is beneficial to the prediction at the red point.
  • Meanwhile, the attention mask for the blue point in the sky assigns higher weights to other sky regions.
  • A similar trend is also spotted in other images.

Reference

[2018 ECCV] [PSANet]
PSANet: Point-wise Spatial Attention Network for Scene Parsing

