Brief Review — Weakly Supervised Semantic Segmentation Using Superpixel Pooling Network

Superpixel Pooling Network (SPN)

5 min read · Jul 7, 2023

Weakly Supervised Semantic Segmentation Using Superpixel Pooling Network,
Superpixel Pooling Network (SPN), by DGIST, and POSTECH,
2017 AAAI, Over 130 Citations (Sik-Ho Tsang @ Medium)

Superpixel Pooling Network (SPN) is proposed, which uses a superpixel segmentation of the input image as a pooling layout, so that low-level image structure is reflected when learning and inferring semantic segmentation.
The initial annotations generated by SPN are then used to train another neural network that estimates pixelwise semantic labels. The architecture of this segmentation network decouples the semantic segmentation task into classification and segmentation, so that the network learns a class-agnostic shape prior from the noisy annotations. Both networks turn out to be critical for improving semantic segmentation.

Outline

1. Superpixel Pooling Network (SPN)
2. Superpixel-Pooled Class Activation Map (SP-CAM)
3. Iterative Learning of DecoupledNet
4. Results

1. Superpixel Pooling Network (SPN)

SPN is composed of three parts: 1) a feature encoder f_enc, 2) an upsampling module composed of a feature upsampler f_ups and the superpixel pooling layer, and 3) two classification modules that classify the feature vectors obtained from the encoder and from the upsampling module.
The entire network is trained with the two separate classification losses computed by these two classification modules.

1.1. Encoder

The encoder of SPN computes a convolutional feature map z = f_enc(x) of the input image x.
The encoder is a frozen ImageNet-pretrained VGG-16, excluding the fully connected layers.
An additional convolutional layer is added at the end.

1.2. Feature Upsampler

A non-linear upsampling module f_ups is added.
This module consists of two deconvolution layers and one unpooling layer (DeconvNet) followed by another two deconvolution layers.
A batch normalization layer and a ReLU are attached after every deconvolution layer.
A shared pooling switch (Noh, Hong, and Han 2015) is employed between the last pooling layer of the encoder and the unpooling layer, which is known to be useful for reconstructing object structure in semantic segmentation.

1.3. Superpixel Pooling (SP) Layer & Top Branch

The SP layer average-pools the feature vectors that fall inside each superpixel, so the aggregated features are spatially aligned with the superpixels.
The output of the SP layer then becomes an N × K matrix, where N means the number of superpixels in the input image and K indicates the number of channels in the feature map (K = 512 in the current SPN architecture).
The N × K superpixel features are then averaged over superpixels to build a single 1 × K vector, which is classified by the following fully-connected layer to compute and backpropagate the classification loss.
Specifically, the pooled feature vector of the i-th superpixel is given by:
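
Written out under assumed notation (the symbols P_i and ẑ_j are introduced here for illustration and may differ from the paper's), average pooling over the i-th superpixel can be expressed as:

```latex
\bar{z}_i = \frac{1}{|P_i|} \sum_{j \in P_i} \hat{z}_j, \qquad i = 1, \dots, N
```

Here \hat{z} = f_ups(z) is the upsampled feature map, \hat{z}_j is its K-dimensional feature vector at position j, and P_i is the set of positions that fall inside superpixel i.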

Through global average pooling over all superpixels, a single feature vector is obtained for the input image, which is given by:
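
In assumed notation, with \bar{z}_i standing for the pooled K-dimensional feature of the i-th superpixel and N for the number of superpixels, this global average can be written as:

```latex
\tilde{z} = \frac{1}{N} \sum_{i=1}^{N} \bar{z}_i
```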

The vector z̃ is then passed through a fully connected layer, whose output is used to compute the classification loss.
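
The two pooling steps of the top branch can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the nested-list data layout, and the handling of empty superpixels are all assumptions made for the example.

```python
def superpixel_pool(features, superpixels, num_superpixels):
    """Average the K-dim feature vectors inside each superpixel.

    features:    H x W grid of K-dim lists (the upsampled feature map)
    superpixels: H x W grid of ints in [0, num_superpixels)
    returns:     N x K list of per-superpixel mean features
    """
    k = len(features[0][0])
    sums = [[0.0] * k for _ in range(num_superpixels)]
    counts = [0] * num_superpixels
    for feat_row, sp_row in zip(features, superpixels):
        for vec, sp in zip(feat_row, sp_row):
            counts[sp] += 1
            for c in range(k):
                sums[sp][c] += vec[c]
    # Assumption: a superpixel with no positions keeps a zero vector.
    return [[s / max(n, 1) for s in row] for row, n in zip(sums, counts)]


def global_average(pooled):
    """Average the N x K superpixel features into a single K-dim vector."""
    n = len(pooled)
    k = len(pooled[0])
    return [sum(row[c] for row in pooled) / n for c in range(k)]


# Toy 2 x 2 map with K = 2 and two superpixels (left and right columns).
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[3.0, 4.0], [5.0, 2.0]]]
superpixels = [[0, 1],
               [0, 1]]
pooled = superpixel_pool(features, superpixels, 2)
image_vector = global_average(pooled)
```

In this toy case each superpixel covers two positions, so `pooled` holds one mean vector per superpixel and `image_vector` is the mean of those two rows. Note that averaging over superpixels weights every superpixel equally, regardless of its area.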

1.4. Bottom Branch

SPN has a branch that directly applies global average pooling to the feature map z and classifies the aggregated feature vector, for easier convergence.

1.5. Loss Function

Given C object classes, the loss is the sum of C binary classification losses:
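
One common way to write such a loss, assuming sigmoid outputs with binary cross-entropy, is shown below. The exact form used in the paper may differ, and the symbols y_c and s_c are introduced here only for illustration.

```latex
\mathcal{L} = -\sum_{c=1}^{C} \Big[ y_c \log \sigma(s_c) + (1 - y_c) \log\big(1 - \sigma(s_c)\big) \Big]
```

Here y_c ∈ {0, 1} indicates whether class c is present in the image, s_c is the score for class c produced by the classification layer, and σ is the sigmoid function. Under this assumption, the same form would apply to each of the two branches.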

1.6. Multi-Scale

The images are padded to be square, and rescaled randomly to one of the 6 predefined sizes: 250², 300², 350², 400², 450², and 500² pixels, to better model scale variations of objects.
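
The preprocessing above can be sketched as follows. This is a hedged illustration: splitting the padding evenly between both sides and sampling the size uniformly are assumptions, since the text only states that images are padded to be square and rescaled randomly. The function names are hypothetical.

```python
import random

# The six target side lengths listed above (images are square after padding).
TRAIN_SIZES = (250, 300, 350, 400, 450, 500)


def square_padding(height, width):
    """Return (top, bottom, left, right) padding that makes the image square.

    Assumption: padding is split as evenly as possible between both sides.
    """
    side = max(height, width)
    pad_h, pad_w = side - height, side - width
    top, left = pad_h // 2, pad_w // 2
    return top, pad_h - top, left, pad_w - left


def sample_train_size(rng=random):
    """Pick one of the predefined side lengths (assumed uniform sampling)."""
    return rng.choice(TRAIN_SIZES)


padding = square_padding(375, 500)            # a 375 x 500 image
target = sample_train_size(random.Random(0))  # seeded for repeatability
```

For the 375 × 500 example, all padding goes to the top and bottom, giving a 500 × 500 image that is then resized to `target` × `target`.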

2. Superpixel-Pooled Class Activation Map (SP-CAM)

During inference, the feature vector of each superpixel is first given to the fully-connected classification layer following the SP layer, and the class scores of the individual superpixels are computed.

SP-CAM assigns class activations to individual superpixels, which makes it possible to generate class activation scores at the original resolution while preserving image structure.

This property of SP-CAM is particularly useful for semantic segmentation as illustrated above.

The activation map of each class in SP-CAM is thresholded by 50% of the maximum score of the map. If the activation scores of a pixel are below a predefined threshold in all the C channels, it is considered as background.
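
The thresholding step can be sketched in plain Python as below. It is an illustration under stated assumptions, not the authors' code: the function names and the background value of -1 are hypothetical, scores are assumed non-negative, and when a superpixel passes the threshold for several classes the highest score wins, a tie-break the text does not specify. Which classes appear as columns of the score matrix (for example, only those in the image-level label) is left to the caller.

```python
def sp_cam_labels(sp_scores, ratio=0.5, background=-1):
    """Assign a class index (or background) to every superpixel.

    sp_scores: N x C list of class scores, one row per superpixel
    ratio:     fraction of each class's maximum score used as its threshold
    """
    num_classes = len(sp_scores[0])
    labels = [background] * len(sp_scores)
    best = [float("-inf")] * len(sp_scores)
    for c in range(num_classes):
        threshold = ratio * max(row[c] for row in sp_scores)
        for i, row in enumerate(sp_scores):
            # Keep superpixels at or above the class threshold; if several
            # classes qualify, keep the highest score (assumed tie-break).
            if row[c] >= threshold and row[c] > best[i]:
                best[i], labels[i] = row[c], c
    return labels


def to_pixel_map(superpixels, labels):
    """Paint superpixel labels back onto the H x W superpixel index map."""
    return [[labels[sp] for sp in row] for row in superpixels]


# Toy example: four superpixels, two classes.
sp_scores = [[0.9, 0.1], [0.5, 0.2], [0.1, 0.8], [0.2, 0.3]]
superpixels = [[0, 0, 1],
               [2, 2, 3]]
labels = sp_cam_labels(sp_scores)
pixel_map = to_pixel_map(superpixels, labels)
```

Because the labels are assigned per superpixel and then painted back through the superpixel index map, the resulting annotation follows superpixel boundaries at the input resolution.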

3. Iterative Learning of DecoupledNet

SPN tends to exaggerate the class scores of discriminative superpixels. Some parts of an object could be lost during initial annotation, as above.

One way to resolve this issue is to take hints from the annotations of other images, which can be realized by learning class-agnostic segmentation knowledge with DecoupledNet.

The classification network of DecoupledNet is trained with image-level labels and serves as a feature encoder in the above network.

In each round of the algorithm, DecoupledNet is trained on generated annotations. In the first round these are provided by SPN; from the second round onward, they are provided by the DecoupledNet trained in the previous round.
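
The round structure can be sketched as a short loop. This is only an outline under assumptions: `train_fn` and `predict_fn` are hypothetical callables standing in for DecoupledNet training and inference, the default number of rounds is arbitrary, and whether the network is re-initialized each round is not specified here.

```python
def iterative_training(images, initial_annotations, train_fn, predict_fn, rounds=3):
    """Alternate between training a segmenter and regenerating annotations.

    train_fn(images, annotations) -> model        (hypothetical callable)
    predict_fn(model, images)     -> annotations  (hypothetical callable)
    """
    annotations = initial_annotations  # round 1 uses the SPN-generated labels
    model = None
    for _ in range(rounds):
        model = train_fn(images, annotations)
        # The next round learns from this model's own predictions.
        annotations = predict_fn(model, images)
    return model, annotations


# Toy stand-ins so the loop can be run end to end.
history = []


def toy_train(images, annotations):
    history.append(list(annotations))
    return len(history)  # the "model" is just the round number


def toy_predict(model, images):
    return [f"round{model}:{name}" for name in images]


model, final = iterative_training(
    ["a", "b"], ["spn:a", "spn:b"], toy_train, toy_predict, rounds=2
)
```

The `history` list records which annotations each round was trained on: the SPN output in the first round, then the previous round's predictions.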