Brief Review — Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation

DecoupledNet, Decouple Classification & Segmentation

Sik-Ho Tsang
5 min read · Mar 3, 2023

Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation,
DecoupledNet, by POSTECH,
2015 NIPS, Over 370 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Semi-Supervised Learning, Weakly Supervised Learning


  • It is assumed that a large number of image-level annotations are available while there are only a few training images with segmentation annotations.
  • DecoupledNet decouples classification and segmentation, and learns a separate network for each task.
  • Labels associated with an image are first identified by the classification network, and binary segmentation is then performed by the segmentation network for each identified label. This effectively reduces the search space for segmentation by exploiting class-specific activation maps obtained from the bridging layers.

Outline

  1. DecoupledNet
  2. Results

1. DecoupledNet

DecoupledNet Architecture

1.1. Overall Architecture

  • The network is composed of three parts: a classification network, a segmentation network, and bridging layers connecting the two.
  • Given an input image, the classification network identifies the labels associated with the image, and the segmentation network produces a pixel-wise figure-ground segmentation for each identified label.
  • The bridging layers sit between the two networks and deliver class-specific information from the classification network to the segmentation network. This makes it possible to optimize the two networks with separate objective functions. (A minimal sketch of this inference pipeline is shown below.)
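To make the decoupled inference flow concrete, here is a minimal PyTorch-style sketch. The module interfaces (a classifier returning normalized scores plus pool5 features, bridging layers taking a class index, a segmenter producing a two-channel map) and the 0.5 threshold are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecoupledNetSketch(nn.Module):
    """Minimal sketch of the decoupled inference pipeline; interfaces are illustrative,
    not the authors' implementation."""

    def __init__(self, classifier, bridging, segmenter, threshold=0.5):
        super().__init__()
        self.classifier = classifier   # e.g. a VGG-16 backbone with a multi-label head
        self.bridging = bridging       # fully-connected bridging layers
        self.segmenter = segmenter     # e.g. a DeconvNet-style decoder
        self.threshold = threshold     # assumed label-identification threshold

    def forward(self, x):
        # 1) Classification: identify the labels present in the image.
        #    The classifier is assumed to return normalized class scores and pool5 features.
        scores, f_spat = self.classifier(x)                   # scores: (1, C); f_spat: (1, 512, 7, 7)
        labels = (scores > self.threshold).nonzero(as_tuple=True)[1]

        # 2) For each identified label: build a class-specific activation map via the
        #    bridging layers, then predict a binary (figure/ground) segmentation map.
        masks = {}
        for l in labels.tolist():
            g_l = self.bridging(f_spat, scores, l)            # class-specific activation map g_i^l
            masks[l] = self.segmenter(g_l).softmax(dim=1)     # 2-channel fg/bg map M(g_i^l; θ_s)
        return masks
```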

1.2. Classification Network

  • The classification network takes an image x as its input, and outputs a normalized score vector S(x; θ_c).
  • The objective of the classification network is to minimize the classification error between the ground-truth image-level labels and the estimated class labels.
  • VGG-16 is used.
  • A sigmoid cross-entropy loss function is employed, which is a typical choice for multi-label classification tasks.

Given the output scores S(x_i; θ_c), the classification network identifies a set of labels L_i associated with input image x_i. The region in x_i corresponding to each label l ∈ L_i is predicted by the segmentation network discussed next.
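As a hedged illustration of this objective, the sigmoid cross-entropy loss over image-level labels can be computed as below; the batch size of 8, the 20 PASCAL VOC classes, and the 0.5 label threshold are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

num_classes = 20                                          # assuming the 20 PASCAL VOC classes
logits = torch.randn(8, num_classes)                      # raw class scores before the sigmoid
targets = torch.randint(0, 2, (8, num_classes)).float()   # ground-truth image-level labels

# Sigmoid cross-entropy over all classes and images.
loss = F.binary_cross_entropy_with_logits(logits, targets)

# At test time, L_i is the set of classes whose normalized scores exceed a threshold.
predicted_labels = [row.nonzero(as_tuple=True)[0].tolist() for row in (logits.sigmoid() > 0.5)]
```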

1.3. Segmentation Network

Input image (left) and its segmentation maps for individual classes (right).
  • The segmentation network takes a class-specific activation map g_i^l of input image x_i, which is obtained from the bridging layers, and produces a two-channel class-specific segmentation map M(g_i^l; θ_s) after applying a softmax function, i.e. one channel each for the foreground and background classes.
  • The segmentation task is formulated as per-pixel regression to the ground-truth segmentation, minimizing a per-pixel loss between M(g_i^l; θ_s) and the binary ground-truth mask (an illustrative form is sketched below).
  • DeconvNet is used, which consists of multiple series of unpooling, deconvolution and rectification operations.

As it is now a binary classification problem, DecoupledNet reduces the number of parameters in the segmentation network significantly. This property is especially advantageous in a challenging scenario where only a few pixel-wise annotations (typically 5 to 10 per class) are available for training the segmentation network.
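Below is a rough sketch of the figure/ground objective for one identified label. The 224×224 resolution and the use of a per-pixel two-class cross-entropy over the softmax channels are assumptions of this sketch; the paper phrases the objective as per-pixel regression to the ground-truth mask.

```python
import torch
import torch.nn.functional as F

# Two-channel class-specific map M(g_i^l; θ_s): channel 0 = background, channel 1 = foreground.
seg_logits = torch.randn(1, 2, 224, 224)             # assumed 224x224 output resolution
gt_mask = torch.randint(0, 2, (1, 224, 224))         # binary ground-truth mask for label l

# Per-pixel two-class loss (the softmax over the two channels is applied inside cross_entropy).
loss = F.cross_entropy(seg_logits, gt_mask)
```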

1.4. Bridging Layers

Examples of class-specific activation maps (output of bridging layers).
  • The output of the last pooling layer (pool5) in the classification network, denoted f_spat, is exploited, as it preserves spatial information effectively while tending to capture more abstract and global information.
  • Class-specific saliency maps are also computed using the back-propagation technique proposed in [14], i.e. f_cls^l = ∂S^l/∂f_spat,
  • where f_cls^l denotes the class-specific saliency map and S^l is the classification score of class l.

The class-specific activation map g_i^l is obtained by combining f_spat and f_cls^l: the two are first concatenated along the channel direction and then forward-propagated through the fully-connected bridging layers (see the sketch below).

  • Note that g_i^l changes only with f_cls^l, since f_spat is fixed for all classes in an input image.
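A minimal sketch of the bridging step is shown below: back-propagate the class score S^l to the pool5 features to obtain f_cls^l, concatenate it with f_spat along the channel direction, and pass the result through fully-connected layers. The stand-in classifier head and the layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-ins for the fc layers of the classification network and for the bridging layers
# (sizes are assumptions, not the paper's).
classifier_head = nn.Sequential(nn.Flatten(), nn.Linear(512 * 7 * 7, 20))
bridging_layers = nn.Sequential(nn.Flatten(), nn.Linear(2 * 512 * 7 * 7, 4096))

f_spat = torch.randn(1, 512, 7, 7, requires_grad=True)    # pool5 output for one image
scores = classifier_head(f_spat)                          # class scores S, shape (1, 20)

l = 11                                                    # an identified class index
# f_cls^l: gradient of the class score S^l with respect to f_spat ([14]-style back-propagation).
f_cls_l = torch.autograd.grad(scores[0, l], f_spat, retain_graph=True)[0]

# g_i^l: concatenate f_spat and f_cls^l along the channel direction, then apply the bridging layers.
g_l = bridging_layers(torch.cat([f_spat, f_cls_l], dim=1))
```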

1.5. Training

  • Let W = {1, …, N_w} and S = {1, …, N_s} denote the index sets of images with image-level and pixel-wise class labels, respectively, where N_w >> N_s.
  • The classification network is first trained using the images in W.
  • Then, fixing the weights in the classification network, the bridging layers and the segmentation network are jointly trained using images in S.
  • An effective data augmentation strategy, combinatorial cropping, is proposed.
  • Let L*_i denote the set of ground-truth labels associated with image x_i. All possible combinations of labels are enumerated in P(L*_i), the powerset of L*_i.
  • For each combination P ∈ P(L*_i), a binary ground-truth segmentation mask z_i^P is constructed by setting the pixels corresponding to every label l ∈ P as foreground and the rest as background, and N_p sub-images are obtained (a toy example of this mask construction is sketched after this list).
  • Finally, N_t training samples with strong annotations are effectively obtained, where N_t >> N_s.
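A toy sketch of the combinatorial-cropping mask construction is given below: it enumerates the non-empty label combinations in P(L*_i) and builds a binary figure/ground mask for each. The per-class mask dictionary and the 4×4 toy masks are assumptions for illustration; the N_p cropped sub-images per combination are not shown.

```python
from itertools import chain, combinations

import numpy as np

def label_combinations(labels):
    """Enumerate the powerset P(L*_i) of the ground-truth labels (empty set excluded here)."""
    labels = sorted(labels)
    return chain.from_iterable(combinations(labels, r) for r in range(1, len(labels) + 1))

def binary_mask(combo, per_class_masks):
    """Pixels of every label in the combination become foreground; the rest stay background."""
    mask = np.zeros_like(next(iter(per_class_masks.values())))
    for l in combo:
        mask = np.maximum(mask, per_class_masks[l])
    return mask

# Toy per-class ground-truth masks for one image with L*_i = {"person", "dog"}.
per_class_masks = {
    "person": np.array([[1, 1, 0, 0]] * 4),
    "dog":    np.array([[0, 0, 1, 1]] * 4),
}

for combo in label_combinations(per_class_masks):
    z_P = binary_mask(combo, per_class_masks)   # binary ground truth z_i^P for this combination
    print(combo, z_P.sum(), "foreground pixels")
```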

2. Results

Evaluation results on PASCAL VOC 2012 validation set.

DecoupledNet has a clear advantage over WSSL when the number of strong annotations is extremely small, because it effectively reduces the search space for segmentation.

Evaluation results on PASCAL VOC 2012 test set.
  • A drawback is that it does not achieve state-of-the-art performance when (almost) full supervision is provided.

DecoupledNet is more appropriate for the semi-supervised learning scenario.

DecoupledNet trained with only five strong annotations per class already shows good generalization performance.
