Review: SDN — Stacked Deconvolutional Network Using DenseNet (Semantic Segmentation)

By stacking multiple encoder-decoder networks, SDN outperforms FCN, DeepLabv1, DeepLabv2, DeepLabv3, DilatedNet, CRF-RNN, DeconvNet, PSPNet, FC-DenseNet, SegNet, and RefineNet.

Sik-Ho Tsang
6 min read · Aug 1, 2019

In this story, SDN (Stacked Deconvolutional Network), by the Chinese Academy of Sciences and the University of Chinese Academy of Sciences, is reviewed. In this paper:

  • Multiple shallow deconvolutional networks, called SDN units, are stacked to integrate contextual information and guarantee fine recovery of localization information.
  • Inter-unit and intra-unit skip connections are used to assist network training and enhance feature fusion.
  • Hierarchical supervision is applied to benefit network optimization.

This is a 2017 arXiv tech report with more than 40 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. SDN Network Architecture Overview
  2. SDN Unit
  3. Densely Connected SDN Units
  4. Hierarchical Supervision
  5. Ablation Study
  6. Comparisons with State-of-the-art Approaches

1. SDN Network Architecture Overview

SDN Network Architecture, (a) SDN Unit, (b) Downsampling Block, (c) Upsampling Block
  • As shown at the top of the figure above, three SDN units (encoder-decoder networks) are stacked.
  • The encoder of the first SDN unit is an ImageNet-pretrained DenseNet-161.
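The resolution flow of this stacked design can be sketched with simple bookkeeping (a hypothetical sketch: the three units and the 1/4-resolution DenseNet stem follow the figure, while the 512-pixel input size is illustrative):

```python
# Hypothetical resolution bookkeeping for three stacked SDN units.
# Each unit downsamples by 4x (two downsampling blocks) and then
# upsamples back by 4x (two upsampling blocks); the first unit starts
# from DenseNet-161 features at 1/4 of the input resolution.

def sdn_unit(res):
    """One encoder-decoder unit: 1/4 -> 1/16 -> 1/4 of input resolution."""
    res = res // 4   # encoder: two 2x downsampling blocks
    res = res * 4    # decoder: two 2x upsampling blocks
    return res

input_size = 512
feat = input_size // 4          # DenseNet-161 stem output: 1/4 resolution
for _ in range(3):              # three stacked SDN units
    feat = sdn_unit(feat)
print(feat)                     # 128, i.e. still 1/4 of the 512-pixel input
```

Each unit thus hands its successor a feature map at the same 1/4 resolution, which is what makes the units stackable.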

2. SDN Unit

  • An SDN unit is composed of an encoder module and a decoder module, as shown above at (a).

2.1. Encoder

  • In the encoder, two downsampling blocks are stacked so that feature maps at 1/16 of the input image's spatial resolution are obtained.
  • One downsampling block consists of a max pooling layer, two or more convolutional layers, and a compression layer using 3×3 convolution, as shown above at (b).
  • Intra-unit skip connections are used to concatenate the input of the previous convolutional layer to the output of the current layer.
  • The compression layer reduces the channel number to avoid excessive GPU memory demands.
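One downsampling block can be sketched with channel/shape bookkeeping only (a hypothetical sketch: the channel widths are illustrative, and pooling and 3×3 convolutions are reduced to their shape effects):

```python
import numpy as np

def conv(x, out_ch):
    """Stand-in for a 3x3 convolution: same spatial size, new channel count."""
    n, _, h, w = x.shape
    return np.zeros((n, out_ch, h, w), dtype=x.dtype)

def downsampling_block(x, growth=48, compressed=256):
    x = x[:, :, ::2, ::2]                     # max pooling halves H and W
    for _ in range(2):                        # two convolutional layers
        y = conv(x, growth)
        x = np.concatenate([x, y], axis=1)    # intra-unit skip: concat input
    return conv(x, compressed)                # 3x3 compression layer trims channels

x = np.zeros((1, 384, 64, 64), dtype=np.float32)
print(downsampling_block(x).shape)            # (1, 256, 32, 32)
```

Without the compression layer the channel count would keep growing with every dense concatenation, which is exactly the GPU memory pressure the paper's compression layer is meant to avoid.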

2.2. Decoder

  • In the decoder, two upsampling blocks are stacked to enlarge the feature maps back to 1/4 of the input image's spatial resolution.
  • Similar to the encoder, convolutional layers and a compression layer, with intra-unit skip connections, are used, as shown above at (c).
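An upsampling block mirrors the downsampling block; here is the same kind of hypothetical shape-bookkeeping sketch (illustrative widths, with a 2× nearest-neighbour repeat standing in for the deconvolution layer):

```python
import numpy as np

def conv(x, out_ch):
    """Stand-in for a 3x3 convolution: same spatial size, new channel count."""
    n, _, h, w = x.shape
    return np.zeros((n, out_ch, h, w), dtype=x.dtype)

def upsampling_block(x, growth=48, compressed=256):
    x = x.repeat(2, axis=2).repeat(2, axis=3)  # deconvolution stand-in: 2x upsample
    for _ in range(2):                         # conv layers with intra-unit skips
        y = conv(x, growth)
        x = np.concatenate([x, y], axis=1)
    return conv(x, compressed)                 # compression layer trims channels

x = np.zeros((1, 512, 32, 32), dtype=np.float32)
print(upsampling_block(x).shape)               # (1, 256, 64, 64)
```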

3. Densely Connected SDN Units

  • There are two types of inter-unit skip connections in the proposed framework. One is between any two adjacent SDN units, and the other is a set of skip connections from the first SDN unit to the others.
  • The first type promotes the flow of high-level semantic information and improves the optimization of the encoder modules.
  • The second type fuses low-level representations with high-level semantic features, resulting in refined object segmentation edges.
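As a minimal sketch (shape bookkeeping only, illustrative channel widths), a later unit's input can be viewed as concatenating both connection types at a shared resolution:

```python
import numpy as np

# Hypothetical sketch of the two inter-unit skip connection types:
# `prev` stands for the adjacent previous unit's output (type 1),
# `enc1` for same-resolution features from the first unit's encoder (type 2).
def unit_input(prev, enc1):
    return np.concatenate([prev, enc1], axis=1)

prev = np.zeros((1, 256, 64, 64))   # type 1: from the adjacent unit
enc1 = np.zeros((1, 96, 64, 64))    # type 2: from the first unit's encoder
print(unit_input(prev, enc1).shape)  # (1, 352, 64, 64)
```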

4. Hierarchical Supervision

  • As shown above at (c), the output of each upsampling block is fed to a pixel-wise classification layer to obtain a feature map E with C channels, where C is the number of possible labels.
  • The classification layer is a 3×3 convolution operation.
  • E is then upsampled to the size of the input image with bilinear interpolation and finally supervised with the pixel-wise ground truth.
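One such auxiliary supervision head can be sketched as follows (a hypothetical sketch: nearest-neighbour repeat stands in for the paper's bilinear interpolation, and the loss is the standard per-pixel cross-entropy):

```python
import numpy as np

def supervision_loss(E, labels):
    """Upsample score map E (C x h x w) to the label size, then per-pixel CE."""
    scale = labels.shape[0] // E.shape[1]
    E = E.repeat(scale, axis=1).repeat(scale, axis=2)         # upsample to H x W
    logp = E - np.log(np.exp(E).sum(axis=0, keepdims=True))   # log-softmax over C
    h, w = labels.shape
    return -logp[labels, np.arange(h)[:, None], np.arange(w)].mean()

C = 21                                   # PASCAL VOC: 20 classes + background
E = np.zeros((C, 8, 8))                  # score map at a lower resolution
labels = np.zeros((32, 32), dtype=int)   # ground truth at input resolution
print(round(supervision_loss(E, labels), 4))   # log(21) ≈ 3.0445 for uniform scores
```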
Hierarchical supervision with score map connections during upsampling process.
  • To enhance score map fusion, before the bilinear interpolation the output at a later layer is fused with the same-resolution output at an earlier layer by element-wise sum.
  • In the testing phase, only the highest-resolution result of the last unit is used as the final prediction.
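The score map connection itself is just an element-wise sum of same-resolution score maps; a minimal sketch:

```python
import numpy as np

def fuse_scores(score_maps):
    """Element-wise sum of same-resolution score maps from successive stages."""
    fused = score_maps[0]
    for s in score_maps[1:]:
        fused = fused + s
    return fused

earlier = np.full((21, 8, 8), 0.5)   # score map from an earlier stage
later = np.full((21, 8, 8), 1.5)     # score map from a later stage
print(float(fuse_scores([earlier, later]).mean()))   # 2.0
```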

5. Ablation Study

  • The PASCAL VOC 2012 validation set is used.

5.1. Stacking Multiple SDN Units & Stacked Network Design

Different Stacked SDN Structures
PASCAL VOC 2012 Validation Set
  • SDN_M1: One SDN unit, 78.2% mIoU.
  • SDN_M1+: One SDN unit with naive large decoder, 78.6% mIoU.
  • SDN_M2: Two SDN units, 79.2% mIoU.
  • SDN_M3: Three SDN units, 79.9% mIoU.
  • The more SDN units, the higher the mIoU.
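All of these ablation numbers are mean intersection-over-union; for reference, a minimal sketch of how mIoU is computed (standard definition, not code from the paper):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU averaged over classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 4))   # (1/2 + 2/3) / 2 = 0.5833
```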
Some Visualizations of PASCAL VOC 2012 Validation Set

5.2. Hierarchical Supervision

  • SDN_M1_1: Supervision added only at up ratio {4}, 77.5% mIoU.
  • SDN_M1_2: Supervision added at up ratios {8, 4}, 78.0% mIoU.
  • SDN_M1: Supervision added at up ratios {16, 8, 4}, 78.2% mIoU.

5.3. Score Map Connections

PASCAL VOC 2012 Validation Set
  • SDN_M2-: Without score map connection, 78.8% mIoU.
  • SDN_M2: With score map connection, 79.2% mIoU.

5.4. Some Improvement Strategies

PASCAL VOC 2012 Validation Set
  • Up: Cascading an upsampling block to restore high-resolution features, 79.6% mIoU.
  • MS_Flip: Averaging the segmentation probability maps from 5 image scales {0.5, 0.8, 1, 1.2, 1.4} and their mirrors at inference, 80.7% mIoU.
  • COCO: Pretraining on the MS COCO dataset, 84.8% mIoU.
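The MS_Flip strategy above can be sketched as follows (a hypothetical sketch: resizing the scaled predictions back to a common resolution is elided, and the toy `predict` function is an assumption, not the paper's model):

```python
import numpy as np

def ms_flip_average(predict, image, scales=(0.5, 0.8, 1.0, 1.2, 1.4)):
    """Average probability maps over 5 scales and their horizontal mirrors."""
    probs = []
    for s in scales:
        probs.append(predict(image, s))                        # scaled input
        probs.append(predict(image[:, ::-1], s)[:, :, ::-1])   # mirror, then unflip
    return np.mean(probs, axis=0)

# toy stand-in "model": two-class probabilities, ignores the scale argument
predict = lambda img, s: np.stack([img, 1.0 - img])
image = np.linspace(0, 1, 16).reshape(4, 4)
avg = ms_flip_average(predict, image)
print(avg.shape)   # (2, 4, 4): C x H x W averaged probability map
```

Note that the mirrored prediction must be flipped back before averaging so that all 10 maps are pixel-aligned.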

6. Comparisons with State-of-the-art Approaches

6.1. PASCAL VOC 2012 Test Set

PASCAL VOC 2012 Test Set
Some Visualizations: Input (Left), Groundtruth (Middle), SDN (Right)

6.2. CamVid Test Set

CamVid Test Set
Some Visualizations: Input (Top), Groundtruth (Middle), SDN (Bottom)

6.3. GATECH Test Set

GATECH Test Set
  • SDN obtains 53.5% mIoU.
  • With the network pretrained on VOC 2012, SDN+ obtains 55.9% mIoU.
  • And SDN outperforms FC-DenseNet.
Some Visualizations: Input (Top), Groundtruth (Middle), SDN (Bottom)

Reference

[2017 arXiv] [SDN]
Stacked Deconvolutional Network for Semantic Segmentation

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [LC] [FC-DenseNet] [IDW-CNN] [SDN]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]
