Review: SDN — Stacked Deconvolutional Network Using DenseNet (Semantic Segmentation)

By stacking multiple encoder-decoder networks, SDN outperforms FCN, DeepLabv1, DeepLabv2, DeepLabv3, DilatedNet, CRF-RNN, DeconvNet, PSPNet, FC-DenseNet, SegNet, and RefineNet.

Sik-Ho Tsang
6 min read · Aug 1, 2019

In this story, SDN (Stacked Deconvolutional Network), by the Chinese Academy of Sciences and the University of Chinese Academy of Sciences, is reviewed. In this paper:

  • Multiple shallow deconvolutional networks, called SDN units, are stacked to integrate contextual information and guarantee fine recovery of localization information.
  • Inter-unit and intra-unit skip connections are used to assist network training and enhance feature fusion.
  • Hierarchical supervision is applied to benefit network optimization.

This is a 2017 arXiv tech report with more than 40 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. SDN Network Architecture Overview
  2. SDN Unit
  3. Densely Connected SDN Units
  4. Hierarchical Supervision
  5. Ablation Study
  6. Comparisons with State-of-the-art Approaches

1. SDN Network Architecture Overview

SDN Network Architecture, (a) SDN Unit, (b) Downsampling Block, (c) Upsampling Block
  • As shown at the top of the figure above, three SDN units (encoder-decoder networks) are stacked.
  • The encoder of the first SDN unit is an ImageNet-pretrained DenseNet-161.
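The resolution flow of this stacked design can be sketched with simple bookkeeping (a hypothetical sketch: the three units and the 1/4-resolution DenseNet stem follow the figure, while the 512-pixel input size is illustrative):

```python
# Hypothetical resolution bookkeeping for three stacked SDN units.
# Each unit downsamples by 4x (two downsampling blocks) and then
# upsamples back by 4x (two upsampling blocks); the first unit starts
# from DenseNet-161 features at 1/4 of the input resolution.

def sdn_unit(res):
    """One encoder-decoder unit: 1/4 -> 1/16 -> 1/4 of input resolution."""
    res = res // 4   # encoder: two 2x downsampling blocks
    res = res * 4    # decoder: two 2x upsampling blocks
    return res

input_size = 512
feat = input_size // 4          # DenseNet-161 stem output: 1/4 resolution
for _ in range(3):              # three stacked SDN units
    feat = sdn_unit(feat)
print(feat)                     # 128, i.e. still 1/4 of the 512-pixel input
```

Each unit thus hands its successor a feature map at the same 1/4 resolution, which is what makes the units stackable.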

2. SDN Unit

  • An SDN unit is composed of an encoder module and a decoder module, as shown above at (a).

2.1. Encoder

  • In the encoder, two downsampling blocks are stacked so that feature maps at 1/16 of the input image's spatial resolution are obtained.
  • One downsampling block consists of a max pooling layer, two or more convolutional layers, and a compression layer using 3×3 convolution, as shown above at (b).
  • Intra-unit skip connections are used to concatenate the input of the previous convolutional layer to the output of the current layer.
  • The compression layer reduces the channel number to avoid excessive GPU memory demands.
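One downsampling block can be sketched with channel/shape bookkeeping only (a hypothetical sketch: the channel widths are illustrative, and pooling and 3×3 convolutions are reduced to their shape effects):

```python
import numpy as np

def conv(x, out_ch):
    """Stand-in for a 3x3 convolution: same spatial size, new channel count."""
    n, _, h, w = x.shape
    return np.zeros((n, out_ch, h, w), dtype=x.dtype)

def downsampling_block(x, growth=48, compressed=256):
    x = x[:, :, ::2, ::2]                     # max pooling halves H and W
    for _ in range(2):                        # two convolutional layers
        y = conv(x, growth)
        x = np.concatenate([x, y], axis=1)    # intra-unit skip: concat input
    return conv(x, compressed)                # 3x3 compression layer trims channels

x = np.zeros((1, 384, 64, 64), dtype=np.float32)
print(downsampling_block(x).shape)            # (1, 256, 32, 32)
```

Without the compression layer the channel count would keep growing with every dense concatenation, which is exactly the GPU memory pressure the paper's compression layer is meant to avoid.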

2.2. Decoder

  • In the decoder, two upsampling blocks are stacked to enlarge the feature maps back to 1/4 of the input image's spatial resolution.
  • Similar to the encoder, convolutional layers and a compression layer, with intra-unit skip connections, are used, as shown above at (c).
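An upsampling block mirrors the downsampling block; here is the same kind of hypothetical shape-bookkeeping sketch (illustrative widths, with a 2× nearest-neighbour repeat standing in for the deconvolution layer):

```python
import numpy as np

def conv(x, out_ch):
    """Stand-in for a 3x3 convolution: same spatial size, new channel count."""
    n, _, h, w = x.shape
    return np.zeros((n, out_ch, h, w), dtype=x.dtype)

def upsampling_block(x, growth=48, compressed=256):
    x = x.repeat(2, axis=2).repeat(2, axis=3)  # deconvolution stand-in: 2x upsample
    for _ in range(2):                         # conv layers with intra-unit skips
        y = conv(x, growth)
        x = np.concatenate([x, y], axis=1)
    return conv(x, compressed)                 # compression layer trims channels

x = np.zeros((1, 512, 32, 32), dtype=np.float32)
print(upsampling_block(x).shape)               # (1, 256, 64, 64)
```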

3. Densely Connected SDN Units

  • There are two types of inter-unit skip connections in the proposed framework. One is between any two adjacent SDN units, and the other is a set of skip connections from the first SDN unit to the others.
  • The first type promotes the flow of high-level semantic information and improves the optimization of the encoder modules.
  • The second type fuses low-level representations with high-level semantic features, resulting in refined object segmentation edges.
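As a minimal sketch (shape bookkeeping only, illustrative channel widths), a later unit's input can be viewed as concatenating both connection types at a shared resolution:

```python
import numpy as np

# Hypothetical sketch of the two inter-unit skip connection types:
# `prev` stands for the adjacent previous unit's output (type 1),
# `enc1` for same-resolution features from the first unit's encoder (type 2).
def unit_input(prev, enc1):
    return np.concatenate([prev, enc1], axis=1)

prev = np.zeros((1, 256, 64, 64))   # type 1: from the adjacent unit
enc1 = np.zeros((1, 96, 64, 64))    # type 2: from the first unit's encoder
print(unit_input(prev, enc1).shape)  # (1, 352, 64, 64)
```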

4. Hierarchical Supervision

  • As shown above at (c), the output of each upsampling block is fed to a pixel-wise classification layer to obtain a feature map E with C channels, where C is the number of possible labels.
  • The classification layer is a 3×3 convolution operation.
  • E is then upsampled to the size of the input image with bilinear interpolation and finally supervised with the pixel-wise ground truth.
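One such auxiliary supervision head can be sketched as follows (a hypothetical sketch: nearest-neighbour repeat stands in for the paper's bilinear interpolation, and the loss is the standard per-pixel cross-entropy):

```python
import numpy as np

def supervision_loss(E, labels):
    """Upsample score map E (C x h x w) to the label size, then per-pixel CE."""
    scale = labels.shape[0] // E.shape[1]
    E = E.repeat(scale, axis=1).repeat(scale, axis=2)         # upsample to H x W
    logp = E - np.log(np.exp(E).sum(axis=0, keepdims=True))   # log-softmax over C
    h, w = labels.shape
    return -logp[labels, np.arange(h)[:, None], np.arange(w)].mean()

C = 21                                   # PASCAL VOC: 20 classes + background
E = np.zeros((C, 8, 8))                  # score map at a lower resolution
labels = np.zeros((32, 32), dtype=int)   # ground truth at input resolution
print(round(supervision_loss(E, labels), 4))   # log(21) ≈ 3.0445 for uniform scores
```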
Hierarchical supervision with score map connections during upsampling process.
  • To enhance score map fusion, before the bilinear interpolation the output at a later layer is fused with the same-resolution output at an earlier layer by element-wise sum.
  • In the testing phase, only the highest-resolution result of the last unit is used as the final prediction.
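The score map connection itself is just an element-wise sum of same-resolution score maps; a minimal sketch:

```python
import numpy as np

def fuse_scores(score_maps):
    """Element-wise sum of same-resolution score maps from successive stages."""
    fused = score_maps[0]
    for s in score_maps[1:]:
        fused = fused + s
    return fused

earlier = np.full((21, 8, 8), 0.5)   # score map from an earlier stage
later = np.full((21, 8, 8), 1.5)     # score map from a later stage
print(float(fuse_scores([earlier, later]).mean()))   # 2.0
```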

5. Ablation Study

  • The PASCAL VOC 2012 validation set is used.

5.1. Stacking Multiple SDN Units & Stacked Network Design

Different Stacked SDN Structures
PASCAL VOC 2012 Validation Set
  • SDN_M1: One SDN unit, 78.2% mIoU.
  • SDN_M1+: One SDN unit with naive large decoder, 78.6% mIoU.
  • SDN_M2: Two SDN units, 79.2% mIoU.
  • SDN_M3: Three SDN units, 79.9% mIoU.
  • The more SDN units, the higher the mIoU.
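All of these ablation numbers are mean intersection-over-union; for reference, a minimal sketch of how mIoU is computed (standard definition, not code from the paper):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU averaged over classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 4))   # (1/2 + 2/3) / 2 = 0.5833
```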
Some Visualizations of PASCAL VOC 2012 Validation Set

5.2. Hierarchical Supervision

  • SDN_M1_1: Supervision added only at up ratio {4}, 77.5% mIoU.
  • SDN_M1_2: Supervision added at up ratios {8, 4}, 78.0% mIoU.
  • SDN_M1: Supervision added at up ratios {16, 8, 4}, 78.2% mIoU.

5.3. Score Map Connections

PASCAL VOC 2012 Validation Set
  • SDN_M2-: Without score map connection, 78.8% mIoU.
  • SDN_M2: With score map connection, 79.2% mIoU.

5.4. Some Improvement Strategies

PASCAL VOC 2012 Validation Set
  • Up: Cascading an upsampling block to restore high-resolution features, 79.6% mIoU.
  • MS_Flip: Averaging the segmentation probability maps from 5 image scales {0.5, 0.8, 1, 1.2, 1.4} and their mirrors at inference, 80.7% mIoU.
  • COCO: Pretraining on the MS COCO dataset, 84.8% mIoU.
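The MS_Flip strategy above can be sketched as follows (a hypothetical sketch: resizing the scaled predictions back to a common resolution is elided, and the toy `predict` function is an assumption, not the paper's model):

```python
import numpy as np

def ms_flip_average(predict, image, scales=(0.5, 0.8, 1.0, 1.2, 1.4)):
    """Average probability maps over 5 scales and their horizontal mirrors."""
    probs = []
    for s in scales:
        probs.append(predict(image, s))                        # scaled input
        probs.append(predict(image[:, ::-1], s)[:, :, ::-1])   # mirror, then unflip
    return np.mean(probs, axis=0)

# toy stand-in "model": two-class probabilities, ignores the scale argument
predict = lambda img, s: np.stack([img, 1.0 - img])
image = np.linspace(0, 1, 16).reshape(4, 4)
avg = ms_flip_average(predict, image)
print(avg.shape)   # (2, 4, 4): C x H x W averaged probability map
```

Note that the mirrored prediction must be flipped back before averaging so that all 10 maps are pixel-aligned.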

6. Comparisons with State-of-the-art Approaches

6.1. PASCAL VOC 2012 Test Set

PASCAL VOC 2012 Test Set
Some Visualizations: Input (Left), Groundtruth (Middle), SDN (Right)

6.2. CamVid Test Set

CamVid Test Set
Some Visualizations: Input (Top), Groundtruth (Middle), SDN (Bottom)

6.3. GATECH Test Set

GATECH Test Set
  • SDN obtains 53.5% mIoU.
  • With the network pretrained on VOC 2012, SDN+ obtains 55.9% mIoU.
  • And SDN outperforms FC-DenseNet.
Some Visualizations: Input (Top), Groundtruth (Middle), SDN (Bottom)

Reference

[2017 arXiv] [SDN]
Stacked Deconvolutional Network for Semantic Segmentation

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [LC] [FC-DenseNet] [IDW-CNN] [SDN]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]
