Review: ResNet-DUC-HDC — Dense Upsampling Convolution and Hybrid Dilated Convolution (Semantic Segmentation)

Outperforms FCN, DilatedNet, and DeepLabv2

Aug 21, 2019

In this story, the ResNet-DUC-HDC framework is reviewed. Two major techniques are proposed:

DUC (Dense Upsampling Convolution): generates pixel-level predictions, and is able to capture and decode more detailed information that is generally missing in bilinear upsampling.
HDC (Hybrid Dilated Convolution): 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates the “gridding issue” caused by standard dilated convolution.

This paper was published in 2018 WACV and has more than 200 citations. (Sik-Ho Tsang @ Medium)

Outline

DUC (Dense Upsampling Convolution)
HDC (Hybrid Dilated Convolution)
Ablation Study
Test Set Results

1. DUC (Dense Upsampling Convolution)

HDC (Left Part) & DUC (Right Part)

First, ResNet is used as backbone for feature extraction.
In conventional FCN, at the final layer, a feature map with dimension h×w×c is obtained before making predictions, where h=H/d, w=W/d, and d is the downsampling factor.
Bilinear upsampling or a deconvolution network is conventionally used to upsample this feature map, but bilinear upsampling tends to miss fine details.
DUC is applied here to make better prediction, as shown above.
First, a convolution is applied to the feature map from ResNet, of dimension h×w×c, to get an output feature map of dimension h×w×(d²×L), where L is the total number of classes in the semantic segmentation task.
Then, the output feature map is reshaped to H×W×L with a softmax layer, and an elementwise argmax operator is applied to get the final label map.
The key idea is to divide the whole label map into d² equal subparts, each with the same height and width as the incoming feature map.
That is to say, the whole label map is transformed into a smaller label map with multiple channels.
Suppose a downsampling rate of 1/16 is used. If there is an object with height/width smaller than 16 pixels, it is more than likely that bilinear upsampling will not be able to recover this object.
DUC helps to have better pixel-level decoding, and it is end-to-end trainable.
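The reshape step above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' implementation: the function names duc_reshape and label_map are hypothetical, and the channel layout (sub-pixel offset (i, j) stored at channels (i×d + j)×L to (i×d + j)×L + L − 1) is an assumption, since the layout is not specified here. The softmax is skipped because it does not change the argmax.

```python
def duc_reshape(scores, d, num_classes):
    """Rearrange an h x w x (d*d*L) score map into an H x W x L map,
    where H = h*d and W = w*d.

    Assumed channel layout (not taken from the paper): the L scores for
    sub-pixel offset (i, j) start at channel (i*d + j) * L.
    """
    h, w = len(scores), len(scores[0])
    out = [[None] * (w * d) for _ in range(h * d)]
    for y in range(h):
        for x in range(w):
            cell = scores[y][x]
            assert len(cell) == d * d * num_classes
            for i in range(d):
                for j in range(d):
                    start = (i * d + j) * num_classes
                    out[y * d + i][x * d + j] = cell[start:start + num_classes]
    return out


def label_map(score_map):
    """Elementwise argmax over the class dimension."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in score_map]


# Toy example: a 1x1 feature map, d = 2, L = 3 classes -> a 2x2 label map.
scores = [[[float(c) for c in range(12)]]]
full = duc_reshape(scores, d=2, num_classes=3)
labels = label_map(full)
```

Each low-resolution location thus predicts a full d×d block of labels, which is why small objects do not have to be recovered by interpolation.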

2. HDC (Hybrid Dilated Convolution)

Gridding Problem (Top), Gridding Problem Solved (Bottom)

(Top): As shown at the top of the figure above, with a constant dilation rate for consecutive dilated convolutions, some locations are not covered by the dilated convolutions, which creates the gridding problem.
(Bottom): With properly increasing dilation rates for consecutive dilated convolutions, a better coverage is achieved within the effective receptive field, which yields higher accuracy.
It is different from the atrous spatial pyramid pooling (ASPP) module in DeepLabv2, which uses extra modules in parallel.
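The gridding effect can be checked with a small 1D sketch. Assuming stacked 3-tap dilated convolutions with stride 1 (an illustrative simplification of the 2D case, with hypothetical function names), the code below collects which input offsets can influence one output position and reports what fraction of the receptive field span is actually covered.

```python
def covered_offsets(rates, kernel_size=3):
    """Input offsets that can reach one output position after stacking
    1D dilated convolutions with the given dilation rates."""
    half = kernel_size // 2
    offsets = {0}
    for r in rates:
        offsets = {o + k * r for o in offsets for k in range(-half, half + 1)}
    return offsets


def coverage_ratio(rates, kernel_size=3):
    """Fraction of positions inside the receptive field span that are used."""
    offsets = covered_offsets(rates, kernel_size)
    span = max(offsets) - min(offsets) + 1
    return len(offsets) / span


constant = coverage_ratio([2, 2, 2])    # only even offsets are reachable
increasing = coverage_ratio([1, 2, 5])  # every offset in the span is reachable
```

With a constant rate of 2, only the even offsets are ever touched, which is the 1D counterpart of the checkerboard pattern in the top row of the figure. With increasing rates, the holes are filled.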

3. Ablation Study

The baseline model is DeepLabv2 with ResNet-101.
Cityscapes val set is used.

3.1. DUC (Dense Upsampling Convolution)

DS: Downsampling Rate, Cell: Neighboring pixels for DUC

Baseline: 70.9% mIoU is obtained.
DS=8: 72.3%, yields better results than DS=4.
DUC with ASPP: generally helps to improve the accuracy to 72.8%.
With data augmentation: 74.3%.
With cell=2 for DUC: more pixels involved, 74.7%.
In addition, as frame size is large, it is divided into 800×800 patches for prediction. When training with larger patch size of 880×880, the performance is boosted to 75.7%.
Applying CRF: 76.7%.
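All numbers in this ablation are mIoU (mean intersection over union). As a reminder of what is being measured, here is a minimal sketch of the metric on integer label maps. The function name mean_iou is hypothetical, and this simplified version does not handle the ignored/void labels that the official Cityscapes evaluation excludes.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    pred and target are 2D lists of integer class labels of equal size.
    Simplification: no "ignore" label handling.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                inter[t] += 1
                union[t] += 1
            else:
                # A mismatched pixel counts towards the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
score = mean_iou(pred, target, num_classes=2)  # IoUs are 1/2 and 2/3
```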

Cityscapes Val Set: Input Image, Ground-truth, Baseline, ResNet-DUC

3.2. HDC (Hybrid Dilated Convolution)

No dilation: 72.9%.
Dilation-conv: For all blocks that contain dilation, every 2 blocks are grouped together, with r = 2 for the first block and r = 1 for the second block, 75.0%.
Dilation-RF: Dilation rates are set to {1, 2, 3} and {3, 4, 5}, depending on the block positions, 75.4%.
Dilation-bigger: A larger dilation rate of {1, 2, 5, 9}, {1, 2, 5} and {5, 9, 17}, 76.4%.
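To get a feel for how much these settings enlarge the receptive field, the standard formula for stacked stride-1 convolutions can be applied to each group of rates. This is a rough sketch under stated assumptions: 3×3 kernels, stride 1, and each group considered in isolation. How the groups map onto the actual ResNet blocks is not modeled here, and receptive_field is a hypothetical helper name.

```python
def receptive_field(rates, kernel_size=3):
    """Side length of the receptive field of stacked stride-1 dilated
    convolutions: 1 + sum((k - 1) * r) over the layers."""
    return 1 + sum((kernel_size - 1) * r for r in rates)


groups = {
    "Dilation-conv": [(2, 1)],
    "Dilation-RF": [(1, 2, 3), (3, 4, 5)],
    "Dilation-bigger": [(1, 2, 5, 9), (1, 2, 5), (5, 9, 17)],
}

for name, rate_groups in groups.items():
    for rates in rate_groups:
        print(name, rates, receptive_field(rates))
```

Under these assumptions, the {5, 9, 17} group alone spans 63 pixels per side, versus 25 for {3, 4, 5}, which is consistent with the larger rates aggregating more global context.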

Gridding Effects: Ground Truth (Top), ResNet-DUC (Middle), ResNet-DUC-HDC (Dilation-RF) (Bottom)

3.3. Deeper Networks

Deeper: Generally, the deeper ResNet-152 yields better results than ResNet-101 for all combinations.
Coarse: Adding coarse data also yields better results.

3.4. Visualizations

Cityscapes Val Set: Input Image, Ground-truth, ResNet-DUC, ResNet-DUC-HDC

4. Test Set Results

4.1. Cityscapes

Cityscapes Test Set

ResNet-DUC-HDC: 77.6%, outperforms FCN, DilatedNet, and DeepLabv2.
ResNet-DUC-HDC-Coarse: 80.1%, outperforms ResNet-DUC-HDC.

4.2. KITTI Road Segmentation

ResNet-DUC-HDC obtains 93.8% AP.

Examples

4.3. PASCAL VOC 2012

PASCAL VOC 2012 Test Set

ResNet-DUC: outperforms DeepLabv2.

PASCAL VOC 2012 val set: Input, Ground Truth, Before CRF, After CRF

Reference

[2018 WACV] [ResNet-DUC-HDC]
Understanding Convolution for Semantic Segmentation

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]
