Review: ResNet-DUC-HDC — Dense Upsampling Convolution and Hybrid Dilated Convolution (Semantic Segmentation)

Outperforms FCN, DilatedNet, and DeepLabv2

Sik-Ho Tsang
Aug 21, 2019

In this story, the ResNet-DUC-HDC framework is reviewed. Two major techniques are proposed:

  • DUC (Dense Upsampling Convolution) — generates pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling.
  • HDC (Hybrid Dilated Convolution) — 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates the “gridding issue” problem caused by standard dilated convolution.

The paper was published at WACV 2018 and has more than 200 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. DUC (Dense Upsampling Convolution)
  2. HDC (Hybrid Dilated Convolution)
  3. Ablation Study
  4. Test Set Results

1. DUC (Dense Upsampling Convolution)

HDC (Left Part) & DUC (Right Part)
  • First, ResNet is used as the backbone for feature extraction.
  • In a conventional FCN, a feature map of dimension h×w×c is obtained at the final layer before making predictions, where h=H/d, w=W/d, and d is the downsampling factor.
  • Bilinear upsampling or a deconvolution network is then used to upsample the prediction. Bilinear upsampling is not learnable and may lose fine details.
  • DUC is applied here to make better predictions, as shown above.
  • First, a convolution is applied to the ResNet feature map of dimension h×w×c to get an output feature map of dimension h×w×(d²×L), where L is the total number of classes in the semantic segmentation task.
  • The output feature map is then reshaped to H×W×L with a softmax layer, and an elementwise argmax operator is applied to get the final label map.
  • The key idea is to divide the whole label map into d² equal subparts, each with the same height and width as the incoming feature map.
  • That is to say, the whole label map is transformed into a smaller label map with multiple channels.
  • Suppose a downsampling rate of 1/16 is used. If an object is shorter or narrower than 16 pixels, bilinear upsampling will most likely fail to recover it.
  • DUC gives better pixel-level decoding, and it is end-to-end trainable.
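
The reshape step described above can be sketched in plain Python. This is only an illustration, not the authors' implementation: the function names `duc_reshape` and `argmax_labels` are hypothetical, and the channel layout (which of the d²×L channels belongs to which pixel of a d×d cell) is an assumption made for the example.

```python
def duc_reshape(feat, d, num_classes):
    """Rearrange an h x w x (d*d*L) score map into an (h*d) x (w*d) x L map.

    Assumed channel layout: channel (a*d + b)*L + l holds the score of
    class l for the pixel at row offset a, column offset b in each d x d cell.
    """
    h, w = len(feat), len(feat[0])
    out = [[None] * (w * d) for _ in range(h * d)]
    for i in range(h):
        for j in range(w):
            for a in range(d):
                for b in range(d):
                    base = (a * d + b) * num_classes
                    out[i * d + a][j * d + b] = feat[i][j][base:base + num_classes]
    return out


def argmax_labels(scores):
    """Per-pixel argmax over classes (softmax would not change the argmax)."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in scores]


# Toy case: h = w = 1, d = 2, L = 2, so one feature-map cell carries
# d*d*L = 8 scores and decodes to a 2 x 2 label map.
feat = [[[0.9, 0.1, 0.2, 0.8, 0.3, 0.7, 0.6, 0.4]]]
scores = duc_reshape(feat, d=2, num_classes=2)
labels = argmax_labels(scores)
print(labels)  # [[0, 1], [1, 0]]
```

Every output pixel gets its own learned scores from the preceding convolution, which is why a small object can survive here where interpolation from a coarse map would lose it.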

2. HDC (Hybrid Dilated Convolution)

Gridding Problem (Top), Gridding Problem Solved (Bottom)
  • (Top): As shown at the top of the figure above, with a constant dilation rate for consecutive dilated convolutions, some locations are never covered by the kernels, which creates the gridding problem.
  • (Bottom): With properly increasing dilation rates for consecutive dilated convolutions, better coverage is achieved within the effective receptive field, which yields higher accuracy.
  • This differs from the atrous spatial pyramid pooling (ASPP) module in DeepLabv2, which uses extra modules in parallel.
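
To make the gridding effect concrete, here is a small 1-D sketch that counts which input positions can influence one output unit after stacking three dilated 3-tap convolutions. The rate lists [2, 2, 2] and [1, 2, 3] and the function names are chosen for illustration only. The 2-D case behaves the same way along each axis.

```python
from itertools import product


def covered_offsets(rates, kernel_size=3):
    """1-D input offsets that reach one output unit through a stack of
    dilated convolutions with the given dilation rates."""
    half = kernel_size // 2
    taps = range(-half, half + 1)
    return {sum(k * r for k, r in zip(ks, rates))
            for ks in product(taps, repeat=len(rates))}


def coverage(rates, kernel_size=3):
    """Return (receptive field width, number of positions actually covered)."""
    half = kernel_size // 2
    width = 2 * half * sum(rates) + 1
    return width, len(covered_offsets(rates, kernel_size))


print(coverage([2, 2, 2]))  # (13, 7): only even offsets are touched
print(coverage([1, 2, 3]))  # (13, 13): every position is covered
```

Both stacks span the same 13-wide receptive field, but the constant-rate stack only ever reads every second position, while the increasing rates leave no holes.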

3. Ablation Study

3.1. DUC (Dense Upsampling Convolution)

DS: Downsampling Rate, Cell: Neighboring pixels for DUC
  • Baseline: 70.9% mIoU is obtained.
  • DS=8: 72.3%, yields better results than DS=4.
  • DUC with ASPP: generally helps to improve the accuracy to 72.8%.
  • With data augmentation: 74.3%.
  • With cell=2 for DUC: more pixels involved, 74.7%.
  • In addition, because each frame is large, it is divided into 800×800 patches for prediction. Training with a larger patch size of 880×880 boosts the performance to 75.7%.
  • Applying CRF: 76.7%.
Cityscapes Val Set: Input Image, Ground-truth, Baseline, ResNet-DUC
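
The percentages in this ablation are mIoU scores. As a reminder of what the metric measures, here is a minimal sketch on flat label lists. It is a simplification: benchmark evaluation may accumulate intersections and unions over the whole dataset and skip ignored labels, which this sketch does not do. The function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes that occur in either
    the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
print(round(mean_iou(pred, target, num_classes=3), 4))  # 0.5
```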

3.2. HDC (Hybrid Dilated Convolution)

  • No dilation: 72.9%.
  • Dilation-conv: For all blocks that contain dilation, every 2 blocks are grouped together, with r = 2 for the first block and r = 1 for the second block: 75.0%.
  • Dilation-RF: Dilation rates are set to {1, 2, 3} or {3, 4, 5}, depending on the block position: 75.4%.
  • Dilation-bigger: Larger dilation rates of {1, 2, 5, 9}, {1, 2, 5} and {5, 9, 17}: 76.4%.
Gridding Effects: Ground Truth (Top), ResNet-DUC (Middle), ResNet-DUC-HDC (Dilation-RF) (Bottom)
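
The grouping used by these variants amounts to cycling a short pattern of dilation rates over consecutive blocks. Here is a minimal sketch, where the helper `assign_rates` and the block count of 6 are hypothetical and do not reflect the real stage depths of the network:

```python
def assign_rates(num_blocks, pattern):
    """Cycle a dilation-rate pattern over consecutive residual blocks."""
    return [pattern[i % len(pattern)] for i in range(num_blocks)]


# num_blocks = 6 is an illustrative stage depth, not the real one.
print(assign_rates(6, [2, 1]))     # Dilation-conv style: [2, 1, 2, 1, 2, 1]
print(assign_rates(6, [1, 2, 3]))  # Dilation-RF style:   [1, 2, 3, 1, 2, 3]
```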

3.3. Deeper Networks

  • Deeper: The deeper ResNet-152 generally yields better results than ResNet-101 for every combination.
  • Coarse: Adding coarse data also yields better results.

3.4. Visualizations

Cityscapes Val Set: Input Image, Ground-truth, ResNet-DUC, ResNet-DUC-HDC

4. Test Set Results

4.1. Cityscapes

Cityscapes Test Set
  • ResNet-DUC-HDC: 77.6%, outperforms FCN, DilatedNet, and DeepLabv2.
  • ResNet-DUC-HDC-Coarse: 80.1%, outperforms ResNet-DUC-HDC.

4.2. KITTI Road Segmentation

  • ResNet-DUC-HDC obtains 93.8% AP.
Examples

4.3. PASCAL VOC 2012

PASCAL VOC 2012 Test Set
PASCAL VOC 2012 val set: Input, Ground Truth, Before CRF, After CRF

Reference

[2018 WACV] [ResNet-DUC-HDC]
Understanding Convolution for Semantic Segmentation

