Review: ResNet-DUC-HDC — Dense Upsampling Convolution and Hybrid Dilated Convolution (Semantic Segmentation)
Outperforms FCN, DilatedNet, and DeepLabv2
In this story, the ResNet-DUC-HDC framework is reviewed. There are two major techniques proposed here:
- DUC (Dense Upsampling Convolution) — generates pixel-level predictions, capturing and decoding the detailed information that is generally missing in bilinear upsampling.
- HDC (Hybrid Dilated Convolution) — 1) effectively enlarges the receptive field (RF) of the network to aggregate global information; 2) alleviates the “gridding” problem caused by standard dilated convolution.
This is published in 2018 WACV with more than 200 citations. (Sik-Ho Tsang @ Medium)
Outline
- DUC (Dense Upsampling Convolution)
- HDC (Hybrid Dilated Convolution)
- Ablation Study
- Test Set Results
1. DUC (Dense Upsampling Convolution)
- First, ResNet is used as the backbone for feature extraction.
- In conventional FCN, at the final layer, a feature map with dimension h×w×c is obtained before making predictions, where h=H/d, w=W/d, and d is the downsampling factor.
- Conventionally, bilinear upsampling or a deconvolution network is used to upsample this feature map back to the full resolution, which often fails to recover fine details.
- DUC is applied here to make a better prediction, as shown above.
- First, a convolution is applied on the ResNet feature map of dimension h×w×c to obtain an output feature map of dimension h×w×(d²×L), where L is the total number of classes in the semantic segmentation task.
- The output feature map is then reshaped to H×W×L, a softmax layer is applied, and an elementwise argmax operator gives the final label map.
- The key idea is to divide the whole label map into equal d² subparts which have the same height and width as the incoming feature map.
- This is to say, the whole label map is transformed into a smaller label map with multiple channels.
- For example, suppose a downsampling rate of 1/16 is used: if an object has a height or width smaller than 16 pixels, it is more than likely that bilinear upsampling will not be able to recover this object.
- DUC helps to have better pixel-level decoding, and it is end-to-end trainable.
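Below is a minimal sketch of the DUC idea, assuming PyTorch; the kernel size, channel counts, and class/module names are illustrative choices, not the exact configuration from the paper. A convolution maps the h×w×c feature map to d²×L channels, and a pixel-shuffle rearrangement turns those channels into an H×W×L prediction.

```python
import torch
import torch.nn as nn

class DUC(nn.Module):
    """Dense Upsampling Convolution (sketch).
    A convolution maps the h x w x c feature map to h x w x (d^2 * L),
    and PixelShuffle rearranges the d^2 * L channels into an L-channel
    map of size (h*d) x (w*d) = H x W, so the upsampling is learned."""
    def __init__(self, in_channels, num_classes, downsample_factor):
        super().__init__()
        d, L = downsample_factor, num_classes
        self.conv = nn.Conv2d(in_channels, d * d * L, kernel_size=1)
        self.shuffle = nn.PixelShuffle(upscale_factor=d)

    def forward(self, x):          # x: (N, c, h, w)
        x = self.conv(x)           # (N, d*d*L, h, w)
        return self.shuffle(x)     # (N, L, h*d, w*d), per-class logits

# Hypothetical usage with ResNet features (c=2048), d=8, L=19 (Cityscapes):
duc = DUC(in_channels=2048, num_classes=19, downsample_factor=8)
logits = duc(torch.randn(1, 2048, 64, 128))   # -> (1, 19, 512, 1024)
label_map = logits.argmax(dim=1)              # elementwise argmax over classes
```

Since the channel-to-pixel rearrangement itself has no parameters and the prediction comes from an ordinary convolution, the whole path stays end-to-end trainable.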
2. HDC (Hybrid Dilated Convolution)
- (Top): As shown at the top of the figure above, with a constant dilation rate for consecutive dilated convolutions, some locations are never covered by the dilated convolutions, which creates the gridding problem.
- (Bottom): With properly increasing dilation rates for consecutive dilated convolutions, better coverage is achieved within the effective receptive field, which yields higher accuracy.
- HDC is different from the atrous spatial pyramid pooling (ASPP) module in DeepLabv2, which uses extra parallel modules; HDC instead varies the dilation rates across the consecutive layers themselves.
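As a rough illustration (assuming PyTorch; the layer arrangement and channel count are illustrative, not the residual-block layout of the paper), an HDC group simply assigns increasing dilation rates such as 1, 2, 3 to consecutive 3×3 convolutions instead of repeating the same rate:

```python
import torch
import torch.nn as nn

def hdc_group(channels, rates=(1, 2, 3)):
    """Consecutive 3x3 dilated convolutions with increasing dilation rates.
    A constant rate such as (2, 2, 2) samples the input on a sparse grid
    (the gridding problem); increasing rates cover the receptive field more
    densely. Padding equals the dilation rate, so the spatial size is kept."""
    layers = []
    for r in rates:
        layers += [nn.Conv2d(channels, channels, kernel_size=3,
                             padding=r, dilation=r, bias=False),
                   nn.BatchNorm2d(channels),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# A group with rates 1, 2, 3 (HDC) versus a constant rate 2 (prone to gridding):
hdc = hdc_group(channels=256, rates=(1, 2, 3))
grid = hdc_group(channels=256, rates=(2, 2, 2))
y = hdc(torch.randn(1, 256, 64, 64))   # spatial size preserved: (1, 256, 64, 64)
```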
3. Ablation Study
- DeepLabv2 with ResNet-101 is used as the baseline model.
- The Cityscapes val set is used.
3.1. DUC (Dense Upsampling Convolution)
- Baseline: 70.9% mIoU is obtained.
- DS=8: 72.3%, which yields better results than DS=4.
- DUC with ASPP: generally helps to improve the accuracy to 72.8%.
- With data augmentation: 74.3%.
- With cell=2 for DUC: more pixels involved, 74.7%.
- In addition, since the frame size is large, each frame is divided into 800×800 patches for prediction (a simple patch-wise prediction sketch is given after this list). Training with a larger patch size of 880×880 boosts the performance to 75.7%.
- Applying CRF: 76.7%.
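The patch-wise prediction mentioned above can be sketched as follows, assuming PyTorch; the non-overlapping split, zero padding, and helper name predict_by_patches are simplifying assumptions for illustration, not necessarily the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def predict_by_patches(model, image, patch=800, num_classes=19):
    """Split a large frame into patch x patch crops, run the model on each
    crop, and stitch the logits back together (simplified, non-overlapping)."""
    n, _, H, W = image.shape
    pad_h = (patch - H % patch) % patch      # pad so H and W become
    pad_w = (patch - W % patch) % patch      # multiples of the patch size
    x = F.pad(image, (0, pad_w, 0, pad_h))
    out = torch.zeros(n, num_classes, H + pad_h, W + pad_w)
    with torch.no_grad():
        for top in range(0, H + pad_h, patch):
            for left in range(0, W + pad_w, patch):
                crop = x[:, :, top:top + patch, left:left + patch]
                out[:, :, top:top + patch, left:left + patch] = model(crop)
    return out[:, :, :H, :W]                 # drop the padded border
```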
3.2. HDC (Hybrid Dilated Convolution)
- No dilation: 72.9%.
- Dilation-conv: For all blocks that contain dilation, every 2 blocks are grouped together, with r = 2 for the first block and r = 1 for the second block: 75.0%.
- Dilation-RF: Dilation rates of {1, 2, 3} and {3, 4, 5} are used, depending on the block positions: 75.4%.
- Dilation-bigger: Larger dilation rates of {1, 2, 5, 9}, {1, 2, 5} and {5, 9, 17} are used (see the rate-assignment sketch below): 76.4%.
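One way to read the Dilation-conv / Dilation-RF / Dilation-bigger settings is as cycling a small list of rates over the residual blocks of a stage. The helper below is a hypothetical illustration of that assignment, not the exact per-block schedule used in the paper.

```python
from itertools import cycle

def assign_dilation_rates(num_blocks, rates):
    """Cycle the given dilation rates over the blocks of a ResNet stage,
    e.g. rates=(2, 1) for Dilation-conv or rates=(1, 2, 3) for Dilation-RF."""
    rate_iter = cycle(rates)
    return [next(rate_iter) for _ in range(num_blocks)]

# e.g. the 23 blocks of a ResNet-101 stage with rates cycled as 1, 2, 3, 1, 2, 3, ...
print(assign_dilation_rates(23, (1, 2, 3)))
```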
3.3. Deeper Networks
- Deeper: Generally, the deeper ResNet-152 yields better results than ResNet-101 for any combination.
- Coarse: Adding the coarse data also yields better results.
3.4. Visualizations
4. Test Set Results
4.1. Cityscapes
- ResNet-DUC-HDC: 77.6%, outperforms FCN, DilatedNet, and DeepLabv2.
- ResNet-DUC-HDC-Coarse: 80.1%, outperforms ResNet-DUC-HDC.
4.2. KITTI Road Segmentation
- ResNet-DUC-HDC obtains 93.8% AP.
4.3. PASCAL VOC 2012
- ResNet-DUC: outperforms DeepLabv2.
Reference
[2018 WACV] [ResNet-DUC-HDC] Understanding Convolution for Semantic Segmentation