Review: DeepLabv3+ — Atrous Separable Convolution (Semantic Segmentation)

Outperforms LC, ResNet-DUC-HDC, GCN, RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3

Sik-Ho Tsang
7 min read · Sep 29, 2019

Let’s review DeepLabv3+, which was invented by Google. The DeepLab series has evolved through several versions: DeepLabv1 (2015 ICLR), DeepLabv2 (2018 TPAMI), and DeepLabv3 (arXiv).

  • (a): With Atrous Spatial Pyramid Pooling (ASPP), multi-scale contextual information can be encoded.
  • (b): With an encoder-decoder architecture, the location/spatial information is recovered. Encoder-decoder architectures have proven useful in the literature, such as FPN, DSSD, TDM, SharpMask, RED-Net, and U-Net, for different kinds of purposes.
  • (c): DeepLabv3+ makes use of both (a) and (b).
  • Further, with the use of the Modified Aligned Xception and atrous separable convolution, a faster and stronger network is developed.
  • Finally, DeepLabv3+ outperforms PSPNet (1st place in the 2016 ILSVRC Scene Parsing Challenge) and its predecessor DeepLabv3.

It was published in 2018 ECCV with more than 600 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Atrous Separable Convolution
  2. Encoder-Decoder Architecture
  3. Modified Aligned Xception
  4. Ablation Study
  5. Comparison with State-of-the-art Approaches

1. Atrous Separable Convolution

1.1. Atrous Convolution

Atrous Convolution with Different Rates r
  • For each location i on the output y and a filter w, atrous convolution is applied over the input feature map x as y[i] = Σ_k x[i + r·k] · w[k], where the atrous rate r corresponds to the stride with which we sample the input signal.
  • (More details about atrous convolution in my DeepLabv3 review.)
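To make the sampling concrete, here is a minimal 1-D NumPy sketch of that formula; the function and variable names are my own, for illustration only:

```python
# 1-D atrous convolution: y[i] = sum_k x[i + r*k] * w[k].
# The filter taps are simply spaced r samples apart.
import numpy as np

def atrous_conv1d(x, w, r):
    """Valid-mode 1-D atrous convolution with rate r (illustrative)."""
    k = len(w)
    span = (k - 1) * r + 1            # receptive field of the dilated filter
    out_len = len(x) - span + 1
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(out_len)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(atrous_conv1d(x, w, r=1))  # rate 1: standard convolution
print(atrous_conv1d(x, w, r=2))  # rate 2: same filter, taps spaced 2 apart
```

With rate 1 this reduces to a standard convolution; increasing r enlarges the receptive field without adding parameters.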

1.2. Atrous Separable Convolution

Depthwise Separable Convolution Using Atrous Convolution
  • (a) and (b), Depthwise Separable Convolution: It factorizes a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., a 1×1 convolution), which drastically reduces computational complexity.
  • This was introduced in MobileNetV1. (If interested, please read my review on MobileNetV1 about depthwise separable convolution.)
  • (c) Atrous Depthwise Convolution: Atrous convolution is supported in the depthwise convolution, and it is found that this significantly reduces the computational complexity of the proposed model while maintaining similar (or better) performance.
  • Combined with the pointwise convolution, this forms the Atrous Separable Convolution, sketched below.
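Here is a minimal PyTorch sketch of an atrous separable convolution, assuming 3×3 kernels: a dilated depthwise convolution (groups equal to the input channels) followed by a 1×1 pointwise convolution. The module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        # Depthwise: one 3x3 filter per channel, dilated by `rate`;
        # padding=rate keeps the spatial size unchanged.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=rate, dilation=rate,
                                   groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 33, 33)
y = AtrousSeparableConv(64, 128, rate=2)(x)
print(y.shape)  # torch.Size([1, 128, 33, 33])
```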

2. Encoder-Decoder Architecture

DeepLabv3+ Extends DeepLabv3

2.1. DeepLabv3 as Encoder

  • For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution, i.e., output stride = 32.
  • For the task of semantic segmentation, this is too coarse.
  • One can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying atrous convolution correspondingly.
  • Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling (ASPP) module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with image-level features; a sketch follows this list.
  • (Please read my DeepLabv3 review for details of the encoder and my ParseNet review for image-level features.)
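Below is a compact PyTorch sketch of such an ASPP head, assuming output stride = 16 and the rates (6, 12, 18) from the DeepLabv3 paper. All module and variable names are illustrative, and batch normalization is omitted for brevity; this is not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        # Parallel 3x3 atrous branches, one per rate.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates])
        self.image_pool = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # on pooled features
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [b(x) for b in self.branches]
        # Image-level features: global average pool, 1x1 conv, upsample back.
        pooled = F.adaptive_avg_pool2d(x, 1)
        feats.append(F.interpolate(self.image_pool(pooled), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```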

2.2. Proposed Decoder

  • The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features.
  • A 1×1 convolution is applied on the low-level features before concatenation to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features.
  • After the concatenation, a few 3×3 convolutions are applied to refine the features, followed by another simple bilinear upsampling by a factor of 4; see the sketch after this list.
  • This is much better than bilinearly upsampling 16× directly.
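A minimal PyTorch sketch of this decoder, assuming a 256-channel encoder output and the 48-channel 1×1 reduction plus two [3×3 conv; 256] refinement layers found best in the ablation of Section 4.1; class and layer names are my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    def __init__(self, low_level_ch, encoder_ch=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Conv2d(low_level_ch, 48, 1, bias=False)  # shrink low-level channels
        self.refine = nn.Sequential(
            nn.Conv2d(encoder_ch + 48, 256, 3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1, bias=False),
            nn.ReLU(inplace=True))
        self.classify = nn.Conv2d(256, num_classes, 1)

    def forward(self, encoder_out, low_level):
        # First 4x upsample: bring encoder output to the low-level feature size.
        x = F.interpolate(encoder_out, size=low_level.shape[2:],
                          mode='bilinear', align_corners=False)
        x = self.refine(torch.cat([x, self.reduce(low_level)], dim=1))
        x = self.classify(x)
        # Final 4x bilinear upsampling back to the input resolution.
        return F.interpolate(x, scale_factor=4, mode='bilinear',
                             align_corners=False)
```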

3. Modified Aligned Xception

3.1. Aligned Xception

Aligned Xception
  • Xception was introduced for image classification.
  • Aligned Xception was then introduced in the Deformable Convolutional Network (DCN) for object detection.
  • The updates of Aligned Xception over the original Xception are shown in blue.
  • In brief, some of the max pooling operations are replaced by separable convolutions in the entry flow, the number of repetitions in the middle flow is increased from 8 to 16, and one more convolution is added in the exit flow.
  • (Please read my reviews on Xception and DCN if interested.)

3.2. Modified Aligned Xception

Modified Aligned Xception
  • Compared with Aligned Xception, a deeper Xception network is used.
  • All max pooling operations are replaced by depthwise separable convolutions with striding, which allows atrous separable convolution to be applied to extract feature maps at an arbitrary resolution.
  • Extra batch normalization and ReLU activation are added after each 3×3 depthwise convolution; a sketch of this unit follows.
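A short PyTorch sketch of such a separable convolution unit, with BN and ReLU after both the depthwise and the pointwise convolution, and optional striding in place of max pooling. This is a simplified illustration under my own naming, not the released code.

```python
import torch.nn as nn

def modified_sep_conv(in_ch, out_ch, stride=1, rate=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=rate,
                  dilation=rate, groups=in_ch, bias=False),  # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),        # extra BN + ReLU
        nn.Conv2d(in_ch, out_ch, 1, bias=False),             # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
```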

4. Ablation Study

4.1. Decoder Design

Effects of different numbers of channels in the 1×1 convolution on the PASCAL VOC 2012 val set
  • It is found that the best performance is obtained when the 1×1 convolution used to reduce the channels of the low-level feature map has 48 channels.
PASCAL VOC 2012 val set
  • And it is most effective to use the Conv2 feature map (before striding) and two extra [3×3 conv; 256 channels] operations.

4.2. Model Variants with ResNet as Backbone

PASCAL VOC 2012 val set
  • Baseline (first row block): 77.21% to 79.77% mIOU.
  • With decoder (second row block): the performance is improved from 77.21% to 78.85%, or from 78.51% to 79.35%.
  • The performance is further improved to 80.57% when using multi-scale and left-right flipped inputs.
  • Coarser feature maps (third row block), i.e. output stride = 32: the performance is not good.

4.3. Modified Aligned Xception as Backbone

PASCAL VOC 2012 val set
  • Baseline (first row block): 79.17% to 81.34% mIOU.
  • With decoder (second row block): 79.93% to 81.63% mIOU.
  • Using depthwise separable convolution (third row block): the number of Multiply-Adds is significantly reduced by 33% to 41%, while similar mIOU performance is obtained.
  • Pretraining on COCO (fourth row block): an extra 2% improvement.
  • Pretraining on JFT (fifth row block): an extra 0.8% to 1% improvement.

4.4. Visualization

Visualizations on the val set; the last row shows a failure mode

5. Comparison with State-of-the-art Approaches

5.1. PASCAL VOC 2012 Test Set

5.2. Cityscapes

val set
  • X-71: With the deeper Modified Aligned Xception network (compared with X-65), and the use of the decoder and ASPP but with the image-level features removed, 79.55% mIOU is obtained.
  • The image-level features are more effective on the PASCAL VOC 2012 dataset.
test set

Many techniques and approaches in DeepLabv3+ are based on previous SOTA approaches. The story would become too long if I included all of them. Please feel free to read those reviews. Thanks.

Reference

[2018 ECCV] [DeepLabv3+]
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]

Generative Adversarial Network [GAN]

