# Review: DeepLabv3+ — Atrous Separable Convolution (Semantic Segmentation)

## Outperforms LC, ResNet-DUC-HDC, GCN, RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3

In this story, **DeepLabv3+**, by **Google**, is reviewed. The DeepLab series has evolved through several versions: DeepLabv1 (2015 ICLR), DeepLabv2 (2018 TPAMI), and DeepLabv3 (arXiv).

- **(a) Atrous Spatial Pyramid Pooling (ASPP)**: encodes multi-scale contextual information.
- **(b) Encoder-Decoder Architecture**: recovers the location/spatial information. Encoder-decoder architectures have proven useful in the literature, e.g. FPN, DSSD, TDM, SharpMask, RED-Net, and U-Net, for different kinds of purposes.
- **(c) DeepLabv3+** makes use of both (a) and (b).
- Further, with the use of **Modified Aligned Xception** and **Atrous Separable Convolution**, a faster and stronger network is developed.
- Finally, DeepLabv3+ outperforms PSPNet (1st place in the 2016 ILSVRC Scene Parsing Challenge) and its predecessor DeepLabv3.

It is published in **2018 ECCV** with more than **600 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Atrous Separable Convolution**
2. **Encoder-Decoder Architecture**
3. **Modified Aligned Xception**
4. **Ablation Study**
5. **Comparison with State-of-the-art Approaches**

# 1. **Atrous Separable Convolution**

## 1.1. Atrous Convolution

- For each location *i* on the output *y* and a filter *w*, atrous convolution is applied over the input feature map *x*, where the atrous rate *r* corresponds to the stride with which we sample the input signal.
- (More details in my DeepLabv3 review about Atrous Convolution.)
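To make the sampling concrete, below is a minimal pure-Python sketch of 1D atrous convolution; the function name, the "valid" padding choice, and the example signal are my own illustration, not from the paper:

```python
def atrous_conv1d(x, w, rate):
    """1D atrous (dilated) convolution: y[i] = sum_k x[i + rate*k] * w[k].

    "Valid" padding: the output is shorter than the input. With rate=1
    this reduces to an ordinary (dense) convolution.
    """
    k = len(w)
    span = rate * (k - 1) + 1  # effective receptive field of the dilated filter
    return [
        sum(x[i + rate * j] * w[j] for j in range(k))
        for i in range(len(x) - span + 1)
    ]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
print(atrous_conv1d(signal, kernel, rate=1))  # dense sampling: [6, 9, 12, 15, 18]
print(atrous_conv1d(signal, kernel, rate=2))  # every 2nd sample: [9, 12, 15]
```

Note how the same 3-tap filter covers a receptive field of 5 inputs at rate 2, with no extra parameters or computation.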

## 1.2. **Atrous Separable Convolution**

- **(a) and (b), Depthwise Separable Convolution**: It factorizes a standard convolution into **a depthwise convolution followed by a point-wise convolution (i.e., 1×1 convolution)**, which drastically reduces computation complexity.
- This is introduced in MobileNetV1. (If interested, please read my review on MobileNetV1 about Depthwise Separable Convolution.)
- **(c) Atrous Depthwise Convolution**: Atrous convolution is supported in the depthwise convolution. It is found that this significantly reduces the computation complexity of the proposed model while maintaining similar (or better) performance.
- Combined with the point-wise convolution, this becomes **Atrous Separable Convolution**.
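Putting the two steps together, here is a pure-Python sketch of atrous separable convolution on a multi-channel 1D signal (the function name and the toy filters are my own illustration, not the paper's notation):

```python
def atrous_separable_conv1d(x, dw, pw, rate):
    """Atrous separable convolution on a multi-channel 1D signal.

    x:  list of C input channels, each a list of samples
    dw: depthwise filters, dw[c] is the dilated filter for channel c
    pw: pointwise (1x1) weights, pw[o][c] mixes channel c into output o
    """
    k = len(dw[0])
    span = rate * (k - 1) + 1
    n_out = len(x[0]) - span + 1
    # step 1, atrous depthwise: each channel filtered independently, no mixing
    depthwise = [
        [sum(xc[i + rate * j] * wc[j] for j in range(k)) for i in range(n_out)]
        for xc, wc in zip(x, dw)
    ]
    # step 2, pointwise 1x1: mix channels at every position
    return [
        [sum(po[c] * depthwise[c][i] for c in range(len(x))) for i in range(n_out)]
        for po in pw
    ]

x = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]  # 2 channels, 5 samples
dw = [[1, 1], [1, -1]]                  # one 2-tap filter per channel
pw = [[1, 1]]                           # one output channel: sum both channels
print(atrous_separable_conv1d(x, dw, pw, rate=2))  # [[6, 8, 10]]
```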

# 2. **Encoder-Decoder Architecture**

## 2.1. DeepLabv3 as Encoder

- For the task of **image classification**, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution, thus **output stride = 32**.
- For the task of **semantic segmentation**, this is too small.
- One can adopt **output stride = 16 (or 8)** for denser feature extraction by removing the striding in the last one (or two) block(s) and **applying atrous convolution** correspondingly.
- Additionally, DeepLabv3 augments the **Atrous Spatial Pyramid Pooling (ASPP)** module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with image-level features.
- (Please read my DeepLabv3 review for details of the encoder and my ParseNet review for image-level features.)
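The "remove the striding, dilate instead" recipe can be written down as a tiny planner. This is a sketch assuming a ResNet-style backbone (stem stride 4 followed by three strided stages); the function and its return format are my own illustration, not from any library:

```python
def deeplab_strides(output_stride):
    """Per-stage (stride, dilation) plan for the last three backbone stages.

    Each stage normally halves the resolution; once the desired output
    stride is reached, striding is removed and the dilation rate is
    doubled instead, so the receptive field keeps growing.
    """
    assert output_stride in (8, 16, 32)
    strides, dilations = [], []
    current, dilation = 4, 1  # resolution stride and dilation after the stem
    for _ in range(3):        # the three strided stages
        if current < output_stride:
            strides.append(2)
            current *= 2
        else:                 # striding removed -> atrous convolution instead
            strides.append(1)
            dilation *= 2
        dilations.append(dilation)
    return strides, dilations

print(deeplab_strides(32))  # ([2, 2, 2], [1, 1, 1]) -- classification default
print(deeplab_strides(16))  # ([2, 2, 1], [1, 1, 2])
print(deeplab_strides(8))   # ([2, 1, 1], [1, 2, 4]) -- densest, most costly
```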

## 2.2. Proposed Decoder

- The encoder features are first **bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features**.
- A **1×1 convolution is applied to the low-level features** before concatenation to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features.
- After the concatenation, **a few 3×3 convolutions are applied to refine the features, followed by another simple bilinear upsampling by a factor of 4**.
- This is much better than a single direct 16× bilinear upsampling.
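The decoder path can be traced shape by shape. The helper below is an illustrative sketch assuming an output stride 16 encoder and the channel counts from the paper's ablation (48-channel reduction, 256-channel refinement); the helper itself is not from any library:

```python
def decoder_shapes(h, w, encoder_ch=256, reduced_ch=48, refine_ch=256):
    """Return (channels, height, width) at each step of the DeepLabv3+ decoder."""
    enc = (encoder_ch, h // 16, w // 16)     # encoder output at stride 16
    up4 = (enc[0], enc[1] * 4, enc[2] * 4)   # bilinear upsample x4 -> stride 4
    low = (reduced_ch, h // 4, w // 4)       # low-level feature after 1x1 conv
    cat = (up4[0] + low[0], h // 4, w // 4)  # concatenate along channels
    ref = (refine_ch, h // 4, w // 4)        # a few 3x3 refinement convs
    out = (ref[0], h, w)                     # bilinear upsample x4 (final 1x1 classifier omitted)
    return [enc, up4, low, cat, ref, out]

for shape in decoder_shapes(512, 512):
    print(shape)
```

For a 512×512 input this prints (256, 32, 32) for the encoder output, (304, 128, 128) after concatenation, and (256, 512, 512) at the end.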

# 3. **Modified Aligned Xception**

## 3.1. Aligned Xception

- Xception is introduced for image classification.
- Then Aligned Xception is introduced in Deformable Convolutional Network (DCN) for object detection.
- The updates from the original Xception to Aligned Xception are marked in blue in the paper's figure.
- In brief, some of the max pooling operations in the entry flow are replaced by separable convolutions, the number of repetitions in the middle flow is increased from 8 to 16, and one more convolution is added in the exit flow.
- (Please read my reviews on Xception and DCN if interested.)

## 3.2. Modified Aligned Xception

- Compared with Aligned Xception, a **deeper Xception** network is used.
- All max pooling operations are replaced by depthwise separable convolutions with striding, such that **atrous separable convolution** can be applied to extract feature maps at an arbitrary resolution.
- **Extra batch normalization and ReLU** activation are added after each 3×3 depthwise convolution.

# 4. **Ablation Study**

## 4.1. Decoder Design

- ResNet-101 is used as backbone first.

- It is found that using **48 channels** in the 1×1 convolution that reduces the channels of the low-level feature map gives the best performance.

- It is most effective to use the Conv2 (before striding) feature map together with two extra [3×3 conv; 256 channels] operations.

## 4.2. Model Variants with ResNet as Backbone

- **Baseline** (first row block): 77.21% to 79.77% mIOU.
- **With Decoder** (second row block): The performance is improved from 77.21% to 78.85%, or from 78.51% to 79.35%.
- The performance is further improved to 80.57% when using multi-scale and left-right flipped inputs.
- **Coarser feature maps** (third row block): i.e., output stride = 32; the performance is not good.

## 4.3. Modified Aligned Xception as Backbone

- **Baseline** (first row block): 79.17% to 81.34% mIOU.
- **With Decoder** (second row block): 79.93% to 81.63% mIOU.
- **Using Depthwise Separable Convolution** (third row block): Multiply-Adds are significantly reduced by 33% to 41%, while similar mIOU performance is obtained.
- **Pretraining on COCO** (fourth row block): Extra 2% improvement.
- **Pretraining on JFT** (fifth row block): Extra 0.8% to 1% improvement.
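The Multiply-Adds saving from depthwise separable convolution can be sanity-checked per layer: a k×k separable convolution costs roughly 1/c_out + 1/k² of the standard cost, and the atrous rate does not change this count. The layer sizes below are hypothetical, chosen only to illustrate the ratio:

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """Multiply-adds of a standard k x k convolution (stride 1, same padding)."""
    return h * w * k * k * c_in * c_out

def separable_conv_macs(h, w, k, c_in, c_out):
    """Depthwise (k x k per channel) + pointwise (1x1) convolution.

    Dilation only spreads the taps out; the per-output work stays
    k*k (depthwise) + c_in (pointwise) multiplications.
    """
    depthwise = h * w * k * k * c_in
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# hypothetical layer: 64x64 feature map, 3x3 kernel, 256 -> 256 channels
std = standard_conv_macs(64, 64, 3, 256, 256)
sep = separable_conv_macs(64, 64, 3, 256, 256)
print(f"standard:  {std:,} MACs")
print(f"separable: {sep:,} MACs ({sep / std:.1%} of standard)")
```

The whole-model reduction reported above is smaller (33% to 41%) because only parts of the network, such as the ASPP and decoder modules, are converted while the rest is unchanged.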

## 4.4. Visualization

# 5. **Comparison with State-of-the-art Approaches**

## 5.1. PASCAL VOC 2012 Test Set

- DeepLabv3+ outperforms many SOTA approaches: LC, ResNet-DUC-HDC (TuSimple), GCN (Large Kernel Matters), RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3 as shown above.
- (Please feel free to read my reviews on those SOTA approaches.)

## 5.2. Cityscapes

- **X-71**: With the deeper Modified Aligned Xception network (compared with X-65), and the use of the decoder and ASPP, but removing the image-level features, 79.55% mIOU is obtained.
- The image-level features are more effective on the PASCAL VOC 2012 dataset.

DeepLabv3+ builds on many techniques and approaches from previous SOTA work. The story would become too long if I included all of them. Please feel free to read those reviews. Thanks.

# Reference

[2018 ECCV] [DeepLabv3+]

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

# My Previous Reviews

**Image Classification **[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

**Object Detection **[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

**Semantic Segmentation **[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]

**Biomedical Image Segmentation **[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net]

**Instance Segmentation **[SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

**Super Resolution **[SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]

**Human Pose Estimation **[DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

**Codec Post-Processing **[ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]

**Generative Adversarial Network** [GAN]