Review: DeepLabv3+ — Atrous Separable Convolution (Semantic Segmentation)

Outperforms LC, ResNet-DUC-HDC, GCN, RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3

Sik-Ho Tsang
7 min read · Sep 29, 2019

Let’s review DeepLabv3+, which was invented by Google. The DeepLab series has evolved through several versions: DeepLabv1 (2015 ICLR), DeepLabv2 (2018 TPAMI), and DeepLabv3 (arXiv).

  • (a): With Atrous Spatial Pyramid Pooling (ASPP), multi-scale contextual information can be encoded.
  • (b): With an encoder-decoder architecture, the location/spatial information is recovered. Encoder-decoder architectures have proven useful in the literature, such as FPN, DSSD, TDM, SharpMask, RED-Net, and U-Net, for different kinds of purposes.
  • (c): DeepLabv3+ makes use of both (a) and (b).
  • Further, with the use of the Modified Aligned Xception and atrous separable convolution, a faster and stronger network is developed.
  • Finally, DeepLabv3+ outperforms PSPNet (1st place in the 2016 ILSVRC Scene Parsing Challenge) and its predecessor DeepLabv3.

It was published in 2018 ECCV with more than 600 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Atrous Separable Convolution
  2. Encoder-Decoder Architecture
  3. Modified Aligned Xception
  4. Ablation Study
  5. Comparison with State-of-the-art Approaches

1. Atrous Separable Convolution

1.1. Atrous Convolution

Atrous Convolution with Different Rates r
  • For each location i on the output y and a filter w, atrous convolution is applied over the input feature map x as y[i] = Σ_k x[i + r·k] · w[k], where the atrous rate r corresponds to the stride with which we sample the input signal.
  • (More details about atrous convolution in my DeepLabv3 review.)
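To make the sampling concrete, here is a minimal 1-D NumPy sketch of that formula; the function and variable names are my own, for illustration only:

```python
# 1-D atrous convolution: y[i] = sum_k x[i + r*k] * w[k].
# The filter taps are simply spaced r samples apart.
import numpy as np

def atrous_conv1d(x, w, r):
    """Valid-mode 1-D atrous convolution with rate r (illustrative)."""
    k = len(w)
    span = (k - 1) * r + 1            # receptive field of the dilated filter
    out_len = len(x) - span + 1
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(out_len)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(atrous_conv1d(x, w, r=1))  # rate 1: standard convolution
print(atrous_conv1d(x, w, r=2))  # rate 2: same filter, taps spaced 2 apart
```

With rate 1 this reduces to a standard convolution; increasing r enlarges the receptive field without adding parameters.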

1.2. Atrous Separable Convolution

Depthwise Separable Convolution Using Atrous Convolution
  • (a) and (b), Depthwise Separable Convolution: It factorizes a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., a 1×1 convolution), which drastically reduces computational complexity.
  • This was introduced in MobileNetV1. (If interested, please read my review on MobileNetV1 about depthwise separable convolution.)
  • (c) Atrous Depthwise Convolution: Atrous convolution is supported in the depthwise convolution, and it is found that this significantly reduces the computational complexity of the proposed model while maintaining similar (or better) performance.
  • Combined with the pointwise convolution, this forms the Atrous Separable Convolution, sketched below.
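Here is a minimal PyTorch sketch of an atrous separable convolution, assuming 3×3 kernels: a dilated depthwise convolution (groups equal to the input channels) followed by a 1×1 pointwise convolution. The module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        # Depthwise: one 3x3 filter per channel, dilated by `rate`;
        # padding=rate keeps the spatial size unchanged.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=rate, dilation=rate,
                                   groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 33, 33)
y = AtrousSeparableConv(64, 128, rate=2)(x)
print(y.shape)  # torch.Size([1, 128, 33, 33])
```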

2. Encoder-Decoder Architecture

DeepLabv3+ Extends DeepLabv3

2.1. DeepLabv3 as Encoder

  • For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution, i.e., output stride = 32.
  • For the task of semantic segmentation, this is too coarse.
  • One can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying atrous convolution correspondingly.
  • Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling (ASPP) module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with image-level features; a sketch follows this list.
  • (Please read my DeepLabv3 review for details of the encoder and my ParseNet review for image-level features.)
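Below is a compact PyTorch sketch of such an ASPP head, assuming output stride = 16 and the rates (6, 12, 18) from the DeepLabv3 paper. All module and variable names are illustrative, and batch normalization is omitted for brevity; this is not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        # Parallel 3x3 atrous branches, one per rate.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates])
        self.image_pool = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # on pooled features
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [b(x) for b in self.branches]
        # Image-level features: global average pool, 1x1 conv, upsample back.
        pooled = F.adaptive_avg_pool2d(x, 1)
        feats.append(F.interpolate(self.image_pool(pooled), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```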

2.2. Proposed Decoder

  • The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features.
  • A 1×1 convolution is applied on the low-level features before concatenation to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features.
  • After the concatenation, a few 3×3 convolutions are applied to refine the features, followed by another simple bilinear upsampling by a factor of 4; see the sketch after this list.
  • This is much better than bilinearly upsampling 16× directly.
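A minimal PyTorch sketch of this decoder, assuming a 256-channel encoder output and the 48-channel 1×1 reduction plus two [3×3 conv; 256] refinement layers found best in the ablation of Section 4.1; class and layer names are my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    def __init__(self, low_level_ch, encoder_ch=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Conv2d(low_level_ch, 48, 1, bias=False)  # shrink low-level channels
        self.refine = nn.Sequential(
            nn.Conv2d(encoder_ch + 48, 256, 3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1, bias=False),
            nn.ReLU(inplace=True))
        self.classify = nn.Conv2d(256, num_classes, 1)

    def forward(self, encoder_out, low_level):
        # First 4x upsample: bring encoder output to the low-level feature size.
        x = F.interpolate(encoder_out, size=low_level.shape[2:],
                          mode='bilinear', align_corners=False)
        x = self.refine(torch.cat([x, self.reduce(low_level)], dim=1))
        x = self.classify(x)
        # Final 4x bilinear upsampling back to the input resolution.
        return F.interpolate(x, scale_factor=4, mode='bilinear',
                             align_corners=False)
```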

3. Modified Aligned Xception

3.1. Aligned Xception

Aligned Xception
  • Xception was introduced for image classification.
  • Aligned Xception was then introduced in the Deformable Convolutional Network (DCN) for object detection.
  • The updates of Aligned Xception over the original Xception are shown in blue.
  • In brief, some of the max pooling operations are replaced by separable convolutions in the entry flow, the number of repetitions in the middle flow is increased from 8 to 16, and one more convolution is added in the exit flow.
  • (Please read my reviews on Xception and DCN if interested.)

3.2. Modified Aligned Xception

Modified Aligned Xception
  • Compared with Aligned Xception, a deeper Xception network is used.
  • All max pooling operations are replaced by depthwise separable convolutions with striding, which allows atrous separable convolution to be applied to extract feature maps at an arbitrary resolution.
  • Extra batch normalization and ReLU activation are added after each 3×3 depthwise convolution; a sketch of this unit follows.
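A short PyTorch sketch of such a separable convolution unit, with BN and ReLU after both the depthwise and the pointwise convolution, and optional striding in place of max pooling. This is a simplified illustration under my own naming, not the released code.

```python
import torch.nn as nn

def modified_sep_conv(in_ch, out_ch, stride=1, rate=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=rate,
                  dilation=rate, groups=in_ch, bias=False),  # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),        # extra BN + ReLU
        nn.Conv2d(in_ch, out_ch, 1, bias=False),             # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
```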

4. Ablation Study

4.1. Decoder Design

Effects of different numbers of channels in the 1×1 convolution on the PASCAL VOC 2012 val set
  • It is found that the best performance is obtained when the 1×1 convolution used to reduce the channels of the low-level feature map has 48 channels.
PASCAL VOC 2012 val set
  • And it is most effective to use the Conv2 feature map (before striding) and two extra [3×3 conv; 256 channels] operations.

4.2. Model Variants with ResNet as Backbone

PASCAL VOC 2012 val set
  • Baseline (first row block): 77.21% to 79.77% mIOU.
  • With decoder (second row block): the performance is improved from 77.21% to 78.85%, or from 78.51% to 79.35%.
  • The performance is further improved to 80.57% when using multi-scale and left-right flipped inputs.
  • Coarser feature maps (third row block), i.e. output stride = 32: the performance is not good.

4.3. Modified Aligned Xception as Backbone

PASCAL VOC 2012 val set
  • Baseline (first row block): 79.17% to 81.34% mIOU.
  • With decoder (second row block): 79.93% to 81.63% mIOU.
  • Using depthwise separable convolution (third row block): the number of Multiply-Adds is significantly reduced by 33% to 41%, while similar mIOU performance is obtained.
  • Pretraining on COCO (fourth row block): an extra 2% improvement.
  • Pretraining on JFT (fifth row block): an extra 0.8% to 1% improvement.

4.4. Visualization

Visualizations on the val set; the last row shows a failure mode

5. Comparison with State-of-the-art Approaches

5.1. PASCAL VOC 2012 Test Set

5.2. Cityscapes

val set
  • X-71: With the deeper Modified Aligned Xception network (compared with X-65), and the use of the decoder and ASPP but with the image-level features removed, 79.55% mIOU is obtained.
  • The image-level features are more effective on the PASCAL VOC 2012 dataset.
test set

Many techniques and approaches in DeepLabv3+ are based on previous SOTA approaches. The story would become too long if I included all of them. Please feel free to read those reviews. Thanks.

Reference

[2018 ECCV] [DeepLabv3+]
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]

Generative Adversarial Network [GAN]

