Review: DeepLabv3+ — Atrous Separable Convolution (Semantic Segmentation)
Outperforms LC, ResNet-DUC-HDC, GCN, RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3
Let’s review DeepLabv3+, which was invented by Google. The DeepLab series has evolved through multiple versions: DeepLabv1 (2015 ICLR), DeepLabv2 (2018 TPAMI), and DeepLabv3 (arXiv).
- (a): With Atrous Spatial Pyramid Pooling (ASPP), the network is able to encode multi-scale contextual information.
- (b): With the Encoder-Decoder Architecture, the location/spatial information is recovered. The Encoder-Decoder Architecture has proven useful in the literature, e.g., FPN, DSSD, TDM, SharpMask, RED-Net, and U-Net, for different kinds of purposes.
- (c): DeepLabv3+ makes use of (a) and (b).
- Further, with the use of Modified Aligned Xception and Atrous Separable Convolution, a faster and stronger network is developed.
- Finally, DeepLabv3+ outperforms PSPNet (1st place in the 2016 ILSVRC Scene Parsing Challenge) and the previous DeepLabv3.
It was published in 2018 ECCV with more than 600 citations. (Sik-Ho Tsang @ Medium)
Outline
- Atrous Separable Convolution
- Encoder-Decoder Architecture
- Modified Aligned Xception
- Ablation Study
- Comparison with State-of-the-art Approaches
1. Atrous Separable Convolution
1.1. Atrous Convolution
- For each location i on the output y and a filter w, atrous convolution is applied over the input feature map x as y[i] = Σ_k x[i + r·k]·w[k], where the atrous rate r corresponds to the stride with which we sample the input signal (a minimal sketch follows below).
- (More details about Atrous Convolution are in my DeepLabv3 review.)
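Below is a minimal sketch (not the authors' code) showing how atrous convolution can be expressed in PyTorch via the dilation argument of nn.Conv2d; the channel sizes and the rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

# With atrous rate r, a 3x3 kernel with padding=r and dilation=r keeps the
# spatial resolution unchanged while enlarging the field-of-view.
rate = 2  # atrous rate r (illustrative value)
atrous_conv = nn.Conv2d(in_channels=256, out_channels=256,
                        kernel_size=3, padding=rate, dilation=rate)

x = torch.randn(1, 256, 64, 64)  # dummy input feature map
y = atrous_conv(x)               # output keeps the 64x64 spatial size
```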
1.2. Atrous Separable Convolution
- (a) and (b), Depthwise Separable Convolution: it factorizes a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., 1×1 convolution), which drastically reduces the computational complexity.
- This was introduced in MobileNetV1. (If interested, please read my review on MobileNetV1 about Depthwise Separable Convolution.)
- (c) Atrous Depthwise Convolution: atrous convolution is supported in the depthwise convolution, and it is found that this significantly reduces the computational complexity of the proposed model while maintaining similar (or better) performance.
- Combined with the pointwise convolution, this forms Atrous Separable Convolution (see the sketch below).
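As a concrete illustration, here is a minimal PyTorch sketch of an atrous separable convolution (assumed layer sizes; BN/ReLU omitted): a 3×3 atrous depthwise convolution followed by a 1×1 pointwise convolution.

```python
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        # groups=in_ch makes the 3x3 convolution depthwise (one filter per channel)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=rate, dilation=rate, groups=in_ch)
        # 1x1 pointwise convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = AtrousSeparableConv(256, 256, rate=6)(torch.randn(1, 256, 33, 33))
```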
2. Encoder-Decoder Architecture
2.1. DeepLabv3 as Encoder
- For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution, i.e., output stride = 32.
- For the task of semantic segmentation, this resolution is too small.
- One can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying the atrous convolution correspondingly.
- Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling (ASPP) module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with image-level features (a simplified sketch follows below).
- (Please read my DeepLabv3 review for details of encoder and ParseNet review for image-level features.)
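A simplified ASPP sketch is given below (rates 6/12/18 as in DeepLabv3 at output stride 16; batch normalization omitted for brevity; an assumed re-implementation, not the authors' code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # 1x1 conv branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous branches
             for r in rates])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), # image-level features
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```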
2.2. Proposed Decoder
- The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features.
- A 1×1 convolution is applied on the low-level features before concatenation to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features.
- After the concatenation, a few 3×3 convolutions are applied to refine the features, followed by another simple bilinear upsampling by a factor of 4.
- This is much better than directly upsampling by 16× in a single bilinear step (see the decoder sketch below).
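The decoder path described above can be sketched as follows (channel sizes follow the paper's ablation: a 48-channel 1×1 convolution and two 3×3 convolutions with 256 channels; BN/ReLU omitted; an assumed re-implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, low_level_ch=256, encoder_ch=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Conv2d(low_level_ch, 48, 1)  # shrink low-level channels
        self.refine = nn.Sequential(
            nn.Conv2d(encoder_ch + 48, 256, 3, padding=1),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, encoder_feat, low_level_feat):
        x = F.interpolate(encoder_feat, scale_factor=4,       # first 4x upsampling
                          mode='bilinear', align_corners=False)
        x = torch.cat([x, self.reduce(low_level_feat)], dim=1)
        return F.interpolate(self.refine(x), scale_factor=4,  # final 4x upsampling
                             mode='bilinear', align_corners=False)
```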
3. Modified Aligned Xception
3.1. Aligned Xception
- Xception is introduced for image classification.
- Then Aligned Xception is introduced in Deformable Convolutional Network (DCN) for object detection.
- The updates from the original Xception to Aligned Xception are highlighted in blue in the paper's figure.
- In brief: some of the max pooling operations are replaced by separable convolutions in the entry flow, the number of repetitions in the middle flow is increased from 8 to 16, and one more convolution is added in the exit flow.
- (Please read my reviews on Xception and DCN if interested.)
3.2. Modified Aligned Xception
- Compared with Aligned Xception, a deeper Xception network is used.
- All max pooling operations are replaced by depthwise separable convolutions with striding, which allows atrous separable convolution to be applied to extract feature maps at an arbitrary resolution.
- Extra batch normalization and ReLU activation are added after each 3×3 depthwise convolution (sketched below).
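A sketch of this separable convolution block (illustrative sizes; assumed code): extra BN and ReLU are inserted after the 3×3 depthwise convolution, and the strided variant replaces max pooling.

```python
import torch.nn as nn

def sep_conv_block(in_ch, out_ch, stride=1, rate=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=rate,
                  dilation=rate, groups=in_ch, bias=False),  # depthwise conv
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),        # extra BN + ReLU
        nn.Conv2d(in_ch, out_ch, 1, bias=False),             # pointwise conv
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
```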
4. Ablation Study
4.1. Decoder Design
- ResNet-101 is used as backbone first.
- It is found that using a 1×1 convolution with 48 channels to reduce the channels of the low-level feature map gives the best performance.
- It is also most effective to use the Conv2 (before striding) feature map and two extra [3×3 conv, 256 channels] operations.
4.2. Model Variants with ResNet as Backbone
- Baseline (First row block): 77.21% to 79.77% mIOU.
- With Decoder (Second row block): the performance is improved from 77.21% to 78.85%, or from 78.51% to 79.35%.
- The performance is further improved to 80.57% when using multi-scale and left-right flipped inputs.
- Coarser feature maps (Third row block): i.e., with output stride = 32, the performance degrades.
4.3. Modified Aligned Xception as Backbone
- Baseline (First row block): 79.17% to 81.34% mIOU.
- With Decoder (Second row block): 79.93% to 81.63% mIOU.
- Using Depthwise Separable Convolution (Third row block): the number of Multiply-Adds is significantly reduced, by 33% to 41%, while similar mIOU performance is obtained.
- Pretraining on COCO (Fourth row block): Extra 2% improvement.
- Pretraining on JFT (Fifth row block): Extra 0.8% to 1% improvement.
4.4. Visualization
5. Comparison with State-of-the-art Approaches
5.1. PASCAL VOC 2012 Test Set
- DeepLabv3+ outperforms many SOTA approaches: LC, ResNet-DUC-HDC (TuSimple), GCN (Large Kernel Matters), RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3 as shown above.
- (Please feel free to read my reviews on those SOTA approaches.)
5.2. Cityscapes
- X-71: with the deeper Modified Aligned Xception network (compared with X-65), the use of the decoder and ASPP, but removing the image-level features, 79.55% mIOU is obtained.
- The image-level features are more effective on the PASCAL VOC 2012 dataset.
There are a lot of techniques and approaches in DeepLabv3+ that are based on previous SOTA approaches. The story would become too long if I included too many of them. Please feel free to read those reviews. Thanks.
Reference
[2018 ECCV] [DeepLabv3+]
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
My Previous Reviews
Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]
Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]
Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]
Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net]
Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]
Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]
Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]
Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]
Generative Adversarial Network [GAN]