Review — Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

Auto-DeepLab, DeepLab with Neural Architecture Search (NAS)

Comparing Auto-DeepLab against other CNN architectures with two-level hierarchy.
  • In the prior arts above, the network-level structure follows a pre-defined pattern: a stack of modules, each composed of a few cell-level structures, interleaved with downsampling modules.
  • In this paper, Auto-DeepLab is proposed to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space.
  • This is a paper from Li Fei-Fei’s research group.


  1. Cell Level Search Space
  2. Auto-DeepLab: Network Level Search Space
  3. Auto-DeepLab: Searched Network
  4. Experimental Results

1. Cell Level Search Space

  • For the inner cell level, the authors reuse the cell-level search space adopted in NASNet, PNASNet, DARTS, and AmoebaNet [93, 47, 62, 49] to keep consistent with previous works.
  • A cell is a small fully convolutional module. It is a directed acyclic graph consisting of B blocks.
  • Each block is a two-branch structure, mapping from 2 input tensors to 1 output tensor.
  • Block i in cell l may be specified by a 5-tuple (I₁, I₂, O₁, O₂, C), where I₁, I₂ ∈ 𝓘ᵢˡ are selections of input tensors, O₁, O₂ ∈ 𝓞 are selections of layer types applied to the corresponding input tensors, and C ∈ 𝓒 is the method used to combine the individual outputs of the two branches to form this block’s output tensor, Hᵢˡ.
  • The set of possible layer types, 𝓞, consists of the following 8 operators, all prevalent in modern CNNs: 3×3 depthwise-separable conv, 5×5 depthwise-separable conv, 3×3 atrous conv with rate 2, 5×5 atrous conv with rate 2, 3×3 average pooling, 3×3 max pooling, skip connection, and no connection (zero).
  • For the set of possible combination operators C, element-wise addition is the only choice.
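The block and cell structure above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the operator stand-ins and tensor shapes are simplified assumptions, with element-wise addition as the combiner C, as in the paper.

```python
import numpy as np

# Stand-ins for a few of the 8 candidate layer types in O.
# The real operators are convolutions/poolings; identity-like maps keep the sketch runnable.
OPS = {
    "sep_conv_3x3": lambda x: x * 0.9,       # placeholder for 3x3 depthwise-separable conv
    "atrous_conv_3x3": lambda x: x * 1.1,    # placeholder for 3x3 atrous conv (rate 2)
    "skip": lambda x: x,                     # skip connection
    "none": lambda x: np.zeros_like(x),      # no connection (zero)
}

def block(states, I1, I2, O1, O2):
    """One block of a cell, specified by the 5-tuple (I1, I2, O1, O2, C).
    C is fixed to element-wise addition."""
    return OPS[O1](states[I1]) + OPS[O2](states[I2])

def cell(h_prev, h_prev_prev, block_specs):
    """A cell with B blocks: each block may consume the two cell inputs
    H^{l-1}, H^{l-2} plus all previous blocks' outputs."""
    states = [h_prev_prev, h_prev]
    for I1, I2, O1, O2 in block_specs:
        states.append(block(states, I1, I2, O1, O2))
    # The cell output H^l is the concatenation of the B block outputs.
    return np.concatenate(states[2:], axis=-1)

x1 = np.ones((4, 4, 8))   # toy H^{l-1}
x2 = np.ones((4, 4, 8))   # toy H^{l-2}
out = cell(x1, x2, [(0, 1, "skip", "sep_conv_3x3"),
                    (1, 2, "atrous_conv_3x3", "skip")])
print(out.shape)  # (4, 4, 16): two blocks concatenated along channels
```
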

2. Auto-DeepLab: Network Level Search Space

Left: Proposed network level search space with L=12. Gray nodes represent the fixed “stem” layers, and a path along the blue nodes represents a candidate network level architecture. Right: During the search, each cell is a densely connected structure.

2.1. Principles

  • Two principles are followed:
  1. The spatial resolution of the next layer is either twice as large, twice as small, or the same.
  2. The smallest spatial resolution is downsampled by 32.
  • The beginning of the network is a two-layer “stem” structure, each layer of which reduces the spatial resolution by a factor of 2. After that, there are a total of L layers with unknown spatial resolutions, with the maximum downsampled by 4 and the minimum downsampled by 32. Since adjacent layers may differ in spatial resolution by at most a factor of 2, the first layer after the stem can only be downsampled by 4 or 8.
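The two principles can be made concrete with a tiny helper (an illustrative sketch, not from the paper's code): from a given downsampling rate, the next layer may only halve, keep, or double the resolution within the /4 to /32 range.

```python
def next_rates(s, smallest=32, largest=4):
    """Valid downsampling rates for the layer after one at rate s.
    Principle 1: resolution changes by at most a factor of 2.
    Principle 2: rates stay within the /4 .. /32 range."""
    return [r for r in (s // 2, s, s * 2) if largest <= r <= smallest]

print(next_rates(4))    # [4, 8]
print(next_rates(8))    # [4, 8, 16]
print(next_rates(32))   # [16, 32]
```
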

2.2. Can Be Generalized to Prior Arts

Network level architecture used in DeepLabv3.
Network level architecture used in Conv-Deconv (DeconvNet).
Network level architecture used in Stacked Hourglass.
  • The proposed search space is general enough to cover many popular designs, as above.

2.3. Network Level Update

  • Every block’s output tensor Hᵢˡ is connected to all hidden states in 𝓘ᵢˡ:

    Hᵢˡ = Σ_{Hⱼˡ ∈ 𝓘ᵢˡ} O_{j→i}(Hⱼˡ)  (1)

  • In addition, each O_{j→i} is approximated by its continuous relaxation Ō_{j→i}, defined as:

    Ō_{j→i}(Hⱼˡ) = Σ_{Oᵏ ∈ 𝓞} α^k_{j→i} Oᵏ(Hⱼˡ)  (2)

  • where:

    Σ_k α^k_{j→i} = 1 and α^k_{j→i} ≥ 0, for all i, j, k  (3)

  • In other words, the α^k_{j→i} are normalized scalars associated with each operator Oᵏ ∈ 𝓞, easily implemented as a softmax.
  • As H^{l−1} and H^{l−2} are always included in 𝓘ᵢˡ, and Hˡ is the concatenation of H₁ˡ, …, H_Bˡ, the cell-level update, combining Eq. (1) and Eq. (2), may be summarized as:

    Hˡ = Cell(H^{l−1}, H^{l−2}; α)  (4)

  • Each layer l has at most 4 hidden states {⁴Hˡ, ⁸Hˡ, ¹⁶Hˡ, ³²Hˡ}, where the upper-left superscript denotes the downsampling rate.
  • A scalar β is associated with each gray arrow in the figure above. The network-level update is:

    ˢHˡ = β^l_{s/2→s} Cell(^{s/2}H^{l−1}, ˢH^{l−2}; α) + β^l_{s→s} Cell(ˢH^{l−1}, ˢH^{l−2}; α) + β^l_{2s→s} Cell(^{2s}H^{l−1}, ˢH^{l−2}; α)  (5)

  • where s = 4, 8, 16, 32 and l = 1, 2, …, L. The scalars β are normalized by a softmax such that:

    β^l_{s→s/2} + β^l_{s→s} + β^l_{s→2s} = 1  (6)
  • At the end, Atrous Spatial Pyramid Pooling (ASPP) modules are attached to each spatial resolution at the L-th layer (with atrous rates adjusted accordingly). Their outputs are bilinearly upsampled to the original resolution before being summed to produce the prediction.
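The network-level update can be sketched as a β-weighted mixture of the three candidate cell outputs. This is a toy NumPy illustration: the cell stand-in and the 1-D tensors are assumptions, and the resizing between different spatial resolutions is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy stand-in for Cell(., .; alpha): here just an average of its two inputs.
# The real cell is the searched fully convolutional module.
def cell(h_prev, h_prev_prev):
    return 0.5 * (h_prev + h_prev_prev)

def network_update(h_prev_by_rate, h_prev_prev, beta_logits, s):
    """Continuous relaxation at downsampling rate s: a beta-weighted sum over
    the three connection patterns (coming from rate s/2, s, and 2s)."""
    b = softmax(beta_logits)  # beta^l_{s/2->s}, beta^l_{s->s}, beta^l_{2s->s}
    return (b[0] * cell(h_prev_by_rate[s // 2], h_prev_prev)
            + b[1] * cell(h_prev_by_rate[s], h_prev_prev)
            + b[2] * cell(h_prev_by_rate[2 * s], h_prev_prev))

# Toy hidden states at rates 4, 8, 16 (real tensors would differ in spatial size).
h_prev = {4: np.ones(3), 8: 2 * np.ones(3), 16: 4 * np.ones(3)}
out = network_update(h_prev, np.zeros(3), np.array([0.0, 0.0, 0.0]), s=8)
print(out)  # equal beta weights -> average of the three candidate cell outputs
```
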

2.4. Optimization

  • The training data is partitioned into two disjoint sets trainA and trainB.
  • The optimization alternates between:
  1. Update network weights w by ∇wLtrainA(w,α,β).
  2. Update architecture α,β by ∇α,β LtrainB(w,α,β).
  • where the loss function L is the cross entropy calculated on the semantic segmentation mini-batch. The disjoint partition prevents the architecture from overfitting the training data.
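The alternating scheme can be illustrated on a toy scalar problem (quadratic losses standing in for the cross-entropy terms; this is a sketch of the first-order alternation, not the paper's training code):

```python
# Toy bi-level alternation: w is updated on trainA, arch on trainB.
w, arch = 5.0, 5.0          # "network weights" and "architecture parameters"
lr_w, lr_arch = 0.1, 0.1

def grad_w(w, arch):        # dL_trainA/dw for L_trainA = (w - arch)^2
    return 2.0 * (w - arch)

def grad_arch(w, arch):     # dL_trainB/darch for L_trainB = (arch - 1)^2 + (w - arch)^2
    return 2.0 * (arch - 1.0) - 2.0 * (w - arch)

for step in range(200):
    w -= lr_w * grad_w(w, arch)           # 1. update network weights w on trainA
    arch -= lr_arch * grad_arch(w, arch)  # 2. update architecture params on trainB

print(round(w, 3), round(arch, 3))  # both converge near 1.0
```
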

2.5. Decoding Discrete Architecture

  • The β values can be interpreted as the “transition probability” between different “states”.
  • Quite intuitively, the goal is to find the path with the “maximum probability” from start to end. This path can be decoded efficiently using the classic Viterbi algorithm, as in the implementation.
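Since the β values behave like transition probabilities between downsampling rates, the decoding step is a textbook Viterbi pass. A minimal sketch follows (the transition tensor here is a toy assumption, not the released implementation):

```python
import numpy as np

def viterbi_decode(trans):
    """Decode the maximum-probability path through the downsampling rates.
    trans[l] is a (num_rates x num_rates) matrix of beta transition
    probabilities from layer l to layer l+1 (zeros mark invalid jumps)."""
    L, n, _ = trans.shape
    logp = np.zeros(n)                  # best log-prob of reaching each rate
    back = np.zeros((L, n), dtype=int)  # backpointers
    for l in range(L):
        scores = logp[:, None] + np.log(trans[l] + 1e-12)
        back[l] = scores.argmax(axis=0)
        logp = scores.max(axis=0)
    path = [int(logp.argmax())]
    for l in range(L - 1, -1, -1):
        path.append(int(back[l, path[-1]]))
    return path[::-1]

# Toy 3-layer example over rates (4, 8, 16, 32), indexed 0..3.
t = np.zeros((3, 4, 4))
t[0, 0, 1] = 0.9; t[0, 0, 0] = 0.1   # layer 1 strongly prefers 4 -> 8
t[1, 1, 2] = 0.8; t[1, 1, 1] = 0.2   # layer 2: 8 -> 16
t[2, 2, 1] = 0.7; t[2, 2, 2] = 0.3   # layer 3: 16 -> 8
print(viterbi_decode(t))  # [0, 1, 2, 1]: downsample 4 -> 8 -> 16 -> 8
```
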

3. Auto-DeepLab: Searched Network

The Auto-DeepLab architecture found by our Hierarchical Neural Architecture Search on Cityscapes.

3.1. Searching & Findings

  • L=12 and B=5. The network-level search space has 2.9 × 10⁴ unique paths, and the number of cell structures is 5.6 × 10¹⁴, so the size of the joint hierarchical search space is on the order of 10¹⁹.
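The 2.9 × 10⁴ figure can be reproduced with a short dynamic program over the factor-of-2 transition rule. One counting convention matches the paper's number, and it is assumed here: the first of the 12 layers is pinned at downsample 4, leaving 11 free transitions.

```python
# Count unique network-level paths for L = 12 layers over the four
# downsampling rates (4, 8, 16, 32), where consecutive layers may differ
# by at most a factor of 2. Assumption: the first layer sits at rate 4,
# which reproduces the 2.9 x 10^4 figure.
L = 12
counts = [1, 0, 0, 0]   # paths ending at rate index 0..3, i.e. (4, 8, 16, 32)
for _ in range(L - 1):  # 11 transitions between the 12 layers
    counts = [sum(counts[j] for j in (i - 1, i, i + 1) if 0 <= j < 4)
              for i in range(4)]
print(sum(counts))  # 28657, i.e. about 2.9 x 10^4
```
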
  • The Atrous Spatial Pyramid Pooling module used in DeepLabv3 has 5 branches: one 1×1 convolution, three 3×3 convolutions with various atrous rates, and pooled image features. During the search, ASPP is simplified from 5 branches to 3 by using only one 3×3 convolution, with atrous rate 96/s.
  • In terms of cell-level architecture, the conjunction of atrous convolution and depthwise-separable convolution is often used, suggesting that the importance of context has been learned. In contrast, atrous convolution is rarely found useful in cells in prior art on image classification.
  • (Please feel free to read the paper directly for more details.)

3.2. Searched Auto-DeepLab Network

  • A simple encoder-decoder structure similar to DeepLabv3+ is used. Specifically, the encoder consists of the best network architecture found by the search, augmented with the ASPP module, and the decoder is the same as in DeepLabv3+.
  • Additionally, the “stem” structure is redesigned with three 3×3 convolutions (with stride 2 in the first and third convolutions). The first two convolutions have 64 filters while the third convolution has 128 filters.
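The stem's shape arithmetic can be checked in a couple of lines (a sketch; the standard "same"-padding convolution output-size formula is assumed):

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Output spatial size of a 3x3 convolution with 'same'-style padding."""
    return (size + 2 * pad - kernel) // stride + 1

# Redesigned stem: three 3x3 convolutions, stride 2 in the first and third.
# Filter counts are 64, 64, 128 (shown for reference only).
h = 512
for stride, filters in [(2, 64), (1, 64), (2, 128)]:
    h = conv_out(h, stride=stride)
    print(h, filters)
# 512 -> 256 -> 256 -> 128: overall downsampling by 4, as required
```
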

4. Experimental Results

4.1. Cityscapes

Validation accuracy during 40 epochs of architecture search optimization across 10 random trials.
Cityscapes validation set results with different Auto-DeepLab model variants. F: the filter multiplier controlling the model capacity.
  • Model capacity is controlled by changing the filter multiplier F.
Cityscapes validation set results.
Cityscapes test set results with multi-scale inputs during inference. ImageNet: Models pretrained on ImageNet. Coarse: Models exploit coarse annotations.

4.2. PASCAL VOC 2012

PASCAL VOC 2012 validation set results.
PASCAL VOC 2012 test set results.
  • It lags behind the top-performing DeepLabv3+, which uses Xception-65 as the network backbone, by 2.2%. It is argued that the dataset is too small to train models from scratch, and that pretraining on ImageNet is still beneficial in this case.

4.3. ADE20K

ADE20K validation set results.

