Review — Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

Auto-DeepLab, DeepLab with Neural Architecture Search (NAS)

Sik-Ho Tsang
7 min read · Nov 22, 2022
Comparing Auto-DeepLab against other CNN architectures with two-level hierarchy.

Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation, Auto-DeepLab, by Johns Hopkins University, Google, & Stanford University, 2019 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Neural Architecture Search, NAS

  • In the prior arts above, the network level structure follows a pre-defined pattern: a stack of modules, each composed of a few cell level structures, interleaved with some downsampling modules.
  • In this paper, Auto-DeepLab is proposed to search the network level structure in addition to the cell level structure, which together form a hierarchical architecture search space.
  • This is a paper from Li Fei-Fei research group.


  1. Cell Level Search Space
  2. Auto-DeepLab: Network Level Search Space
  3. Auto-DeepLab: Searched Network
  4. Experimental Results

1. Cell Level Search Space

  • For the inner cell level, authors reuse the one adopted in NASNet, PNASNet, DARTS, and AmoebaNet [93, 47, 62, 49] to keep consistent with previous works.
  • A cell is a small fully convolutional module. It is a directed acyclic graph consisting of B blocks.
  • Each block is a two-branch structure, mapping from 2 input tensors to 1 output tensor.
  • Block i in cell l may be specified using a 5-tuple (I1, I2, O1, O2, C), where I1, I2 ∈ I^l_i are selections of input tensors, O1, O2 ∈ O are selections of layer types applied to the corresponding input tensors, and C is the combination method, chosen from the set C, used to combine the individual outputs of the two branches to form this block’s output tensor H^l_i.
  • The set of possible layer types, O, consists of the following 8 operators, all prevalent in modern CNNs: 3×3 depthwise-separable conv, 5×5 depthwise-separable conv, 3×3 atrous conv with rate 2, 5×5 atrous conv with rate 2, 3×3 average pooling, 3×3 max pooling, skip connection, and no connection (zero).
  • For the set of possible combination operators C, element-wise addition is the only choice.
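As a concrete sketch of this search space (illustrative, not the authors' code; the operator names follow the paper's cell-level search space), a block's 5-tuple and the operator set O can be written down as:

```python
# Sketch of the cell-level search space (illustrative, not the authors' code).
# The 8 candidate layer types O and a block's 5-tuple (I1, I2, O1, O2, C).

OPERATORS = [
    "3x3 depthwise-separable conv",
    "5x5 depthwise-separable conv",
    "3x3 atrous conv (rate 2)",
    "5x5 atrous conv (rate 2)",
    "3x3 average pooling",
    "3x3 max pooling",
    "skip connection",
    "no connection (zero)",
]

def make_block(i1, i2, o1, o2, combine="add"):
    """Block = two branches mapping 2 input tensors to 1 output tensor.

    i1, i2: indices into the available input tensors I_i^l (the two previous
            cells' outputs plus earlier blocks in this cell).
    o1, o2: candidate layer types from OPERATORS.
    combine: element-wise addition is the only combination choice.
    """
    assert o1 in OPERATORS and o2 in OPERATORS and combine == "add"
    return (i1, i2, o1, o2, combine)

# Example: block 1 reads H^{l-1} (index 0) and H^{l-2} (index 1).
block = make_block(0, 1, OPERATORS[0], OPERATORS[6])
```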

2. Auto-DeepLab: Network Level Search Space

Left: Proposed network level search space with L=12. Gray nodes represent the fixed “stem” layers, and a path along the blue nodes represents a candidate network level architecture. Right: During the search, each cell is a densely connected structure.

2.1. Principles

  • Two principles are followed:
  1. The spatial resolution of the next layer is either twice as large, twice as small, or the same.
  2. The smallest spatial resolution is downsampled by 32.
  • The beginning of the network is a two-layer “stem” structure, each layer of which reduces the spatial resolution by a factor of 2. After that, there are a total of L layers with unknown spatial resolutions, with the maximum being downsampled by 4 and the minimum being downsampled by 32. Since consecutive layers may differ in spatial resolution by at most a factor of 2, the first layer after the stem can only be downsampled by 4 or 8.

The proposed network level search space is shown above. The goal is then to find a good path in this L-layer trellis.
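To make the trellis concrete, the number of candidate paths can be counted with a simple dynamic program over the allowed resolution transitions (each layer may keep, halve, or double the downsample factor, clipped to the range 4–32). Counting conventions at the boundary vary; the sketch below fixes the first trellis column at downsample 4 (the stem output), an assumed convention that reproduces the 2.9 × 10⁴ figure quoted later for L = 12:

```python
def count_paths(L=12, factors=(4, 8, 16, 32)):
    """Count network-level paths in the L-layer trellis.

    Each layer's downsample factor may stay, halve, or double relative to
    the previous layer, clipped to [4, 32].
    Assumption: the first column is fixed at downsample 4 (stem output).
    """
    counts = {s: 0 for s in factors}
    counts[4] = 1  # first column fixed at downsample 4 (assumed convention)
    for _ in range(L - 1):
        nxt = {s: 0 for s in factors}
        for s, c in counts.items():
            for t in (s // 2, s, s * 2):  # upsample, keep, downsample
                if t in nxt:
                    nxt[t] += c
        counts = nxt
    return sum(counts.values())

print(count_paths(12))  # 28657, i.e. ~2.9e4 unique paths
```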

2.2. Can Be Generalized to Prior Arts

Network level architecture used in DeepLabv3.
Network level architecture used in Conv-Deconv (DeconvNet).
Network level architecture used in Stacked Hourglass.
  • The proposed search space is general enough to cover many popular designs, as above.

2.3. Cell & Network Level Updates

  • Every block’s output tensor H^l_i is connected to all hidden states in I^l_i:

H^l_i = Σ_{H^l_j ∈ I^l_i} O_{j→i}(H^l_j)   (Eq. 1)

  • In addition, each O_{j→i} is approximated with its continuous relaxation Ō_{j→i}, defined as:

Ō_{j→i}(H^l_j) = Σ_{O^k ∈ O} α^k_{j→i} O^k(H^l_j),   where Σ_k α^k_{j→i} = 1 and α^k_{j→i} ≥ 0   (Eq. 2)

  • In other words, the α^k_{j→i} are normalized scalars associated with each operator O^k ∈ O, easily implemented as a softmax.
  • As H^{l−1} and H^{l−2} are always included in I^l_i, and H^l is the concatenation of H^l_1, …, H^l_B, the cell level update may be summarized, together with Eq. (1) and Eq. (2), as:

H^l = Cell(H^{l−1}, H^{l−2}; α)   (Eq. 3)

  • Each layer l will have at most 4 hidden states {^4H^l, ^8H^l, ^16H^l, ^32H^l}, where the left superscript denotes the downsample factor.
  • A scalar β is associated with each gray arrow in the figure above. The network level update is:

^sH^l = β^l_{s/2→s} Cell(^{s/2}H^{l−1}, ^sH^{l−2}; α) + β^l_{s→s} Cell(^sH^{l−1}, ^sH^{l−2}; α) + β^l_{2s→s} Cell(^{2s}H^{l−1}, ^sH^{l−2}; α)   (Eq. 4)

  • where s = 4, 8, 16, 32 and l = 1, 2, …, L. The scalars β are normalized by softmax such that:

β^l_{s→s/2} + β^l_{s→s} + β^l_{s→2s} = 1   (invalid transitions at the boundary resolutions are omitted)

  • At the end, Atrous Spatial Pyramid Pooling (ASPP) modules are attached to each of the 4 spatial resolutions at the L-th layer (atrous rates are adjusted accordingly). Their outputs are bilinearly upsampled to the original resolution before being summed to produce the prediction.
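A minimal pure-Python sketch of these softmax-weighted updates (toy scalar "operators" stand in for the real conv/pool modules; all names here are illustrative):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

# Toy stand-ins for the 8 layer types O^k (real ones are conv/pool modules).
ops = [lambda h, k=k: (k + 1) * h for k in range(8)]

def mixed_op(h, alpha_logits):
    """Continuous relaxation: O-bar(h) = sum_k softmax(alpha)_k * O^k(h)."""
    weights = softmax(alpha_logits)
    return sum(w * op(h) for w, op in zip(weights, ops))

def network_update(cell_outs, beta_weights):
    """Network level: sH^l is a beta-weighted sum of the cells computed
    from the s/2, s, and 2s resolutions of the previous layer."""
    return sum(b * c for b, c in zip(beta_weights, cell_outs))

out = mixed_op(1.0, alpha_logits=[0.0] * 8)         # uniform alpha weights
out_net = network_update([2.0, 3.0, 4.0], softmax([0.0, 0.0, 0.0]))
```

With uniform alpha logits the mixed operator is just the mean of the 8 operator outputs, which is what the softmax relaxation reduces to before any training has shaped the weights.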

2.4. Optimization

  • The training data is partitioned into two disjoint sets trainA and trainB.
  • The optimization alternates between:
  1. Update network weights w by ∇wLtrainA(w,α,β).
  2. Update architecture α,β by ∇α,β LtrainB(w,α,β).
  • where the loss function L is the cross entropy calculated on the semantic segmentation mini-batch. The disjoint set partition is to prevent the architecture from overfitting the training data.
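The alternation above can be sketched with a toy first-order loop (illustrative only: scalar parameters and made-up quadratic losses stand in for the real network weights, architecture parameters α and β, and the cross-entropy losses on trainA / trainB):

```python
# Toy sketch of the alternating optimization (not the authors' code).

def grad(f, x, eps=1e-5):
    """Central-difference numeric gradient."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def loss_a(w, a):   # made-up stand-in for L_trainA(w, alpha, beta)
    return (w - a) ** 2 + 0.1 * w ** 2

def loss_b(w, a):   # made-up stand-in for L_trainB(w, alpha, beta)
    return (w - 1.0) ** 2 + (a - w) ** 2

w, a, lr = 0.0, 0.5, 0.1
for _ in range(500):
    # Step 1: update network weights w on trainA.
    w -= lr * grad(lambda x: loss_a(x, a), w)
    # Step 2: update architecture parameters on trainB.
    a -= lr * grad(lambda x: loss_b(w, x), a)
```

The point of the sketch is only the structure of the loop: the weights never see trainB, and the architecture parameters never see trainA, which is what keeps the searched architecture from overfitting.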

2.5. Decoding Discrete Architecture

  • The β values can be interpreted as the “transition probability” between different “states”.
  • Quite intuitively, the goal is to find the path with the “maximum probability” from start to end. This path can be decoded efficiently using the classic Viterbi algorithm, as in the implementation.
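That decoding step can be sketched as a standard Viterbi pass (pure Python; the trellis and β values fed to it are made up for illustration, not taken from the paper):

```python
import math

FACTORS = [4, 8, 16, 32]

def viterbi_decode(log_beta, start=4):
    """Decode the maximum-probability resolution path through the trellis.

    log_beta[l] maps a transition (s, t) -> log beta, the log "transition
    probability" from downsample factor s at layer l to factor t at l+1.
    """
    score = {s: (0.0 if s == start else -math.inf) for s in FACTORS}
    back = []
    for trans in log_beta:
        new_score, ptr = {}, {}
        for t in FACTORS:
            best_s, best = None, -math.inf
            for s in FACTORS:
                if (s, t) in trans and score[s] + trans[(s, t)] > best:
                    best_s, best = s, score[s] + trans[(s, t)]
            new_score[t], ptr[t] = best, best_s
        score = new_score
        back.append(ptr)
    # Backtrack from the best final state to recover the path.
    t = max(score, key=score.get)
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return list(reversed(path))
```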

3. Auto-DeepLab: Searched Network

The Auto-DeepLab architecture found by our Hierarchical Neural Architecture Search on Cityscapes.

3.1. Searching & Findings

  • L=12. B=5. The network level search space has 2.9 × 10⁴ unique paths, and the number of cell structures is 5.6 × 10¹⁴. So the size of the joint, hierarchical search space is in the order of 10¹⁹.
  • The Atrous Spatial Pyramid Pooling module used in DeepLabv3 has 5 branches: one 1×1 convolution, three 3×3 convolutions with various atrous rates, and the pooled image feature. During the search, ASPP is simplified to 3 branches instead of 5 by using only one 3×3 convolution with atrous rate 96/s.
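The quoted search-space sizes can be re-derived with a short calculation (the input-choice counting convention below is an assumption made here because it reproduces the quoted figures, not something stated in this review):

```python
from math import prod

B, NUM_OPS = 5, 8  # B = 5 blocks per cell, 8 candidate layer types

# Cell level: block i picks 2 inputs from i + 1 candidates (the two previous
# cells' outputs plus all earlier blocks in this cell), and each branch then
# picks one of the 8 operators. Counting ordered input pairs, (i + 1)^2 per
# block, is the assumed convention.
cell_structures = prod((i + 1) ** 2 for i in range(1, B + 1)) * NUM_OPS ** (2 * B)
# cell_structures = 720^2 * 8^10 ≈ 5.6e14

network_paths = 2.9e4  # unique trellis paths for L = 12, as quoted above

joint = network_paths * cell_structures  # ≈ 1.6e19, i.e. order 10^19
```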

In terms of network level architecture, higher resolution is preferred at both the beginning (the path stays at downsample 4 for longer) and the end (the path ends at downsample 8). There is a general tendency to downsample in the first 3/4 of the layers and upsample in the last 1/4 of the layers.

  • In terms of cell level architecture, the conjunction of atrous convolution and depthwise-separable convolution is often used, suggesting that the importance of context has been learned. In contrast, atrous convolution is rarely found to be useful in prior cell-level search for image classification.
  • (Please feel free to read the paper directly for more details.)

3.2. Searched Auto-DeepLab Network

  • A simple encoder-decoder structure similar to DeepLabv3+ is used. Specifically, the encoder consists of the best network architecture found by the search, augmented with the ASPP module, and the decoder is the same as the one in DeepLabv3+.
  • Additionally, the “stem” structure is redesigned with three 3×3 convolutions (with stride 2 in the first and third convolutions). The first two convolutions have 64 filters while the third convolution has 128 filters.
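The redesigned stem's shape bookkeeping can be sketched as follows (pure Python, shapes only; a real implementation would use actual conv layers, and "same" padding is assumed here):

```python
def conv_out(size, stride, kernel=3, pad=1):
    """Spatial size after a 3x3 conv with 'same' padding and the given stride."""
    return (size + 2 * pad - kernel) // stride + 1

def stem(h, w):
    """Three 3x3 convs; stride 2 in the first and third (filters 64, 64, 128)."""
    shapes = []
    for stride, filters in [(2, 64), (1, 64), (2, 128)]:
        h, w = conv_out(h, stride), conv_out(w, stride)
        shapes.append((filters, h, w))
    return shapes

# A 512x512 input leaves the stem at 128x128, i.e. downsampled by 4,
# with 128 channels, matching the text above.
```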

4. Experimental Results

4.1. Cityscapes

Validation accuracy during 40 epochs of architecture search optimization across 10 random trials.

The validation accuracy steadily improves throughout the searching process.

Cityscapes validation set results with different Auto-DeepLab model variants. F: the filter multiplier controlling the model capacity.
  • Model capacity is varied by changing the filter multiplier F.

Higher model capacity leads to better performance at the cost of slower speed (indicated by larger Multi-Adds).

Cityscapes validation set results.

Increasing the training iterations from 500K to 1.5M iterations improves the performance by 2.8%.

Additionally, adopting the Scheduled Drop Path [40, 93] further improves the performance by 1.74%, reaching 79.74%.

Cityscapes test set results with multi-scale inputs during inference. ImageNet: Models pretrained on ImageNet. Coarse: Models exploit coarse annotations.

Without any pretraining, the proposed best model (Auto-DeepLab-L) significantly outperforms FRRN-B [60] by 8.6% and GridNet [17] by 10.9%.

With extra coarse annotations, Auto-DeepLab-L, without pretraining on ImageNet, achieves the test set performance of 82.1%, outperforming PSPNet and Mapillary [4], and attains the same performance as DeepLabv3+ while requiring 55.2% fewer Multi-Adds computations.

Notably, the proposed light-weight model variant, Auto-DeepLab-S, attains 80.9% on the test set, comparable to PSPNet, while using merely 10.15M parameters and 333.25B Multi-Adds.


4.2. PASCAL VOC 2012

PASCAL VOC 2012 validation set results.

The best model, Auto-DeepLab-L, with single scale inference significantly outperforms DropBlock by 20.36%.

PASCAL VOC 2012 test set results.

The proposed best model attains the performance of 85.6% on the test set, outperforming RefineNet and PSPNet.

  • It lags behind the top-performing DeepLabv3+ (with Xception-65 as the network backbone) by 2.2%. It is argued that the dataset is too small to train models from scratch, and pretraining on ImageNet is still beneficial in this case.

4.3. ADE20K

ADE20K validation set results.


[2019 CVPR] [Auto-DeepLab]
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
