# Review — Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

## Auto-DeepLab, DeepLab with Neural Architecture Search (NAS)

---

Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation (Auto-DeepLab), by Johns Hopkins University, Google, and Stanford University, 2019 CVPR, over 800 citations (Sik-Ho Tsang @ Medium)

Semantic Segmentation, Neural Architecture Search, NAS

- In **prior arts** as above, the **network-level structure uses a pre-defined pattern**: a stack of modules composed of a few cell-level structures, plus some downsampling modules.
- In this paper, **Auto-DeepLab** is proposed to **search the network-level structure in addition to the cell-level structure**, which forms a hierarchical architecture search space.
- This is a paper from Li Fei-Fei's research group.

# Outline

1. **Cell Level Search Space**
2. **Auto-DeepLab: Network Level Search Space**
3. **Auto-DeepLab: Searched Network**
4. **Results**

# 1. Cell Level Search Space

- For the **inner cell level**, the authors **reuse the one adopted in** **NASNet**, **PNASNet**, **DARTS**, and **AmoebaNet** [93, 47, 62, 49] to keep consistent with previous works.
- A **cell** is a **small fully convolutional module**. It is a directed acyclic graph consisting of *B* blocks.
- Each block is a **two-branch structure**, mapping **from 2 input tensors to 1 output tensor**. **Block** *i* in cell *l* may be specified using a **5-tuple (*I1*, *I2*, *O1*, *O2*, *C*)**, where *I1*, *I2* ∈ *Iˡᵢ* are selections of **input tensors**, *O1*, *O2* ∈ *O* are **selections of layer types applied to the corresponding input tensor**, and *C* ∈ *C* is the method used to **combine the individual outputs of the two branches** to form this block's **output tensor**, *Hˡᵢ*.
- **The set of possible layer types, *O***, consists of the following **8 operators**, all prevalent in modern CNNs: 3×3 depthwise-separable conv, 5×5 depthwise-separable conv, 3×3 atrous conv with rate 2, 5×5 atrous conv with rate 2, 3×3 average pooling, 3×3 max pooling, skip connection, and no connection (zero).

- For the set of possible **combination operators** *C*, **element-wise addition** is the only choice.
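To make the block specification concrete, here is a short Python sketch (my illustration, not the authors' code): a block is the 5-tuple (*I1*, *I2*, *O1*, *O2*, *C*), and a cell is a DAG of such blocks. The operator names follow the paper's list of 8 layer types; the example cell wiring is invented.

```python
from typing import NamedTuple

# The 8 candidate layer types in O, as listed in the paper.
OPERATORS = [
    "3x3 depthwise-separable conv",
    "5x5 depthwise-separable conv",
    "3x3 atrous conv, rate 2",
    "5x5 atrous conv, rate 2",
    "3x3 average pooling",
    "3x3 max pooling",
    "skip connection",
    "no connection (zero)",
]

class Block(NamedTuple):
    """One block of a cell: the 5-tuple (I1, I2, O1, O2, C)."""
    i1: int   # index of first input tensor (a member of I^l_i)
    i2: int   # index of second input tensor
    o1: str   # layer type applied to the first input
    o2: str   # layer type applied to the second input
    c: str    # combination method; element-wise addition is the only choice

# A cell is a DAG of B blocks. Here indices -2 and -1 denote H^{l-2} and
# H^{l-1}, and non-negative indices denote outputs of earlier blocks in
# the same cell (made-up wiring for illustration).
cell = [
    Block(-2, -1, "3x3 depthwise-separable conv", "skip connection", "add"),
    Block(-1, 0, "3x3 atrous conv, rate 2", "3x3 max pooling", "add"),
]
```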

# 2. Auto-DeepLab: Network Level Search Space

## 2.1. Principles

**Two principles** are followed:

- The spatial **resolution** of the next layer is **either twice as large**, **twice as small**, or **remains the same**.
- The **smallest** spatial resolution is **downsampled by 32**.

- The beginning of the network is **a two-layer "stem" structure**, each layer of which reduces the spatial resolution by a factor of 2. After that, there are **a total of *L* layers** with unknown spatial resolutions, with the **maximum being downsampled by 4 and the minimum being downsampled by 32**. Since adjacent layers may differ in spatial resolution by at most a factor of 2, the first layer after the stem can only be downsampled by 4 or 8.

- The proposed **network level search space** is as shown above. The goal is then to **find a good path in this *L*-layer trellis**.

## 2.2. Can Be Generalized to Prior Arts

- The proposed search space is general enough to cover many popular designs, as above.

## 2.3. Network Level Update

- **Every block's output tensor *Hˡᵢ*** is **connected to all hidden states in *Iˡᵢ***:

- In addition, **each *Oj→i*** is approximated with its **continuous relaxation *Ōj→i***, defined as:

- where:

- In other words, the ***α* values** are normalized scalars associated with each operator *Ok* ∈ *O*, normalized by **softmax**.
- As *Hˡ⁻¹* and *Hˡ⁻²* are always included in *Iˡᵢ*, and *Hˡ* is the concatenation of *Hˡ1*, …, *HˡB*, together with Eq. (1) and Eq. (2) **the cell level update** may be summarized as:
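The continuous relaxation above is a DARTS-style softmax mixture over the candidate operators. A minimal NumPy sketch shows the mechanics, using toy stand-in operators (identity, negation, zero) instead of the real convolutions and poolings:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-ins for the candidate operators O_k (the real set contains
# depthwise-separable/atrous convolutions and poolings, not these).
ops = [lambda x: x, lambda x: -x, lambda x: np.zeros_like(x)]

def mixed_op(x, alpha):
    """Continuous relaxation: O-bar(x) = sum_k softmax(alpha)_k * O_k(x)."""
    w = softmax(alpha)
    return sum(wk * op(x) for wk, op in zip(w, ops))

x = np.ones(4)
alpha = np.array([2.0, 0.0, 0.0])  # after search, argmax would keep op 0
y = mixed_op(x, alpha)
```

Because the mixture is differentiable in *α*, the architecture parameters can be trained by gradient descent alongside the network weights.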

- **Each layer *l*** will have **at most 4 hidden states** {**⁴*Hˡ***, **⁸*Hˡ***, **¹⁶*Hˡ***, **³²*Hˡ***}. **A scalar *β*** is associated with **each gray arrow** as in the figure above. The **network level update** is:
- where *s* = 4, 8, 16, 32 and *l* = 1, 2, …, *L*. **The scalars *β* are normalized by softmax** such that:
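The network level update can be sketched the same way: each resolution *s* at layer *l* aggregates *β*-weighted contributions from resolutions *s*/2, *s*, and 2*s* of the previous layer. A toy Python sketch (my illustration: scalars stand in for feature maps, and the identity stands in for the cell computation):

```python
SCALES = [4, 8, 16, 32]  # downsampling factors of the 4 hidden states

def network_update(prev, beta):
    """One network-level step.

    prev:  dict mapping scale s -> feature at layer l-1 (a scalar here).
    beta:  dict mapping (s_from, s_to) -> gray-arrow weight; in the paper
           the weights leaving each state are softmax-normalized.
    """
    new = {}
    for s in SCALES:
        total = 0.0
        for s_from in (s // 2, s, s * 2):   # only adjacent scales connect
            if s_from in prev:
                # A real cell would transform the feature; identity here.
                total += beta.get((s_from, s), 0.0) * prev[s_from]
        new[s] = total
    return new

# Example: layer l-1 only has states at downsample 4 and 8.
prev = {4: 1.0, 8: 2.0}
beta = {(4, 4): 0.5, (4, 8): 0.5, (8, 4): 0.3, (8, 8): 0.4, (8, 16): 0.3}
new = network_update(prev, beta)
```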

- At the end, **Atrous Spatial Pyramid Pooling (ASPP)** modules are attached to each spatial resolution at the *L*-th layer (atrous rates are adjusted accordingly). Their outputs are **bilinearly upsampled** to the original resolution before **being summed** to produce the prediction.

## 2.4. Optimization

- The training data is partitioned into **two disjoint sets** *trainA* and *trainB*.
- The **optimization** alternates between:

1. **Update network weights** *w* by *∇w L_trainA(w, α, β)*.
2. **Update architecture** *α, β* by *∇α,β L_trainB(w, α, β)*.

- where the loss function *L* is the **cross entropy calculated on the semantic segmentation mini-batch**. The disjoint set partition is to prevent the architecture from overfitting the training data.
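The alternating scheme can be illustrated on a deliberately tiny toy problem (my sketch: a 1-D least-squares model stands in for the network, with one "weight" *w* and one "architecture" scalar *α*; the real loss is segmentation cross entropy on mini-batches):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two disjoint splits of a toy 1-D regression dataset (y = 3x).
xA, xB = rng.normal(size=64), rng.normal(size=64)
yA, yB = 3.0 * xA, 3.0 * xB

# Toy model: prediction = (w + alpha) * x, so w stands in for the network
# weights and alpha for the architecture parameters.
w, alpha, lr = 0.0, 0.0, 0.05

def grad(w, alpha, x, y):
    """Gradient of the mean squared error w.r.t. (w + alpha); both
    alternating steps use the same loss, just on different splits."""
    r = (w + alpha) * x - y
    return 2.0 * np.mean(r * x)

for _ in range(200):
    w -= lr * grad(w, alpha, xA, yA)        # step 1: update w on trainA
    alpha -= lr * grad(w, alpha, xB, yB)    # step 2: update alpha on trainB
```

Updating the architecture on a held-out split is what discourages *α*, *β* from simply memorizing the weights' training data.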

## 2.5. Decoding Discrete Architecture

- The *β* values can be interpreted as the **"transition probability" between different "states"**.
- Quite intuitively, the goal is to **find the path with the "maximum probability" from start to end**. This path **can be decoded efficiently using the classic Viterbi algorithm**, as in the implementation.
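A minimal sketch of this decoding step, assuming *β* is given as per-layer transition matrices whose rows sum to 1 (a generic Viterbi over the trellis, not the authors' code):

```python
import numpy as np

def viterbi_decode(beta):
    """beta[l][i][j]: transition probability from state i at layer l to
    state j at layer l+1. Returns the maximum-probability state sequence,
    i.e. the discrete network-level path decoded from the beta values."""
    n_trans, n_states, _ = np.shape(beta)
    logp = np.zeros(n_states)          # uniform start, in log space
    back = []                          # back-pointers for each transition
    for l in range(n_trans):
        # scores[i, j] = best log-prob of reaching j via i
        scores = logp[:, None] + np.log(np.asarray(beta[l]) + 1e-12)
        back.append(scores.argmax(axis=0))
        logp = scores.max(axis=0)
    # Backtrack from the best final state.
    path = [int(logp.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Two-state example: the greedy first step (0 -> 0 with prob 0.9) is
# overridden when later transitions favor the other branch.
beta = [[[0.9, 0.1], [0.5, 0.5]],
        [[0.8, 0.2], [0.1, 0.9]]]
```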

# 3. Auto-DeepLab: Searched Network

## 3.1. Searching & Findings

- *L* = 12 and *B* = 5.
- The **network level search space** has **2.9 × 10⁴ unique paths**, and the **number of cell structures** is **5.6 × 10¹⁴**. So the size of the joint **hierarchical search space** is on the **order of 10¹⁹**.
- The **Atrous Spatial Pyramid Pooling (ASPP)** module used in DeepLabv3 has 5 branches: one 1×1 convolution, three 3×3 convolutions with various atrous rates, and pooled image features. During the search, ASPP is simplified to have **3 branches** instead of 5 by **only using one 3×3 convolution with atrous rate 96/*s***.
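The path count can be checked with a short dynamic program over the trellis. One counting convention that reproduces the paper's 2.9 × 10⁴ figure is to fix the first layer at the stem's downsample-4 resolution (my assumption about how the paper counts; it is not spelled out here):

```python
# Dynamic-programming count of unique paths in the L = 12 trellis with
# 4 resolution states (downsample 4, 8, 16, 32); each layer may keep,
# halve, or double the resolution of the previous one.
L, S = 12, 4
counts = [1, 0, 0, 0]   # assumption: layer 1 fixed at downsample 4
for _ in range(L - 1):
    counts = [sum(counts[j] for j in (i - 1, i, i + 1) if 0 <= j < S)
              for i in range(S)]
num_paths = sum(counts)  # -> 28657, i.e. about 2.9e4
```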

- In terms of **network level** architecture, **higher resolution is preferred at both the beginning** (it stays at downsample by 4 for longer) **and the end** (it ends at downsample by 8). There is a general tendency to **downsample in the first 3/4 layers** and **upsample in the last 1/4 layers**.

- In terms of **cell level** architecture, the conjunction of **atrous convolution and depthwise-separable convolution is often used**, suggesting that the importance of context has been learned. In contrast, atrous convolution is rarely found to be useful in cells in prior image classification work.
- (Please feel free to read the paper directly for more details.)

## 3.2. Searched Auto-DeepLab Network

- A simple encoder-decoder structure similar to DeepLabv3+ is used. Specifically, the **encoder** consists of the **best network architecture found by the search, augmented with the ASPP module**, and the **decoder** is the **same as the one in DeepLabv3+**.
- Additionally, **the "stem" structure is redesigned with three 3×3 convolutions** (with stride 2 in the first and third convolutions). The first two convolutions have 64 filters while the third convolution has 128 filters.

# 4. Experimental Results

## 4.1. Cityscapes

- The validation accuracy **steadily improves** throughout the searching process.

- Model capacity is varied by changing the filter multiplier *F*. **Higher model capacity** leads to **better performance at the cost of slower speed** (indicated by larger Multi-Adds).

- **Increasing the training iterations** from 500K to 1.5M **improves the performance by 2.8%**. Additionally, adopting **Scheduled Drop Path** [40, 93] further **improves the performance by 1.74%, reaching 79.74%**.

- Without any pretraining, the **proposed best model (Auto-DeepLab-L)** significantly **outperforms FRRN-B [60] by 8.6% and GridNet [17] by 10.9%**.
- With extra coarse annotations, **Auto-DeepLab-L**, without pretraining on ImageNet, achieves a **test set performance of 82.1%**, outperforming **PSPNet** and Mapillary [4], and attains the **same performance as DeepLabv3+** while requiring 55.2% fewer Multi-Adds computations.
- Notably, the proposed **light-weight** model variant, **Auto-DeepLab-S**, attains **80.9% on the test set**, comparable to PSPNet, while using merely 10.15M parameters and 333.25B Multi-Adds.

## 4.2. PASCAL VOC

- The best model, **Auto-DeepLab-L**, with single-scale inference, significantly **outperforms DropBlock by 20.36%**.
- The proposed best model attains a performance of **85.6% on the test set**, **outperforming RefineNet and PSPNet**.

- It **lags behind the top-performing DeepLabv3+ with Xception-65 as network backbone by 2.2%**. It is argued that the **dataset is too small to train models from scratch**, and pretraining on ImageNet is still beneficial in this case.

## 4.3. ADE20K

## Reference

[2019 CVPR] [Auto-DeepLab] Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

## 1.6. Semantic Segmentation / Scene Parsing

**2015** … **2019** [Auto-DeepLab] … **2021** [PVT, PVTv1] [SETR] **2022** [PVTv2]