Reading: PNASNet — Progressive Neural Architecture Search (Image Classification)
Less Compute for Searching Models Compared With NASNet. Outperforms SENet, NASNet-A, and AmoebaNets Under the Same Model Capacity.
In this story, Progressive Neural Architecture Search (PNASNet), by Johns Hopkins University, Google AI, and Stanford University, is presented. In this paper:
- A sequential model-based optimization (SMBO) strategy is used such that a surrogate model is learnt to guide the search through structure space.
- Direct comparison under the same search space shows that PNAS is up to 5 times more efficient than the Reinforcement Learning (RL) method in terms of the number of models evaluated, and 8 times faster in terms of total compute.
This is a paper in 2018 ECCV with over 700 citations. (Sik-Ho Tsang @ Medium)
Outline
- Cell Topologies
- PNAS: Progressive Neural Architecture Search
- Experimental Results
1. Cell Topologies
- First, the outer, overall network structure is defined (Middle and Right).
- Then, a cell is defined as a fully convolutional network that maps an H×W×F tensor to another H’×W’×F’ tensor.
- A block b in a cell c is specified as a 5-tuple, (I1, I2, O1, O2, C).
- Two inputs I1 and I2, two operations O1 and O2 applied to those inputs, and one combination operator C that combines the two results into the block's output.
- The set of possible inputs, Ib, is the set of all previous blocks in this cell, {H^c_1, …, H^c_{b−1}}, plus the output of the previous cell, H^{c−1}_B, plus the output of the previous-previous cell, H^{c−2}_B.
- The operator space O has 8 functions: 3×3 depthwise-separable convolution, 5×5 depthwise-separable convolution, 7×7 depthwise-separable convolution, 1×7 followed by 7×1 convolution, identity, 3×3 average pooling, 3×3 max pooling, and 3×3 dilated convolution.
- |C| = 1: only element-wise addition is used as the combination operator, since concatenation was found to be never chosen in prior work (NASNet).
- Thus, the number of possible structures for block b is |Bb| = |Ib|² × |O|² × |C|, where |Ib| = 2 + (b − 1), |O| = 8, and |C| = 1.
- If we allow cells of up to B = 5 blocks, the total number of cell structures is |B1:5| = 2²×8² × 3²×8² × 4²×8² × 5²×8² × 6²×8² ≈ 5.6×10¹⁴ (see the quick check after this list).
- There are certain symmetries in this space that allow us to prune it to a more reasonable size. But the size is still large.
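To make the counting concrete, below is a quick Python check (not from the paper) that multiplies out |Bb| = |Ib|² × |O|² × |C| for b = 1, …, 5 and reproduces the 5.6×10¹⁴ figure:

```python
# Quick check of the search-space size |B_1:5| for cells of up to B = 5 blocks.
# Each block b chooses 2 inputs out of |I_b| = 2 + (b - 1) candidates, 2 operations
# out of |O| = 8, and 1 combination operator (|C| = 1, addition only).
NUM_OPS = 8     # |O|
NUM_COMB = 1    # |C|
B = 5

total = 1
for b in range(1, B + 1):
    num_inputs = 2 + (b - 1)                        # |I_b|
    total *= num_inputs**2 * NUM_OPS**2 * NUM_COMB  # |B_b|
print(f"|B_1:{B}| = {total:.1e}")                   # ~5.6e+14, matching the text
```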
2. PNAS: Progressive Neural Architecture Search
2.1. Overall PNAS
- It is difficult to directly navigate in an exponentially large search space, especially at the beginning where there is no knowledge of what makes a good model.
- Authors propose to search the space in a progressive order, simplest models first. In particular, they start by constructing all possible cell structures from B1 (i.e., composed of 1 block) and adding them to a queue. All the models in the queue are trained and evaluated (in parallel), and each one is then expanded by adding all of the possible block structures from B2. But this still leaves far too many candidate cells to train:
- A learned predictor function is used instead. This predictor evaluates all the candidate cells cheaply and picks the K most promising ones.
- i.e., the predict function within the for loop of the pseudocode shown above (a runnable toy sketch of this loop is given after this list).
- Two predictors are tried: LSTM and MLP. Both models are trained using L1 loss.
- K = 256 networks are kept at each stage (136 for stage 1, since there are only 136 unique cells with 1 block).
- A maximum cell depth of B = 5 blocks is used, F = 24 filters are used in the first convolutional cell, the cells are unrolled N = 2 times, and each child network is trained for 20 epochs.
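To make the loop above concrete, here is a small, runnable toy sketch of the progressive search (my own re-implementation, not the authors' code): cells are lists of (I1, I2, O1, O2, C) tuples, and train_and_eval / fit_predictor are trivial stand-ins for real child-network training and for the LSTM/MLP surrogate described next.

```python
# Toy sketch of the PNAS progressive search loop (SMBO); not the authors' code.
import random

NUM_OPS, B, K = 8, 3, 16   # toy sizes; the paper uses |O| = 8, B = 5, K = 256

def enumerate_blocks(b):
    """All possible blocks at position b: |I_b|^2 * |O|^2 * |C| choices."""
    inputs = range(2 + (b - 1))             # |I_b| = 2 + (b - 1)
    ops = range(NUM_OPS)                    # |O| = 8
    return [(i1, i2, o1, o2, 'add')         # |C| = 1: addition only
            for i1 in inputs for i2 in inputs
            for o1 in ops for o2 in ops]

def train_and_eval(cell):
    """Stand-in for training a child network for 20 epochs; returns a fake accuracy."""
    return random.random()

def fit_predictor(cells, accs):
    """Stand-in surrogate; the real one is an ensemble of LSTM/MLP regressors."""
    mean = sum(accs) / len(accs)
    return lambda cell: mean + random.gauss(0, 0.01)

cells = [[blk] for blk in enumerate_blocks(1)]       # stage 1: all 1-block cells
accs = [train_and_eval(c) for c in cells]
predict = fit_predictor(cells, accs)

for b in range(2, B + 1):
    # Expand every cell by one more block, in all possible ways (too many to train).
    expanded = [c + [blk] for c in cells for blk in enumerate_blocks(b)]
    # Score the expansions cheaply with the surrogate; keep only the K most promising.
    cells = sorted(expanded, key=predict, reverse=True)[:K]
    accs = [train_and_eval(c) for c in cells]        # train only the top K for real
    predict = fit_predictor(cells, accs)             # refit the surrogate on new data

best_acc, best_cell = max(zip(accs, cells), key=lambda t: t[0])
print(f"best toy cell: {len(best_cell)} blocks, accuracy {best_acc:.3f}")
```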
2.2. LSTM Predictor
- The LSTM reads a sequence of length 4b (representing I1, I2, O1, and O2 for each block); the input at each step is a one-hot vector of size |Ib| or |O|, followed by an embedding lookup.
- A shared embedding of dimension D is used for the tokens I1, I2, and another shared embedding for O1, O2. The final LSTM hidden state goes through a fully-connected layer and sigmoid to regress the validation accuracy.
- Hidden state size and embedding size are both 100.
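For concreteness, a minimal PyTorch sketch of such an LSTM predictor is given below. Only the sizes (D = 100 for the embeddings and hidden state, a sigmoid output regressing validation accuracy) follow the text; the token layout, max_inputs = 7, and everything else are my own assumptions, not the authors' code.

```python
# Minimal sketch of the LSTM surrogate predictor (assumed token layout, not authors' code).
import torch
import torch.nn as nn

class LSTMPredictor(nn.Module):
    def __init__(self, max_inputs=7, num_ops=8, d=100):
        super().__init__()
        self.input_emb = nn.Embedding(max_inputs, d)  # shared embedding for I1, I2 tokens
        self.op_emb = nn.Embedding(num_ops, d)        # shared embedding for O1, O2 tokens
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.fc = nn.Linear(d, 1)

    def forward(self, inputs, ops):
        # inputs, ops: LongTensors of shape (batch, b, 2) holding the (I1, I2) and
        # (O1, O2) tokens of each of the b blocks.
        x = torch.cat([self.input_emb(inputs), self.op_emb(ops)], dim=2)  # (batch, b, 4, d)
        x = x.flatten(1, 2)                     # the LSTM reads a sequence of length 4b
        _, (h, _) = self.lstm(x)                # final hidden state: (1, batch, d)
        return torch.sigmoid(self.fc(h[-1]))    # regressed validation accuracy in [0, 1]

# Toy usage: score two 3-block cells (token values are arbitrary here); training would
# minimize an L1 loss against the observed validation accuracies, per the text.
inputs, ops = torch.randint(0, 5, (2, 3, 2)), torch.randint(0, 8, (2, 3, 2))
print(LSTMPredictor()(inputs, ops).shape)       # torch.Size([2, 1])
```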
2.3. MLP Predictor
- Each token is embedded into a D-dimensional vector, the embeddings within each block are concatenated to get a 4D-dimensional vector, and the result is then averaged over blocks.
- The embedding size is 100, and 2 fully connected layers with 100 hidden units each are used.
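A companion sketch of the MLP predictor, under the same caveats as the LSTM sketch above (assumed token layout; only the "concatenate per block, then average over blocks" structure and the layer sizes follow the text):

```python
# Minimal sketch of the MLP surrogate predictor (assumed token layout, not authors' code).
import torch
import torch.nn as nn

class MLPPredictor(nn.Module):
    def __init__(self, max_inputs=7, num_ops=8, d=100, hidden=100):
        super().__init__()
        self.input_emb = nn.Embedding(max_inputs, d)
        self.op_emb = nn.Embedding(num_ops, d)
        self.mlp = nn.Sequential(               # 2 hidden FC layers of 100 units + output
            nn.Linear(4 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, inputs, ops):
        # inputs, ops: (batch, b, 2) token tensors, laid out as in the LSTM sketch.
        x = torch.cat([self.input_emb(inputs), self.op_emb(ops)], dim=2)  # (batch, b, 4, d)
        x = x.flatten(2)            # concatenate within each block: (batch, b, 4D)
        x = x.mean(dim=1)           # average over blocks: (batch, 4D)
        return torch.sigmoid(self.mlp(x))       # regressed validation accuracy

print(MLPPredictor()(torch.randint(0, 5, (2, 3, 2)), torch.randint(0, 8, (2, 3, 2))).shape)
```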
2.4. Training The Predictors
- When training the predictor, one approach is to update its parameters on the new data using a few steps of SGD.
- However, since the sample size is very small, an ensemble of 5 predictors is used, each fitted (from scratch) to 4/5 of all the data available at each step of the search process. It is observed empirically that this reduces the variance of the predictions.
- Therefore, at step b of PNAS, the predictor is trained on the observed performance of cells with up to b blocks, but is applied to cells with b + 1 blocks.
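The 5-fold ensembling can be sketched as follows; fit_one_predictor is a trivial stand-in for training one LSTM/MLP predictor from scratch:

```python
# Sketch of the 5-member predictor ensemble used to reduce variance (not authors' code).
def fit_one_predictor(cells, accs):
    """Stand-in for fitting a single LSTM/MLP predictor from scratch (L1 loss in the text)."""
    mean = sum(accs) / len(accs)
    return lambda cell: mean        # a constant predictor is enough for this sketch

def fit_ensemble(cells, accs, n_folds=5):
    """Fit 5 predictors, each on a different 4/5 of the observed (cell, accuracy) pairs."""
    members = []
    for k in range(n_folds):
        fold = [(c, a) for i, (c, a) in enumerate(zip(cells, accs)) if i % n_folds != k]
        members.append(fit_one_predictor([c for c, _ in fold], [a for _, a in fold]))
    return members

def predict_ensemble(members, cell):
    """Average the members' predictions for an unseen (e.g. b+1-block) cell."""
    return sum(m(cell) for m in members) / len(members)

# Toy usage with dummy cells and accuracies.
cells = [f"cell_{i}" for i in range(20)]
accs = [0.50 + 0.01 * i for i in range(20)]
print(round(predict_ensemble(fit_ensemble(cells, accs), "cell_new"), 3))
```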
3. Experimental Results
3.1. LSTM vs MLP
- It is found that the RNN (LSTM) is better than the MLP when ensembling is not used.
- But the MLP is better when ensembling is used.
3.2. Search Efficiency
- As shown above, PNAS reaches higher accuracy than the NAS procedure used in NASNet while training a smaller number of models.
3.3. SOTA Comparison on CIFAR-10
- PNASNet-5 denotes the best CNN discovered on CIFAR using PNAS, as shown at the left of the first figure.
- Let M1 be the number of models trained during search, and let E1 be the number of examples used to train each model. The total number of examples is therefore M1E1.
- However, for methods with the additional reranking stage, the top M2 models from the search procedure are trained using E2 examples each, before returning the best.
- Total cost: M1E1 + M2E2 (a small bookkeeping sketch follows this list).
- PNASNet can find a model with the same accuracy as NASNet, but using 21 times less compute.
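The cost bookkeeping from the bullets above can be written out directly; the numbers in the example calls are hypothetical placeholders, not the paper's figures:

```python
# Search cost in "examples processed": M1 models trained on E1 examples each during
# search, plus (for methods with a reranking stage) the top M2 models retrained on E2 each.
def total_cost(M1, E1, M2=0, E2=0):
    return M1 * E1 + M2 * E2

# Hypothetical placeholder numbers, just to show the bookkeeping:
print(total_cost(M1=1_000, E1=900_000))                          # search only
print(total_cost(M1=1_000, E1=900_000, M2=100, E2=10_000_000))   # search + reranking
```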
3.4. SOTA Comparison on ImageNet
- Mobile: Input image size is 224×224, and the number of multiply-add operations is under 600M.
- Large: Input image size is 331×331.
3.4.1. Mobile Setting
- PNASNet-5 achieves slightly better performance than NASNet-A (74.2% top-1 accuracy for PNAS vs 74.0% for NASNet-A).
- Both methods significantly surpass the previous state-of-the-art, which includes the manually designed MobileNetV1 (70.6%) and ShuffleNet V1 (70.9%).
3.4.2. Large Setting
- PNASNet-5 achieves higher performance (82.9% top-1, 96.2% top-5) than previous state-of-the-art approaches, including SENet, NASNet-A, and AmoebaNets under the same model capacity.
3.5. Intermediate Level PNASNet Models
- The best models found at the smaller, intermediate levels, namely b = 1, 2, 3, 4, are also reported. We call these models PNASNet-{1, 2, 3, 4}.
Reference
[2018 ECCV] [PNASNet]
Progressive Neural Architecture Search
Image Classification
[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [Deep Roots] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [PNASNet] [AmoebaNet]