Review: NASNet — Neural Architecture Search Network (Image Classification)

Outperforms or Comparable With Inception-v2, Inception-v3, Xception, ResNet, Inception-ResNet-v2, PolyNet, ResNeXt, Shake-Shake, DenseNet, DPN, SENet, MobileNetV1, ShuffleNet V1

Sik-Ho Tsang
6 min read · May 18, 2019

In this story, NASNet, by Google Brain, is reviewed. The authors propose to search for an architectural building block on a small dataset and then transfer the block to a larger dataset. Specifically, they first search for the best convolutional layer (or cell) on CIFAR-10, then apply this cell to ImageNet by stacking together more copies of it. A new regularization technique called ScheduledDropPath is also proposed, which significantly improves generalization in the NASNet models. Finally, the NASNet model achieves state-of-the-art results with smaller model size and lower complexity (FLOPs). This is a paper in 2018 CVPR with more than 400 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Neural Architecture Search (NAS) for Cells
  2. Controller Model Architecture
  3. NASNet-A, NASNet-B & NASNet-C
  4. Experimental Results

1. Neural Architecture Search (NAS) for Cells

Scalable Architectures for CIFAR-10 and ImageNet
  • In NASNet, though the overall architecture is predefined as shown above, the blocks or cells are not predefined by the authors. Instead, they are searched by a reinforcement learning method.
  • That is, the number of motif repetitions N and the number of initial convolutional filters are free parameters, used for scaling.
  • Specifically, these cells are called Normal Cell and Reduction Cell.
  • Normal Cell: Convolutional cells that return a feature map of the same dimension.
  • Reduction Cell: Convolutional cells that return a feature map whose height and width are reduced by a factor of two.
  • Only the structures of (or within) the Normal and Reduction Cells are searched by the controller RNN (Recurrent Neural Network).
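To make the scaling parameters concrete, here is a minimal PyTorch-style sketch of the stacking pattern. NormalCell and ReductionCell are simplified stand-ins (plain convolutions), not the actual searched cell structures:

```python
import torch
import torch.nn as nn

class NormalCell(nn.Module):
    """Stand-in for a searched Normal Cell: preserves the feature map size."""
    def __init__(self, channels: int):
        super().__init__()
        self.op = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, x):
        return torch.relu(self.op(x))

class ReductionCell(nn.Module):
    """Stand-in for a searched Reduction Cell: halves H and W, doubles C."""
    def __init__(self, channels: int):
        super().__init__()
        self.op = nn.Conv2d(channels, channels * 2, 3, stride=2, padding=1)

    def forward(self, x):
        return torch.relu(self.op(x))

def build_nasnet_body(N: int, F: int) -> nn.Sequential:
    """Stack N Normal Cells per stage, with a Reduction Cell between stages
    (the CIFAR-10 pattern); N and F are the free scaling parameters."""
    layers, channels = [], F
    for stage in range(3):
        layers += [NormalCell(channels) for _ in range(N)]
        if stage < 2:
            layers.append(ReductionCell(channels))
            channels *= 2
    return nn.Sequential(*layers)

body = build_nasnet_body(N=6, F=32)  # e.g. a "6 @ ..." CIFAR-10 model
```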

2. Controller Model Architecture

Controller model architecture for recursively constructing one block of a convolutional cell
  • The controller RNN recursively predicts the rest of the structure of the convolutional cell, given two initial hidden states hi and hi-1 (the outputs of the two previous cells, or the input image).
  • Step 1: Select a hidden state from hi, hi-1 or from the set of hidden states created in previous blocks.
  • Step 2: Select a second hidden state from the same options as in Step 1.
  • Step 3: Select an operation to apply to the hidden state selected in Step 1.
  • Step 4: Select an operation to apply to the hidden state selected in Step 2.
  • Step 5: Select a method to combine the outputs of Step 3 and 4 to create a new hidden state.
  • There is a set of operations to be selected:
A set of operations to be selected
  • The above five steps construct only ONE block; they are repeated B times to form a cell (see the sketch just below).
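The five selection steps can be sketched as follows, with random choices standing in for the controller RNN's softmax decisions (the operation names here are a small illustrative subset, not the paper's full set):

```python
import random

OPS = ["identity", "3x3 sep conv", "5x5 sep conv",
       "3x3 avg pool", "3x3 max pool"]           # illustrative subset
COMBINERS = ["add", "concat"]

def sample_block(hidden_states: list) -> str:
    """One block = the five controller decisions described above."""
    in1 = random.choice(hidden_states)           # Step 1: first input state
    in2 = random.choice(hidden_states)           # Step 2: second input state
    op1 = random.choice(OPS)                     # Step 3: op for first input
    op2 = random.choice(OPS)                     # Step 4: op for second input
    comb = random.choice(COMBINERS)              # Step 5: combination method
    new_state = f"{comb}({op1}({in1}), {op2}({in2}))"
    hidden_states.append(new_state)              # selectable by later blocks
    return new_state

states = ["h_{i-1}", "h_i"]                      # two initial hidden states
cell = [sample_block(states) for _ in range(5)]  # B = 5 blocks per cell
```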
  • Specifically, the controller RNN is a one-layer LSTM with 100 hidden units at each layer, making 2×5B softmax predictions for the two convolutional cells (where B is typically 5), one prediction for each architecture decision.
  • Each of the 10B predictions of the controller RNN is associated with a probability. The joint probability of a child network is the product of all probabilities at these 10B softmaxes. This joint probability is used to compute the gradient for the controller RNN.
  • The gradient is scaled by the validation accuracy of the child network to update the controller RNN such that the controller assigns low probabilities for bad child networks and high probabilities for good child networks.
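The paper trains the controller with Proximal Policy Optimization; the simpler REINFORCE-style sketch below shows just the core idea of scaling the gradient of the joint log-probability by the validation accuracy (a variance-reducing baseline is omitted):

```python
import torch

def controller_loss(step_log_probs: list, val_accuracy: float) -> torch.Tensor:
    """Surrogate loss for one sampled child network.

    step_log_probs: the log-probability (scalar tensor) of each of the
    10B softmax decisions. Their sum is the log of the joint probability;
    minimizing -accuracy * log_prob raises the probability of good child
    networks and lowers it for bad ones.
    """
    joint_log_prob = torch.stack(step_log_probs).sum()
    return -val_accuracy * joint_log_prob

# Usage: loss = controller_loss(log_probs, acc); loss.backward()
```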
Schematic diagram of the NASNet search space
  • Network motifs are constructed recursively in stages termed blocks.
  • Each block consists of the controller selecting a pair of hidden states (dark gray), operations to perform on those hidden states (yellow) and a combination operation (green).
  • The resulting hidden state is retained in the set of potential hidden states to be selected on subsequent blocks.
  • (In general, instead of designing the block with hand-crafted decisions, NASNet uses the controller RNN to find the best combination of operations from a fixed set, so as to form a cell with the best performance.)

3. NASNet-A, NASNet-B & NASNet-C

  • This search process takes 4 days using 500 GPUs, resulting in 2,000 GPU-hours, and yields several candidate convolutional cells!
  • Finally, the NASNet-A, NASNet-B & NASNet-C Normal & Reduction Cells are formed.
NASNet-A
NASNet-B (4 inputs & 4 outputs)
NASNet-C

4. Experimental Results

4.1. ScheduledDropPath

  • With ScheduledDropPath, each path in the cell is dropped out during training with a probability that increases linearly over the course of training. This significantly improves accuracy.
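A minimal sketch of this idea, assuming x is one path's output (batch dimension first) and the drop probability ramps linearly from 0 to final_drop_prob over training:

```python
import torch

def scheduled_drop_path(x: torch.Tensor, final_drop_prob: float,
                        step: int, total_steps: int,
                        training: bool = True) -> torch.Tensor:
    """Drop an entire path in the cell with a linearly increasing probability."""
    if not training:
        return x
    drop_prob = final_drop_prob * step / float(total_steps)  # linear schedule
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per example: the whole path is kept or dropped.
    shape = (x.size(0),) + (1,) * (x.dim() - 1)
    mask = torch.bernoulli(torch.full(shape, keep_prob, device=x.device))
    return x * mask / keep_prob  # rescale so the expected value is unchanged
```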

4.2. CIFAR-10

Results on CIFAR-10
  • NASNet-A (7 @ 2304) model with cutout data augmentation achieves a state-of-the-art error rate of 2.40%.
  • In "7 @ 2304", 7 means N = 7, i.e. the number of times each cell is repeated, and 2304 is the number of filters in the penultimate layer of the network.
  • It outperforms state-of-the-art approaches such as DenseNet and Shake-Shake.

4.3. ImageNet

  • The architectures found on CIFAR-10 are transferred to ImageNet, but all ImageNet model weights are trained from scratch.
Accuracy versus Computational Demand (Left) and Number of Parameters (Right)
  • NASNets achieve state-of-the-art performances with fewer floating point operations and parameters than comparable architectures.
  • The convolutional cells discovered with CIFAR-10 generalize well to ImageNet problems.
  • Importantly, the largest model achieves a new state-of-the-art performance for ImageNet (82.7%) based on single, non-ensembled predictions, surpassing the previous best published result (DPN) by 1.2%. Among unpublished works, NASNet is on par with the best reported result (SENet), also at 82.7%.
Comparison with models in a constrained computational setting

4.4. MS COCO Object Detection

  • Using NASNet-A as the feature extractor within the Faster R-CNN framework also improves object detection performance on MS COCO over comparable backbones.
mAP on COCO mini-val and test-dev datasets
Some Amazing Results by NASNet

Reference

[2018 CVPR] [NASNet]
Learning Transferable Architectures for Scalable Image Recognition

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.