# Reading: ENAS — Efficient Neural Architecture Search via Parameter Sharing (Image Classification)

## Speeds up NAS by more than 1000× in terms of GPU hours

---

In this story, **Efficient Neural Architecture Search via Parameter Sharing (ENAS)**, by Google Brain, Carnegie Mellon University, and Stanford University, is presented. In this paper:

**ENAS constructs a large computational graph, where each subgraph represents a neural network architecture**, hence forcing all architectures to share their parameters. **Sharing parameters among child models** allows ENAS to deliver strong empirical performance, whilst **using many fewer GPU-hours** than existing automatic model design approaches.

This is a paper in **2018 ICML** with over **800 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Representing Search Space as A Single Directed Acyclic Graph (DAG)**
2. **Designing Recurrent Cells**
3. **Designing Convolutional Networks**
4. **Experimental Results**

# 1. Representing Search Space as A Single Directed Acyclic Graph (DAG)

## 1.1. Directed Acyclic Graph (DAG)

- All of the graphs which NAS ends up iterating over can be viewed as sub-graphs of a larger graph. In other words, we can **represent NAS's search space using a single directed acyclic graph (DAG)**.
- Intuitively, ENAS's DAG is the superposition of all possible child models in a search space of NAS, where the **nodes represent the local computations** and the **edges represent the flow of information**.
- **ENAS's design allows parameters to be shared among all child models**, i.e. architectures, in the search space.
- In the paper, two kinds of models, a language model consisting of a recurrent cell and an image classification model consisting of a CNN, are designed by ENAS.

## 1.2. General Training Procedures

- The controller network is an LSTM with 100 hidden units. This LSTM samples decisions via softmax classifiers, in an autoregressive fashion: the decision in the previous step is fed as input embedding into the next step. At the first step, the controller network receives an empty embedding as input.
- The training procedure of ENAS consists of two interleaving phases.
- The first phase trains *W*, the shared parameters of the child models, on a whole pass through the training data set.
- The second phase trains *θ*, the parameters of the controller LSTM.
- These two phases are alternated during the training of ENAS.
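The alternation of the two phases can be sketched as a toy training loop. This is only an illustrative sketch: the `Controller` class, the reward rule, and all names below are mine, not from the paper's code, and the actual SGD step on *W* in phase 1 is stubbed out.

```python
import math
import random

random.seed(0)

class Controller:
    """Toy stand-in for the controller LSTM: holds a preference score per
    candidate architecture and samples proportionally to exp(score)."""
    def __init__(self, num_archs=4):
        self.scores = [0.0] * num_archs

    def sample(self):
        weights = [math.exp(s) for s in self.scores]
        return random.choices(range(len(self.scores)), weights)[0]

    def reinforce_step(self, arch, reward, lr=0.1):
        # Crude stand-in for a REINFORCE update on theta.
        self.scores[arch] += lr * reward

def train_enas(controller, num_epochs=3, batches_per_epoch=5, controller_steps=5):
    for _ in range(num_epochs):
        # Phase 1: train W, the shared child-model parameters, on a whole
        # pass through the training set, sampling a child architecture for
        # each batch (the actual SGD step on W is stubbed out here).
        for _ in range(batches_per_epoch):
            arch = controller.sample()

        # Phase 2: train theta, the controller parameters, from the
        # validation reward of sampled architectures.
        for _ in range(controller_steps):
            arch = controller.sample()
            reward = 1.0 if arch == 2 else 0.0  # pretend architecture 2 validates best
            controller.reinforce_step(arch, reward)
    return controller

ctrl = train_enas(Controller())
```

Over the epochs, the controller's preference drifts toward the architecture that earns reward, which mirrors how the real controller concentrates probability mass on well-performing child models.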

# 2. Designing Recurrent Cells

- To design recurrent cells, a DAG with *N* nodes is employed.
- ENAS's controller is an RNN that decides: 1) which edges are activated and 2) which computations are performed at each node in the DAG.
- To create a recurrent cell, the controller RNN samples *N* blocks of decisions. An example with *N* = 4 is shown above.

**At node 1**: The controller first samples an activation function.

**At node 2**: The controller then samples a previous index and an activation function.

**At node 3**: The controller again samples a previous index and an activation function.

**At node 4**: The controller again samples a previous index and an activation function.

**For the output, we simply average all the loose ends**, i.e. the nodes that are not selected as inputs to any other nodes. In the example, since nodes 3 and 4 were never sampled as the input to any node, the recurrent cell uses their average (h3 + h4)/2 as its output.
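The per-node decisions and the loose-end rule can be sketched as follows. This is a toy sketch under my own naming: the real controller samples from learned softmax distributions, not uniformly.

```python
import random

random.seed(0)

ACTIVATIONS = ["tanh", "ReLU", "identity", "sigmoid"]

def sample_recurrent_cell(N=4):
    """Sample the controller's N blocks of decisions for a recurrent cell:
    node 1 only picks an activation; every later node also picks a
    previous node to read from."""
    cell = []
    for node in range(1, N + 1):
        prev = None if node == 1 else random.randint(1, node - 1)
        act = random.choice(ACTIVATIONS)
        cell.append((prev, act))
    return cell

def loose_ends(cell):
    """Nodes never used as an input to any other node; the cell's output
    is the average of their hidden states."""
    used = {prev for prev, _ in cell if prev is not None}
    return [i for i in range(1, len(cell) + 1) if i not in used]

cell = sample_recurrent_cell(4)
outputs = loose_ends(cell)
```

Note that node *N* can never be chosen as an input by a later node, so it is always among the loose ends that get averaged.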

- In NAS, whenever a new model is sampled by the controller, its parameters *W* are trained from scratch. In ENAS, all recurrent cells in the search space share the same set of parameters, which is how the GPU hours are saved.

- Specifically, if the recurrent cell has *N* nodes and we **allow 4 activation functions (namely tanh, ReLU, identity, and sigmoid)**, then the **search space has (4^*N*) × *N*! configurations**.
- In the experiments, *N* = 12, which means there are **approximately 10¹⁵ models** in the search space.
- Below is the trained model:
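As a quick sanity check on the arithmetic, the size of this search space follows directly from the formula above:

```python
import math

N = 12               # nodes in the recurrent cell, as in the experiments
num_activations = 4  # tanh, ReLU, identity, sigmoid

# (4^N) x N! configurations, per the formula above
configs = num_activations**N * math.factorial(N)
print(f"{configs:.1e}")  # on the order of 10^15
```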

- **First, all non-linearities in the cell are either ReLU or tanh**, even though the search space also has two other functions: identity and sigmoid.
- **Second, this cell is a local optimum.** When we randomly pick some nodes and switch the non-linearity into identity or sigmoid, the perplexity increases by up to 8 points. Similarly, when randomly switching some ReLU nodes into tanh or vice versa, the perplexity also increases, but only by up to 3 points.
- **Third, the output of the ENAS cell is an average of 6 nodes.** This behavior is **similar to that of Mixture of Contexts (MoC)**.

# 3. Designing Convolutional Networks

- In the search space for convolutional models, the controller RNN also samples two sets of decisions at each decision block: **1) what previous nodes to connect to** and **2) what computation operation to use**.
- Specifically, at layer *k*, up to *k* − 1 mutually distinct previous indices are sampled, leading to 2^(*k*−1) possible decisions at layer *k*.
- In this example, at layer *k* = 4, the controller samples previous indices {1, 3}, so the outputs of layers 1 and 3 are concatenated along their depth dimension and sent to layer 4.
- The **6 operations** available to the controller are: **convolutions** with filter sizes **3×3** and **5×5**, **depthwise-separable convolutions** with filter sizes **3×3** and **5×5**, and **max pooling** and **average pooling** of kernel size **3×3**.

## 3.1. Macro Search Space

- Making the described set of decisions for a total of *L* times, we can sample a network of *L* layers. Since all decisions are independent, there are 6^*L* × 2^(*L*(*L*−1)/2) networks in the search space. In the experiments, *L* = 12, resulting in 1.6 × 10²⁹ possible networks.
- Finally, the network comes out as below:
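Again as a sanity check, the macro search space size can be computed from the formula above:

```python
L = 12       # layers, as in the experiments
num_ops = 6  # the six operations listed above

# 6^L x 2^(L(L-1)/2) networks, per the formula above
networks = num_ops**L * 2**(L * (L - 1) // 2)
print(f"{networks:.1e}")  # about 1.6e29, matching the paper's figure
```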

## 3.2. Micro Search Space

- Rather than designing the entire convolutional network, one can design smaller modules and then connect them together to form a network.
- The controller RNN is asked to make two sets of decisions: 1) **two previous nodes** to be used as inputs to the current node and 2) **two operations** to apply to the two sampled nodes.
- The **5 available operations** are: **identity**, **separable convolutions** with kernel sizes **3×3** and **5×5**, and **average pooling** and **max pooling** with kernel size **3×3**.
- After the operations are applied at the two nodes, their results are added.

- An example here with *B* = 4 nodes:
- **Nodes 1, 2 are input nodes**, so no decisions are needed for them. Let h1, h2 be the outputs of these nodes.
- **At node 3**: the controller samples two previous nodes and two operations. It samples node 2, node 2, separable conv 5×5, and identity. This means that h3 = sep conv 5x5(h2) + id(h2).
- **At node 4**: the controller samples node 3, node 1, avg pool 3×3, and sep conv 3×3. This means that h4 = avg pool 3x3(h3) + sep conv 3x3(h1).
- Since all nodes except node 4 were used as inputs to at least one other node, the only loose end, h4, is treated as the cell's output. If there were multiple loose ends, they would be concatenated along the depth dimension to form the cell's output.
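The dataflow of this example can be traced with toy scalar stand-ins for the operations. The multipliers below are arbitrary placeholders of my own choosing, not the real convolutions; only the wiring between nodes follows the example.

```python
# Toy scalar stand-ins for the real tensor operations, used only to
# trace the dataflow of the example cell.
def sep_conv_5x5(x): return 2 * x
def sep_conv_3x3(x): return 3 * x
def avg_pool_3x3(x): return x / 2
def identity(x):     return x

h1, h2 = 1.0, 1.0                         # outputs of the two input nodes
h3 = sep_conv_5x5(h2) + identity(h2)      # node 3: samples (node 2, node 2)
h4 = avg_pool_3x3(h3) + sep_conv_3x3(h1)  # node 4: samples (node 3, node 1)
# h4 is the only loose end, so it is the cell's output.
```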
- **A reduction cell** can also be realized from the search space we discussed, simply by: 1) sampling a computational graph from the search space, and 2) applying all operations with a stride of 2. The controller RNN hence runs for a total of 2(*B* − 2) blocks.
- As all decisions are independent, there are (5 × (*B* − 2)!)² possible cells. Since we independently sample a convolutional cell and a reduction cell, the final size of the search space is (5 × (*B* − 2)!)⁴.
- **With** *B* = 7 as in the experiments, the search space can realize 1.3 × 10¹¹ final networks, making it significantly smaller than the search space for entire convolutional networks.
- The convolution cell and reduction cell found are as follows:
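One more sanity check, computing the micro search space size from the formula above:

```python
import math

B = 7        # nodes per cell, as in the experiments
num_ops = 5  # the five operations listed above

per_cell = (num_ops * math.factorial(B - 2))**2  # possible cells of one type
total = per_cell**2  # convolution cell and reduction cell sampled independently
print(f"{total:.1e}")  # about 1.3e11, matching the paper's figure
```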

# 4. Experimental Results

## 4.1. Language Model with Penn Treebank

- Running on a single Nvidia GTX 1080Ti GPU, **ENAS finds a recurrent cell in about 10 hours.**
- The ENAS cell achieves a test perplexity of 56.3, which is on par with the existing state-of-the-art of 56.0 achieved by Mixture of Softmaxes (MoS).
- Importantly, the **ENAS cell outperforms NAS (NASNet) by more than 6 perplexity points**, whilst the search process of ENAS, in terms of GPU hours, is more than **1000× faster.**
- (I am not an expert in language models, so I cannot go into it too deeply. If interested, please read the paper.)

## 4.2. CIFAR-10

- ENAS, using the network found with the macro search space, achieves a 4.23% test error.
- If we keep the architecture, but increase the number of filters in the network’s highest layer to 512, then the test error decreases to 3.87%, which is not far away from NAS’s best model, whose test error is 3.65%.
- Impressively, **ENAS takes about 7 hours to find this architecture**, reducing the number of GPU-hours by more than **50,000× compared to NAS.**
- **ENAS takes 11.5 hours to discover the convolution cell and the reduction cell.**
- The controller RNN learned by ENAS is as good as the controller RNN learned by NAS, and **the performance gap between NAS and ENAS** is due to the fact that the **authors do not sample multiple architectures from the trained controller, train them, and then select the best architecture on the validation data.** This extra step benefits NAS's performance.

## Reference

[2018 ICML] [ENAS]

Efficient Neural Architecture Search via Parameter Sharing

## Image Classification

[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [Cutout] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [Deep Roots] [MSDNet] [ShuffleNet V1] [SENet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [NASNet] [ENAS] [PNASNet] [AmoebaNet]