Reading: ENAS — Efficient Neural Architecture Search via Parameter Sharing (Image Classification)

Speeds up NAS by more than 1000× in terms of GPU hours

Sik-Ho Tsang
8 min read · Oct 3, 2020

In this story, Efficient Neural Architecture Search via Parameter Sharing (ENAS), by Google Brain, Carnegie Mellon University, and Stanford University, is presented. In this paper:

  • ENAS constructs a large computational graph, where each subgraph represents a neural network architecture, hence forcing all architectures to share their parameters.
  • Sharing parameters among child models allows ENAS to deliver strong empirical performance, while using far fewer GPU-hours than existing automatic model design approaches.

This is a paper in 2018 ICML with over 800 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Representing Search Space as A Single Directed Acyclic Graph (DAG)
  2. Designing Recurrent Cells
  3. Designing Convolutional Networks
  4. Experimental Results

1. Representing Search Space as A Single Directed Acyclic Graph (DAG)

1.1. Directed Acyclic Graph (DAG)

The graph represents the entire search space while the red arrows define a model in the search space, which is decided by a controller. Here, node 1 is the input to the model whereas nodes 3 and 6 are the model’s outputs.
  • All of the graphs which NAS ends up iterating over can be viewed as sub-graphs of a larger graph. In other words, we can represent NAS’s search space using a single directed acyclic graph (DAG).
  • Intuitively, ENAS’s DAG is the superposition of all possible child models in a search space of NAS, where the nodes represent the local computations and the edges represent the flow of information.
  • ENAS’s design allows parameters to be shared among all child models, i.e. architectures, in the search space (a minimal sketch of this sharing follows after this list).
  • In the paper, ENAS is used to design two kinds of models: a language model built from a recurrent cell, and an image classification model built from convolutional layers.
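To make the parameter-sharing idea concrete, below is a minimal sketch (my own illustration, not the authors' code): a single table of weight matrices keyed by DAG edges, which every sampled child architecture reads from, so an update made while training one child is reused by the others. Names such as `shared_weights`, `get_weight`, and `hidden_dim` are assumptions for illustration.

```python
import numpy as np

# Minimal sketch of parameter sharing: one weight matrix per DAG edge
# (j -> l), created lazily and reused by every child model that activates
# that edge. Illustrative only; sizes and names are assumptions.
hidden_dim = 4
shared_weights = {}  # edge (j, l) -> weight matrix, shared across all children

def get_weight(j, l):
    """Return the shared weight for edge j -> l, creating it on first use."""
    if (j, l) not in shared_weights:
        shared_weights[(j, l)] = np.random.randn(hidden_dim, hidden_dim) * 0.1
    return shared_weights[(j, l)]

# Two different child architectures that both use edge (1, 2) touch the same
# matrix, so an update made while training one child is inherited by the other.
child_a = [(1, 2), (2, 3)]
child_b = [(1, 2), (1, 3)]
for child in (child_a, child_b):
    for j, l in child:
        get_weight(j, l)

print(len(shared_weights))  # 3 distinct matrices; edge (1, 2) is shared
```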

1.2. General Training Procedures

  • The controller network is an LSTM with 100 hidden units. This LSTM samples decisions via softmax classifiers, in an autoregressive fashion: the decision in the previous step is fed as input embedding into the next step. At the first step, the controller network receives an empty embedding as input.
  • The training procedure of ENAS consists of two interleaving phases.
  • The first phase trains W, the shared parameters of the child models, on a whole pass through the training data set.
  • The second phase trains θ, the parameters of the controller LSTM.
  • These two phases are alternated during the training of ENAS (a minimal sketch of this loop follows below).
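Below is a minimal sketch of this alternating schedule, assuming hypothetical objects `controller` and `shared_model` and a hypothetical `validation_reward` helper; the paper trains W with SGD on training batches and θ with REINFORCE, using a reward computed on validation data.

```python
# Sketch of ENAS's two interleaved phases (hypothetical helper objects, not
# the authors' API): phase 1 updates the shared parameters W on the training
# set; phase 2 updates the controller parameters theta with REINFORCE, using
# a reward measured on held-out validation data.
def train_enas(num_epochs, controller, shared_model, train_data, valid_data,
               controller_steps=2000, baseline_decay=0.99):
    baseline = 0.0  # moving-average baseline reduces REINFORCE variance
    for epoch in range(num_epochs):
        # Phase 1: one whole pass over the training data, updating W only.
        for batch in train_data:
            arch = controller.sample()              # fix one child model
            loss = shared_model.loss(arch, batch)   # forward pass with shared W
            shared_model.sgd_step(loss)             # gradient step on W

        # Phase 2: a fixed number of policy-gradient steps, updating theta only.
        for _ in range(controller_steps):
            arch, log_prob = controller.sample(return_log_prob=True)
            reward = validation_reward(shared_model, arch, valid_data)
            baseline = baseline_decay * baseline + (1 - baseline_decay) * reward
            controller.policy_gradient_step(log_prob, reward - baseline)
```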

2. Designing Recurrent Cells

An example of a recurrent cell in the search space with 4 computational nodes. Left: The computational DAG that corresponds to the recurrent cell. The red edges represent the flow of information in the graph. Middle: The recurrent cell. Right: The outputs of the controller RNN that result in the cell in the middle and the DAG on the left.
  • To design recurrent cells, a DAG with N nodes is employed.
  • Right: ENAS’s controller is an RNN that decides: 1) which edges are activated and 2) which computations are performed at each node in the DAG.
  • To create a recurrent cell, the controller RNN samples N blocks of decisions. An example with N = 4 is shown above.
  • At node 1: The controller first samples an activation function.
  • At node 2: The controller then samples a previous index and an activation function.
  • At node 3: The controller again samples a previous index and an activation function.
  • At node 4: The controller again samples a previous index and an activation function.
  • For the output, we simply average all the loose ends, i.e. the nodes that are not selected as inputs to any other nodes. In the example, since indices 3 and 4 were never sampled as inputs to any node, the recurrent cell uses their average (h3 + h4)/2 as its output (a toy sketch of this construction follows after this list).
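Below is a toy NumPy sketch of this node-by-node construction (my own illustration; the real cell also feeds the input x_t into node 1 and has further details omitted here). It uses one possible set of sampled decisions for the 4-node example, so that nodes 3 and 4 end up as the loose ends and are averaged at the end.

```python
import numpy as np

# Toy sketch of assembling the example recurrent cell from sampled decisions.
# Illustrative only: the real cell also feeds the input x_t into node 1 and
# has further details that are omitted here.
activations = {
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
    "identity": lambda z: z,
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
}

def run_cell(decisions, h_prev, weights):
    """decisions[l-1] = (previous index, activation name); index 0 is h_{t-1}."""
    h, used = {0: h_prev}, set()
    for node, (prev, act) in enumerate(decisions, start=1):
        h[node] = activations[act](weights[(prev, node)] @ h[prev])
        used.add(prev)
    loose_ends = [n for n in h if n != 0 and n not in used]
    return np.mean([h[n] for n in loose_ends], axis=0)  # average the loose ends

dim, rng = 4, np.random.default_rng(0)
# N = 4 example: node 1 applies tanh to h_{t-1}; node 2 reads node 1 with ReLU;
# node 3 reads node 2 with ReLU; node 4 reads node 1 with tanh. Nodes 3 and 4
# are never used as inputs, so the output is (h3 + h4) / 2.
decisions = [(0, "tanh"), (1, "relu"), (2, "relu"), (1, "tanh")]
weights = {(prev, node): rng.standard_normal((dim, dim)) * 0.1
           for node, (prev, _) in enumerate(decisions, start=1)}
print(run_cell(decisions, rng.standard_normal(dim), weights).shape)  # (4,)
```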

In NAS, whenever the controller samples a new model, its parameters W are trained from scratch. In ENAS, all recurrent cells in the search space share the same set of parameters. With shared parameters, GPU hours are saved.

  • Specifically, if the recurrent cell has N nodes and we allow 4 activation functions (namely tanh, ReLU, identity, and sigmoid), then the search space has 4^N × N! configurations.
  • In the experiments, N = 12, which means there are approximately 10¹⁵ models in the search space.
  • Below is the trained model:
The RNN cell ENAS discovered for Penn Treebank
  • First, all non-linearities in the cell are either ReLU or tanh, even though the search space also has two other functions: identity and sigmoid.
  • Second, this cell is a local optimum. When we randomly pick some nodes and switch their non-linearity to identity or sigmoid, the perplexity increases by up to 8 points. Similarly, when randomly switching some ReLU nodes to tanh or vice versa, the perplexity also increases, but only by up to 3 points.
  • Third, the output of the ENAS cell is an average of 6 nodes. This behavior is similar to that of Mixture of Contexts (MoC).

3. Designing Convolutional Networks

An example run of a recurrent cell in the search space with 4 computational nodes, which represent 4 layers in a convolutional network. Top: The output of the controller RNN. Bottom Left: The computational DAG corresponding to the network’s architecture. Red arrows denote the active computational paths. Bottom Right: The complete network. Dotted arrows denote skip connections.
  • In the search space for convolutional models, the controller RNN also samples two sets of decisions at each decision block: 1) what previous nodes to connect to and 2) what computation operation to use.
  • Specifically, at layer k, up to k − 1 mutually distinct previous indices are sampled, leading to 2^(k−1) possible decisions at layer k.
  • In this example, at layer k = 4, the controller samples previous indices {1, 3}, so the outputs of layers 1 and 3 are concatenated along their depth dimension and sent to layer 4.
  • The 6 operations available to the controller are: convolutions with filter sizes 3×3 and 5×5, depthwise-separable convolutions with filter sizes 3×3 and 5×5, and max pooling and average pooling with kernel size 3×3 (a minimal sketch of this decision block follows after this list).
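Below is a minimal sketch of one such decision block (my own illustration with placeholder operation names): layer k samples one of the 6 operations plus a subset of the k − 1 previous layers, and the selected outputs are concatenated along their depth dimension.

```python
import numpy as np

# Sketch of one controller decision block in the macro search space
# (illustrative; the operation names are placeholders, no real convolutions
# are applied here).
rng = np.random.default_rng(0)
ops = ["conv3x3", "conv5x5", "sep_conv3x3", "sep_conv5x5",
       "max_pool3x3", "avg_pool3x3"]

def sample_layer_decisions(k):
    """At layer k: pick 1 of the 6 operations and a subset of the k-1 previous layers."""
    op = ops[rng.integers(len(ops))]
    skips = [j for j in range(1, k) if rng.random() < 0.5]  # 2^(k-1) possibilities
    return op, skips

# Toy feature maps for layers 1..3, all sharing the same spatial size (8x8).
outputs = {j: rng.standard_normal((16, 8, 8)) for j in range(1, 4)}

op, skips = sample_layer_decisions(k=4)
if skips:
    # Concatenate the selected previous outputs along the depth (channel) axis.
    layer4_input = np.concatenate([outputs[j] for j in skips], axis=0)
else:
    # No previous index sampled: as an illustrative fallback, use layer 3's output.
    layer4_input = outputs[3]
print(op, skips, layer4_input.shape)
```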

3.1. Macro Search Space

  • Making the described set of decisions for a total of L times, we can sample a network of L layers. Since all decisions are independent, there are 6^L × 2^(L(L−1)/2) networks in the search space. In the experiments, L = 12, resulting in 1.6 × 10²⁹ possible networks (a quick check of this count is sketched below).
  • Finally, the discovered network is shown below:
ENAS’s discovered network from the macro search space for image classification
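As a quick sanity check of the quoted search-space size (my own arithmetic): L independent operation choices and L(L − 1)/2 independent binary skip decisions give 6^L × 2^(L(L−1)/2) networks.

```python
# Quick check of the macro search-space size: 6 operations per layer and
# one binary skip decision for every pair of layers.
L = 12
num_networks = 6**L * 2**(L * (L - 1) // 2)
print(f"{num_networks:.1e}")  # ~1.6e+29, matching the figure quoted above
```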

3.2. Micro Search Space

  • Rather than designing the entire convolutional network, one can design smaller modules and then connect them together to form a network.
  • The controller RNN is asked to make two sets of decisions: 1) two previous nodes to be used as inputs to the current node and 2) two operations to apply to the two sampled nodes.
  • The 5 available operations are: identity, separable convolutions with kernel sizes 3×3 and 5×5, and average pooling and max pooling with kernel size 3×3.
  • The results of the two operations are then added together.
An example run of the controller for the search space over convolutional cells.
  • An example here with B = 4 nodes:
  • Nodes 1, 2 are input nodes, so no decisions are needed for them. Let h1, h2 be the outputs of these nodes.
  • At node 3: the controller samples two previous nodes and two operations. It samples node 2, node 2, separable conv 5x5, and identity. This means that h3 = sep conv 5x5(h2) + id(h2).
  • At node 4: the controller samples node 3, node 1, avg pool 3x3, and sep conv 3x3. This means that h4 = avg pool 3x3(h3) + sep conv 3x3(h1).
  • Since all nodes but h4 were used as inputs to at least one other node, the only loose end, h4, is treated as the cell’s output. If there are multiple loose ends, they are concatenated along the depth dimension to form the cell’s output (a toy sketch of this example appears after this list).
  • A reduction cell can also be realized from the same search space, simply by: 1) sampling a computational graph from the search space, and 2) applying all operations with a stride of 2. The controller RNN therefore runs for a total of 2(B − 2) blocks: B − 2 for the convolutional cell and B − 2 for the reduction cell.
  • As all decisions are independent, there are (5 × (B − 2)!)² possible cells. Since we independently sample for a convolutional cell and a reduction cell, the final size of the search space is (5×(B−2)!)⁴.
  • With B = 7 as in the experiments, the search space can realize 1.3×10¹¹ final networks, making it significantly smaller than the search space for entire convolutional networks.
  • And the convolution cell and reduction cell are as follows:
ENAS cells discovered in the micro search space
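Below is a toy NumPy sketch of assembling the B = 4 example cell (my own illustration; the "operations" are placeholder functions on vectors, not real separable convolutions or pooling), keeping just the wiring: each node adds the results of two operations applied to its two sampled inputs, and the loose ends are concatenated to form the cell output.

```python
import numpy as np

# Toy sketch of the B = 4 micro-cell example. The operations are stand-in
# functions on vectors (illustrative only); only the wiring logic is kept.
rng = np.random.default_rng(0)
dim = 4
ops = {
    "identity": lambda z: z,
    "sep_conv3x3": lambda z: np.tanh(z),        # placeholder, not a real convolution
    "sep_conv5x5": lambda z: np.tanh(2 * z),    # placeholder
    "avg_pool3x3": lambda z: z - z.mean(),      # placeholder
    "max_pool3x3": lambda z: np.maximum(z, 0),  # placeholder
}

h = {1: rng.standard_normal(dim), 2: rng.standard_normal(dim)}  # the two input nodes
used = set()

def add_node(node, inp_a, op_a, inp_b, op_b):
    """node = op_a(h[inp_a]) + op_b(h[inp_b]); record which inputs were used."""
    h[node] = ops[op_a](h[inp_a]) + ops[op_b](h[inp_b])
    used.update([inp_a, inp_b])

add_node(3, 2, "sep_conv5x5", 2, "identity")     # h3 = sep conv 5x5(h2) + id(h2)
add_node(4, 3, "avg_pool3x3", 1, "sep_conv3x3")  # h4 = avg pool 3x3(h3) + sep conv 3x3(h1)

loose_ends = [n for n in h if n not in used]      # here only node 4
cell_output = np.concatenate([h[n] for n in loose_ends])
print(loose_ends, cell_output.shape)              # [4] (4,)
```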

4. Experimental Results

4.1. Language Model with Penn Treebank

Test perplexity on Penn Treebank
  • Running on a single Nvidia GTX 1080Ti GPU, ENAS finds a recurrent cell in about 10 hours.
  • The ENAS cell achieves a test perplexity of 56.3, which is on par with the existing state-of-the-art of 56.0 achieved by Mixture of Softmaxes (MoS).
  • Importantly, the ENAS cell outperforms NAS (NASNet) by more than 6 perplexity points, whilst the search process of ENAS, in terms of GPU hours, is more than 1000× faster.
  • (I am not an expert in language modeling, so I will not go too deep into it. If interested, please read the paper.)

4.2. CIFAR-10

Classification errors of ENAS and baselines on CIFAR-10
  • ENAS, using the network found with the macro search space, achieves a 4.23% test error.
  • If we keep the architecture, but increase the number of filters in the network’s highest layer to 512, then the test error decreases to 3.87%, which is not far away from NAS’s best model, whose test error is 3.65%.
  • Impressively, ENAS takes about 7 hours to find this architecture, reducing the number of GPU-hours by more than 50,000× compared to NAS.
  • ENAS takes 11.5 hours to discover the convolution cell and the reduction cell.
  • The authors believe that the controller RNN learned by ENAS is as good as the controller RNN learned by NAS, and that the performance gap between NAS and ENAS arises because they do not sample multiple architectures from the trained controller, train them, and then select the best architecture on the validation data. This extra step benefits NAS’s performance.
