Reading: AmoebaNet — Regularized Evolution for Image Classifier Architecture Search (Image Classification)

Outperforms Inception-ResNet, ResNeXt, PolyNet, and DPN, On Par With NASNet

Sik-Ho Tsang
7 min readSep 13, 2020


In this story, Regularized Evolution for Image Classifier Architecture Search (AmoebaNet), by Google Brain, is presented.

  • An image classifier, AmoebaNet-A, is evolved, by evolutionary algorithm, that surpasses hand-designs for the first time.
  • The tournament selection evolutionary algorithm is modified by introducing an age property to favor the younger genotypes.
  • 83.9% top-1 / 96.6% top-5 ImageNet accuracy is obtained.

This is a paper in 2019 AAAI with over 700 citations. (Sik-Ho Tsang @ Medium)


  1. Search Space
  2. Aging Evolutionary Algorithm
  3. Experimental Results

1. Search Space

1.1. Fixed Outer Structure

NASNet Search Space
  • First, it is difficult to search the best network architecture from scratch. Something needs to be fixed.
  • Thus, we got the fixed outer structure as shown at the left of the figure. We got the normal cells to extract the features and also the reduction cells to extract and downsample the features.
  • Normal cells are arranged in three stacks of N cells.
  • At the middle of the figure, it is the detailed view for normal cell. Each cell receives a direct input from the previous cell and a skip input from the cell before it.
  • The goal of the architecture-search process is to discover the architectures of the normal and reduction cells, just like the one at the right of the figure.

1.2. Cell Construction Restrictions and Rules

Cell Example
  • As mentioned, there are two inputs. The two cell input tensors are considered hidden states “0” and “1”.
  • More hidden states are then constructed through pairwise combinations.
  • It consists in applying an operation (or op) to an existing hidden state, applying another op to another existing hidden state, and adding the results to produce a new hidden state.
  • Ops belong to a fixed set of common convnet operations such as convolutions and pooling layers.
  • The first pairwise combination applies a 3x3 average pool op to hidden state 0 and a 3x3 max pool op to hidden state 1, in order to produce hidden state 2.
  • The next pairwise combination can now choose from hidden states 0, 1, and 2 to produce hidden state 3 (chose 0 and 1 in Figure 1), and so on.
  • After exactly five pairwise combinations, any hidden states that remain unused (hidden states 5 and 6 in Figure 1) are concatenated to form the output of the cell (hidden state 7).

1.3. Free Parameters to Scale the Network

  • Once the architecture is specified, the model still has two free parameters that can be used to alter its size (and its accuracy): the number of normal cells per stack (N) and the number of output filters of the convolution ops (F).
  • N and F are determined manually.

2. Aging Evolutionary Algorithm

2.1. Overall Algorithm

Aging Evolution
  • The above evolutionary algorithm keeps a population of P trained models throughout the experiment. The population is initialized with models with random architectures.
  • At each cycle, it samples S random models from the population, each drawn uniformly at random with replacement.
  • The model with the highest validation fitness within this sample is selected as the parent.
  • A new architecture, called the child, is constructed from the parent by the application of a transformation called a mutation. A mutation causes a simple and random modification of the architecture.
  • Once the child architecture is constructed, it is then trained, evaluated, and added to the population. This process is called tournament selection.
  • Non-aging evolution: In convention, It is common in tournament selection, to discard (or kill) the worst model in the random S-sample.
  • Proposed aging evolution: killing the oldest model in the population — that is, removing from the population the model that was trained the earliest (“remove dead from left of pop” in Algorithm 1).

This favors the newer models in the population. aging evolution allows us to explore the search space more, instead of zooming in on good models too early.

  • The mutations can be thought of as providing exploration, while the parent selection provides exploitation.
  • The parameter S controls the aggressiveness of the exploitation: S=1 reduces to a type of random search and 2≤S≤P leads to evolution of varying greediness.

2.2. Mutation Variants

  • Two main mutations: the hidden state mutation and the op mutation.
  • A third mutation: identity, is also possible.
  • Only one of these mutations is applied in each cycle, choosing between them at random.
  • The hidden state mutation consists of first making a random choice of whether to modify the normal cell or the reduction cell.
  • Once a cell is chosen, the mutation picks one of the five pairwise combinations uniformly at random.
  • Once the pairwise combination is picked, one of the two elements of the pair is chosen uniformly at random.
  • Constraints: no loops are formed (to keep the feed-forward nature of the convnet).

3. Experimental Results

3.1. Setup

  • Authors first performed architecture search over small models (i.e. small N and F) until 20k models were evaluated.
  • Then, turn them into a full-size, accurate models. The models are enlarged by increasing N and F.
  • Possible ops: none (identity); 3x3, 5x5 and 7x7 separable (sep.) convolutions (convs.); 3x3 average (avg.) pool; 3x3 max pool; 3x3 dilated (dil.) sep. conv.; 1x7 then 7x1 conv.
  • Evolved with P=100, S=25.
  • During the search phase, each model trained for 25 epochs; N=3/F=24, 1 GPU. Each experiment ran on 450 K40 GPUs for 20k models (approx. 7 days).
  • (I believe this scale of experiment can only be done by very large corporations…)
  • Different N and F are also tried actually.

3.2. Comparison with Reinforcement Learning (RL) and Random Search (RS)

Accuracy Against Experiment Time
  • Evolutionary algorithm can get a higher accuracy at early time than Reinforcement Learning (RL) and Random Search (RS).
  • That means evolutionary algorithm can help to reduce the time to search the architecture.
Final augmented models from 5 identical architecture-search experiments for each algorithm, on CIFAR-10
  • The models by Evolutionary Algorithm has lower model cost or higher accuracy compared to RL and RS.

3.3. AmoebaNet-A on CIFAR-10

Left: AmoebaNet-A Architecture, Middle: Normal Cell, Right: Reduction Cell
Test Error on CIFAR-10
  • An evolved architecture with highest validation accuracy is obtained and called AmoebaNet-A.
  • AmoebaNet-A at sizes comparable to a NASNet-A version, showing that AmoebaNet-A is slightly more accurate (when matching model size) or considerably smaller (when matching accuracy).

3.4. AmoebaNet-A on ImageNet

Accuracy on ImageNet
  • The model was evolved on CIFAR-10 and then transferred to ImageNet.
  • When re-trained on ImageNet, AmoebaNet-A performs comparably to the baseline, like NASNet-A, for the same number of parameters (F=190).
  • It outperforms Inception-ResNet, ResNeXt, PolyNet, and DPN.
  • By enlarging it, new state-of-the-art accuracy on ImageNet of 83.9%/96.6% top-1/5 accuracy with 469M parameters is obtained (F=448).

3.5. Additional AmoebaNet

  • AmoebaNet-B was obtained through platform-aware architecture search over a larger version of the NASNet space.
  • AmoebaNet-C is simply a model that showed promise early on in the above experiments by reaching high accuracy with relatively few parameters.
  • AmoebaNet-D was obtained by manually extrapolating the evolutionary process and optimizing the resulting architecture for training speed. It is very efficient: AmoebaNet-D won the Stanford DAWNBench competition for lowest training cost on ImageNet.
  • (If interested, please read the paper for more details.)



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.