Review — Swish: Searching for Activation Functions

Swish: f(x)=x • sigmoid(βx)

Swish Activation Function
  • A search space is defined in which to search for a better activation function.
  • The search yields Swish, which performs better than ReLU.
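As a quick illustration of the formula above, here is a minimal pure-Python sketch of Swish next to ReLU. The helper names (sigmoid, swish, relu) are chosen for this example and are not taken from the paper's code.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """ReLU baseline: max(0, x)."""
    return max(0.0, x)


# Compare the two activations on a few sample inputs.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, Swish is smooth and takes small negative values for negative inputs instead of clamping them to exactly zero.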


  1. Search Space
  2. NAS Search
  3. Experimental Results

1. Search Space

An example activation function structure
  • The activation function is composed of multiple repetitions of the “core unit”, which consists of two inputs, two unary functions, and one binary function.
A Set of Unary and Binary Function Candidates
  • Unary functions take in a single scalar input and return a single scalar output, such as u(x)=x² or u(x)=σ(x).
  • Binary functions take in two scalar inputs and return a single scalar output, such as b(x1, x2)=x1•x2 or b(x1, x2)=exp(-(x1-x2)²).
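The core unit described above can be sketched as a small function composer. This is only an illustration under stated assumptions: the UNARY and BINARY tables are a small hand-picked subset rather than the paper's full candidate list, both inputs of the unit are taken to be the same scalar x, and all names are made up for the example.

```python
import math
from typing import Callable, Dict

# Illustrative subset of candidates (assumption: not the full set from the paper).
UNARY: Dict[str, Callable[[float], float]] = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),  # fine for moderate |x|
    "tanh": math.tanh,
}
BINARY: Dict[str, Callable[[float, float], float]] = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "max": max,
    "gauss": lambda a, b: math.exp(-((a - b) ** 2)),
}


def core_unit(u1: str, u2: str, b: str) -> Callable[[float], float]:
    """Build x -> b(u1(x), u2(x)) from two unary names and one binary name."""
    f1, f2, g = UNARY[u1], UNARY[u2], BINARY[b]
    return lambda x: g(f1(x), f2(x))


# With identity, sigmoid, and multiplication, the unit is x * sigmoid(x).
swish_1 = core_unit("identity", "sigmoid", "mul")
print(swish_1(1.0))
```

Picking identity, sigmoid, and multiplication recovers Swish with β=1, which shows how a compact search space can still express the function the search ends up finding.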

2. NAS Search

The RNN controller used to search over large spaces
  • At each timestep, the controller predicts a single component of the activation function. The prediction is fed back to the controller in the next timestep, and this process is repeated until every component of the activation function is predicted. The predicted string is then used to construct the activation function.
  • A “child network” with the candidate activation function is trained on some task, such as image classification on CIFAR-10. After training, the validation accuracy of the child network is recorded and used to update the search algorithm.
  • ResNet-20 is used as the child network, trained on CIFAR-10.
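The loop described above (predict components one timestep at a time, build the function, score it with a child network) can be sketched roughly as follows. This is not the paper's controller: a uniform random choice stands in for the RNN policy, no policy update is performed, the candidate tables are a tiny illustrative subset, and evaluate is a hypothetical callback that would train a child network and return its validation accuracy.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

# Tiny illustrative candidate tables (assumption: not the paper's full lists).
UNARY = {
    "identity": lambda x: x,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
}
BINARY = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "max": max,
}
# Order in which the components of one core unit are predicted.
SLOTS = [UNARY, UNARY, BINARY]


def sample_components(rng: random.Random) -> List[str]:
    """Predict one component per timestep.

    A learned controller would condition each choice on `chosen`;
    this placeholder policy ignores it and samples uniformly.
    """
    chosen: List[str] = []
    for options in SLOTS:
        chosen.append(rng.choice(sorted(options)))
    return chosen


def build_activation(components: List[str]) -> Callable[[float], float]:
    """Turn the predicted component names into x -> b(u1(x), u2(x))."""
    u1, u2, b = components
    f1, f2, g = UNARY[u1], UNARY[u2], BINARY[b]
    return lambda x: g(f1(x), f2(x))


def search(
    evaluate: Callable[[Callable[[float], float]], float],
    steps: int,
    seed: int = 0,
) -> Tuple[Optional[List[str]], float]:
    """Sample candidates, score each one, and keep the best seen so far."""
    rng = random.Random(seed)
    best: Optional[List[str]] = None
    best_score = -math.inf
    for _ in range(steps):
        components = sample_components(rng)
        score = evaluate(build_activation(components))
        if score > best_score:
            best, best_score = components, score
    return best, best_score
```

In the setting described above, evaluate would train the child network with the candidate activation and return its validation accuracy, and that score would be used to update the search algorithm rather than only tracking the best candidate as this sketch does.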
The top novel activation functions found by the searches. Separated into two diagrams for visual clarity.
  • The above figure plots the top-performing novel activation functions found by the searches.
  • ResNet, WRN, and DenseNet are used for evaluation.
  • Complicated activation functions consistently underperform simpler activation functions, potentially due to an increased difficulty in optimization.
  • Functions that use division tend to perform poorly because the output explodes when the denominator is near 0.
  • Six of the eight activation functions successfully generalize. Of these six activation functions, all match or outperform ReLU on ResNet-164.
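The remark above about division can be checked numerically. The sketch below uses a hypothetical division-based candidate, sigmoid(x) / x, chosen only for illustration (it is not claimed to be one of the functions the search produced), and compares it with Swish at β=1 for inputs approaching 0.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float) -> float:
    """Swish with beta fixed to 1: x * sigmoid(x)."""
    return x * sigmoid(x)


def ratio_candidate(x: float) -> float:
    """Hypothetical division-based candidate: sigmoid(x) / x (undefined at x = 0)."""
    return sigmoid(x) / x


# As x approaches 0, the ratio grows roughly like 0.5 / x while Swish shrinks to 0.
for x in (1.0, 1e-2, 1e-4, 1e-6):
    print(f"x={x:.0e}  ratio={ratio_candidate(x):.3e}  swish={swish(x):.3e}")
```

Since pre-activations near zero are common during training, a function of this shape produces very large outputs and gradients, which is consistent with the poor results reported for division-based candidates.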

3. Experimental Results

3.1. Swish Benchmarking

The number of models on which Swish outperforms, is equivalent to, or underperforms each baseline activation function
  • Swish is benchmarked against ReLU and a number of recently proposed activation functions on challenging datasets, and it is found that Swish matches or exceeds the baselines on nearly all tasks.

3.2. CIFAR

Left: CIFAR-10 Accuracy, Right: CIFAR-100 Accuracy
  • Swish uses a trainable β, while Swish-1 fixes β=1.
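To make the role of β concrete, the sketch below evaluates Swish at a few values of β and gives the derivative with respect to β that a trainable version would use. With β=0 the function reduces to x/2, and as β grows it approaches ReLU. The function names are assumptions made for this example, and the derivative is worked out by hand from f(x)=x • sigmoid(βx) rather than taken from any library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float) -> float:
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def dswish_dbeta(x: float, beta: float) -> float:
    """Derivative of Swish with respect to beta: x^2 * s * (1 - s), s = sigmoid(beta * x)."""
    s = sigmoid(beta * x)
    return x * x * s * (1.0 - s)


# beta = 0 gives x / 2; beta = 1 is Swish-1; a large beta behaves like ReLU.
for beta in (0.0, 1.0, 50.0):
    values = [round(swish(x, beta), 4) for x in (-3.0, -1.0, 1.0, 3.0)]
    print(f"beta={beta:>4}: {values}")
```

A trainable β therefore lets each layer interpolate between a linear function and a ReLU-like function, whereas Swish-1 fixes one point on that range.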

3.3. ImageNet

Left: Training curves of Mobile NASNet-A on ImageNet, Right: Mobile NASNet-A on ImageNet
Left: Inception-ResNet-v2 on ImageNet, Right: MobileNet on ImageNet
Left: Inception-v3 on ImageNet, Right: Inception-v4 on ImageNet

3.4. WMT 2014 English→German Machine Translation

BLEU score of a 12-layer Transformer on WMT English→German
  • Swish outperforms or matches the other baselines on machine translation.


