Review — Swish: Searching for Activation Functions

Swish: f(x)=x • sigmoid(βx)

Sik-Ho Tsang
Feb 16, 2022
Swish Activation Function

Searching for Activation Functions
Swish, by Google Brain
2018 ICLRW, Over 1600 Citations (Sik-Ho Tsang @ Medium)
Activation Function, Image Classification, Neural Machine Translation, Natural Language Processing, NLP

  • Search space is defined to search for a better activation function.
  • The search yields Swish, which performs better than ReLU.


  1. Search Space
  2. NAS Search
  3. Experimental Results

1. Search Space

An example activation function structure

A simple search space is designed that composes unary and binary functions to construct the activation function.

  • The activation function is composed of multiple repetitions of the “core unit”, which consists of two inputs, two unary functions, and one binary function.
A Set of Unary and Binary Function Candidates
  • Unary functions take in a single scalar input and return a single scalar output, such as u(x)=x² or u(x)=σ(x).
  • Binary functions take in two scalar inputs and return a single scalar output, such as b(x1, x2)=x1•x2 or b(x1, x2)=exp(-(x1-x2)²).
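To make the core unit concrete, here is a minimal Python sketch of composing two unary functions with one binary function. The candidate lists are only a small illustrative subset, the names `UNARY`, `BINARY`, and `core_unit` are hypothetical, and both unary functions are assumed to receive the same input x (a single core unit with no chaining).

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid of a scalar."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Illustrative subset of candidates, not the paper's full list.
UNARY = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "sigmoid": sigmoid,
}
BINARY = {
    "mul": lambda a, b: a * b,
    "max": lambda a, b: max(a, b),
    "exp_neg_sq_diff": lambda a, b: math.exp(-((a - b) ** 2)),
}

def core_unit(unary1, unary2, binary):
    """Build f(x) = b(u1(x), u2(x)) from named components."""
    u1, u2, b = UNARY[unary1], UNARY[unary2], BINARY[binary]
    return lambda x: b(u1(x), u2(x))

# x * sigmoid(x) falls out of one core unit.
swish_1 = core_unit("identity", "sigmoid", "mul")
print(swish_1(1.0))
```

Under these assumptions, `core_unit("identity", "sigmoid", "max")` gives max(x, σ(x)) in the same way.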

2. NAS Search

The RNN controller used to search over large spaces

An RNN controller, as in NASNet, is used to select the best operations in the large search space.

  • At each timestep, the controller predicts a single component of the activation function. The prediction is fed back to the controller in the next timestep, and this process is repeated until every component of the activation function is predicted. The predicted string is then used to construct the activation function.
  • A “child network” with the candidate activation function is trained on some task, such as image classification on CIFAR-10. After training, the validation accuracy of the child network is recorded and used to update the search algorithm.
  • ResNet-20 is used as the child network on CIFAR-10.
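The sample-then-evaluate loop described above can be sketched as follows. This is a deliberately simplified stand-in: it draws components uniformly at random instead of using an RNN controller, and it never updates the sampler from the reward. The names `sample_candidate`, `search`, and `reward_fn` are hypothetical; `reward_fn` is a placeholder for "train a child network with this activation and return its validation accuracy".

```python
import random

# Illustrative subsets of component names, not the paper's full lists.
UNARY_NAMES = ["identity", "square", "sigmoid"]
BINARY_NAMES = ["mul", "add", "max"]

def sample_candidate(rng):
    """Predict one component per step: unary, unary, then binary."""
    return (
        rng.choice(UNARY_NAMES),
        rng.choice(UNARY_NAMES),
        rng.choice(BINARY_NAMES),
    )

def search(reward_fn, num_trials, seed=0):
    """Return the best (reward, candidate) pair over num_trials samples.

    Assumption: uniform random sampling replaces the RNN controller,
    so the reward only ranks candidates and is not fed back.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        candidate = sample_candidate(rng)
        reward = reward_fn(candidate)
        if best is None or reward > best[0]:
            best = (reward, candidate)
    return best
```

In a real run, `reward_fn` would be the expensive step, since each call trains a child network such as ResNet-20 on CIFAR-10.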
The top novel activation functions found by the searches. Separated into two diagrams for visual clarity.
  • The above figure plots the top performing novel activation functions found by the searches.
  • ResNet, WRN, and DenseNet are used for evaluation.
  • Complicated activation functions consistently underperform simpler activation functions, potentially due to an increased difficulty in optimization.

The best performing activation functions can be represented by 1 or 2 core units.

  • Functions that use division tend to perform poorly because the output explodes when the denominator is near 0.
  • Six of the eight activation functions successfully generalize. Of these six activation functions, all match or outperform ReLU on ResNet-164.
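The point about division can be seen with a few numbers. The snippet below is only an illustration with a hypothetical `divide` helper: it holds the numerator at 1 and shrinks the denominator toward 0, so the output magnitude grows without bound.

```python
def divide(x1, x2):
    """Binary candidate b(x1, x2) = x1 / x2, undefined at x2 = 0."""
    return x1 / x2

# Fixed numerator, denominator approaching zero.
for denom in (1.0, 1e-2, 1e-4, 1e-8):
    print(f"1 / {denom:g} = {divide(1.0, denom):g}")
```

Activations this large in a forward pass also produce large gradients, which is one plausible reason such candidates train poorly.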

Furthermore, two of the discovered activation functions, xσ(βx) and max(x,σ(x)), consistently match or outperform ReLU on all three models.

Finally, xσ(βx), which is called Swish, is chosen, where σ is the sigmoid function and β is either a constant or a trainable parameter.
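As a minimal scalar sketch of this definition (plain Python rather than any framework's built-in, with hypothetical helper names `sigmoid`, `swish`, and `relu`):

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid of a scalar."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

def relu(x):
    """ReLU, for comparison."""
    return max(0.0, x)

# Compare Swish at several beta values against ReLU.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, swish(x, 0.0), swish(x, 1.0), swish(x, 50.0), relu(x))
```

With β=0 the sigmoid is constant at 1/2, so Swish reduces to the linear function x/2. As β grows, the sigmoid approaches a step function and Swish approaches ReLU. In this sense β interpolates between a linear function and ReLU.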

3. Experimental Results

3.1. Swish Benchmarking

The number of models on which Swish outperforms, is equivalent to, or underperforms each baseline activation function
  • Swish is benchmarked against ReLU and a number of recently proposed activation functions on challenging datasets, and it is found that Swish matches or exceeds the baselines on nearly all tasks.

3.2. CIFAR

Left: CIFAR-10 Accuracy, Right: CIFAR-100 Accuracy
  • Swish uses a trainable β, and Swish-1 uses a fixed β=1.
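Making β trainable means it receives a gradient like any other parameter. The sketch below is an assumption-labeled illustration, not taken from the paper: by the chain rule, ∂f/∂β = x²·σ(βx)·(1−σ(βx)), and the code checks that formula against a central finite difference. The function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid of a scalar."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x, beta):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

def swish_grad_beta(x, beta):
    """Analytic derivative of Swish with respect to beta."""
    s = sigmoid(beta * x)
    return x * x * s * (1.0 - s)

# Central finite difference as a sanity check on the formula.
x, beta, h = 1.5, 1.0, 1e-6
numeric = (swish(x, beta + h) - swish(x, beta - h)) / (2.0 * h)
print(numeric, swish_grad_beta(x, beta))
```

In practice a deep learning framework would compute this gradient automatically. Swish-1 simply skips the update and keeps β at 1.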

Swish and Swish-1 consistently match or outperform ReLU on every model for both CIFAR-10 and CIFAR-100.

3.3. ImageNet

Left: Training curves of Mobile NASNet-A on ImageNet, Right: Mobile NASNet-A on ImageNet
Left: Inception-ResNet-v2 on ImageNet, Right: MobileNet on ImageNet
Left: Inception-v3 on ImageNet, Right: Inception-v4 on ImageNet

The above figure and 5 tables show the strong performance of Swish.

3.4. WMT 2014 English→German Machine Translation

BLEU score of a 12 layer Transformer on WMT English→German
  • Swish outperforms or matches the other baselines on machine translation.


