Review — Swish: Searching for Activation Functions

Swish: f(x)=x • sigmoid(βx)

4 min readFeb 16, 2022

Searching for Activation Functions
Swish, by Google Brain
2018 ICLRW, Over 1600 Citations (Sik-Ho Tsang @ Medium)
Activation Function, Image Classification, Neural Machine Translation, Natural Language Processing, NLP

Search space is defined to search for a better activation function.
Finally, Swish is come up with better performance than ReLU.

Outline

Search Space
NAS Search
Experimental Results

1. Search Space

**An example activation function structure**

A simple search space is designed that composes unary and binary functions to construct the activation function.

The activation function is composed of multiple repetitions of the “core unit”, which consists of two inputs, two unary functions, and one binary function.

**A Set of Unary and Binary Function Candidates**

Unary functions take in a single scalar input and return a single scalar output, such u(x)=x² or u(x)=σ(x).
Binary functions take in two scalar inputs and return a single scalar output, such as b(x1, x2)=x1•x2 or b(x1, x2)=exp(-(x1-x2)²).

2. NAS Search

**The RNN controller used to search over large spaces**

RNN controller, as in NASNet, is used to select the optimal operations in the large search space.

At each timestep, the controller predicts a single component of the activation function. The prediction is fed back to the controller in the next timestep, and this process is repeated until every component of the activation function is predicted. The predicted string is then used to construct the activation function.
A “child network” with the candidate activation function is trained on some task, such as image classification on CIFAR-10. After training, the validation accuracy of the child network is recorded and used to update the search algorithm.
ResNet-20 is used as child on CIFAR-10.

**The top novel activation functions found by the searches. Separated into two diagrams for visual clarity.**

The above figure plots the top performing novel activation functions found by the searches.

ResNet, WRN, and DenseNet are used for evaluation.
Complicated activation functions consistently underperform simpler activation functions, potentially due to an increased difficulty in optimization.

The best performing activation functions can be represented by 1 or 2 core units.

Functions that use division tend to perform poorly because the output explodes when the denominator is near 0.
Six of the eight activation functions successfully generalize. Of these six activation functions, all match or outperform ReLU on ResNet-164.

Furthermore, two of the discovered activation functions, x•σ(βx) and max(x,σ(x)), consistently match or outperform ReLU on all three models.
Finally, x•σ(βx), which called Swish, is chosen, where σ is sigmoid function and β and is either a constant or a trainable parameter.

3. Experimental Results

3.1. Swish Benchmarking

**The number of models on which Swish outperforms, is equivalent to, or underperforms each baseline activation function**

Swish is benchmarked against ReLU and a number of recently proposed activation functions on challenging datasets, and it is found that Swish matches or exceeds the baselines on nearly all tasks.