Review — Symmetrical Gaussian Error Linear Units (SGELUs)

Feb 3, 2022

SGELU, by Southeast University Nanjing, and Jiangsu Smartwin Electronics Technology Co., Ltd.
2019 arXiv (Sik-Ho Tsang @ Medium)
Image Classification, Activation Function

SGELU combines the stochastic-regularizer property of the Gaussian Error Linear Unit (GELU) with a symmetrical response to positive and negative inputs.

Outline

1. SGELU Formulation
2. Experimental Results

1. SGELU Formulation

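The activation function of GELU can be represented by:

$$\mathrm{GELU}(x) = \frac{x}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$

where erf() represents the Gaussian error function, that is:

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$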

GELU expresses its nonlinearity as a stochastic regularizer applied to the input: the input is weighted by the Gaussian cumulative distribution function, which is derived from the Gaussian error function. This has given it an advantage over other activation functions, e.g., ReLU and ELU.
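As an illustration of the formulation above, here is a minimal Python sketch of GELU built on the standard library's `math.erf`. The function names `gaussian_cdf` and `gelu` are chosen for this example only and do not come from the paper.

```python
import math


def gaussian_cdf(x: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """GELU: the input scaled by the Gaussian CDF evaluated at the input."""
    return x * gaussian_cdf(x)


# Print a few sample points on both sides of zero.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  cdf={gaussian_cdf(x):.4f}  gelu={gelu(x):+.4f}")
```

Large positive inputs pass through almost unchanged, while large negative inputs are squashed towards zero, which is the behaviour SGELU sets out to change.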
However, most activation functions do not fully exploit negative input values. To address this, a novel Symmetrical Gaussian Error Linear Unit (SGELU) is proposed, which keeps the stochastic regularizer acting on the input while also making use of negative values. It can be represented by:

$$\mathrm{SGELU}(x) = \alpha\, x\, \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)$$

in which α is a hyper-parameter.
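A minimal Python sketch of SGELU follows, assuming the form α·x·erf(x/√2). The function name `sgelu` is illustrative, and the default `alpha=0.1` is simply the value used in the MNIST experiment of this review, not a recommended setting.

```python
import math


def sgelu(x: float, alpha: float = 0.1) -> float:
    """SGELU, assuming the form alpha * x * erf(x / sqrt(2))."""
    return alpha * x * math.erf(x / math.sqrt(2.0))


# The output is the same for x and -x, and never negative for alpha > 0.
for x in (0.5, 1.0, 2.0, 4.0):
    print(f"x=±{x:.1f}  sgelu(+x)={sgelu(x):.5f}  sgelu(-x)={sgelu(-x):.5f}")
```

Because erf is an odd function, the product x·erf(x/√2) is even, so negative inputs produce the same non-zero activation as their positive counterparts instead of being suppressed.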

**Derivatives of SGELU, GELU, ReLU and ELU**

For ReLU, if z (input) is negative, the gradient is zero and thus the weight stops updating.
For ELU, if z is negative, the gradient is positive but small. The weight is updated to a larger value and moves in the positive direction relatively slowly.
For GELU, if z is negative, the gradient is small in magnitude and very close to zero when z is “very” negative, which in most cases pushes the weight to a smaller value. Eventually, the weight stops updating.

SGELU can update its weights symmetrically in both directions, on the positive and the negative half-axis alike. In other words, SGELU is a two-to-one mapping between the input and the output, while the others are one-to-one mappings.
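The gradient behaviour of the four activations can be inspected numerically. The sketch below uses the textbook derivatives of ReLU, ELU (with scale `a=1`, an assumption) and GELU, and differentiates SGELU under the assumed form α·z·erf(z/√2). All function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / SQRT2))


def relu_grad(z: float) -> float:
    """ReLU derivative: zero for every non-positive input."""
    return 1.0 if z > 0 else 0.0


def elu_grad(z: float, a: float = 1.0) -> float:
    """ELU derivative: a * exp(z) on the negative side (a=1 assumed)."""
    return 1.0 if z > 0 else a * math.exp(z)


def gelu_grad(z: float) -> float:
    """GELU derivative: cdf(z) + z * pdf(z)."""
    return normal_cdf(z) + z * normal_pdf(z)


def sgelu_grad(z: float, alpha: float = 0.1) -> float:
    """Derivative of alpha * z * erf(z / sqrt(2)) by the product rule."""
    return alpha * (math.erf(z / SQRT2) + 2.0 * z * normal_pdf(z))


# Tabulate the four derivatives at matching negative and positive inputs.
for z in (-4.0, -1.0, -0.5, 0.5, 1.0, 4.0):
    print(f"z={z:+.1f}  relu={relu_grad(z):.3f}  elu={elu_grad(z):.3f}  "
          f"gelu={gelu_grad(z):+.4f}  sgelu={sgelu_grad(z):+.4f}")
```

Under these assumptions the SGELU derivative is an odd function of z: it has the same magnitude at z and at -z with opposite sign, so it does not vanish for very negative inputs the way the ReLU, ELU and GELU derivatives do.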

2. Experimental Results

2.1. MNIST Classification

A fully connected SGELU neural network with α=0.1 is trained and compared with similar networks using GELU and LiSHT. Each network has 8 layers of 128 neurons and is trained for 50 epochs with a batch size of 128, using the Adam optimizer.
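To make the setup concrete, the following pure-Python sketch records the stated hyper-parameters and runs a single untrained forward pass. It is not the authors' code: the `MnistMlpConfig` class and its field names are hypothetical, "8-layer" is assumed to mean 8 hidden layers, the Gaussian weight initialisation is an assumption, and no training loop or Adam update is implemented.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class MnistMlpConfig:
    input_dim: int = 28 * 28  # flattened 28x28 MNIST image
    hidden_layers: int = 8    # assumption: "8-layer" counts hidden layers
    hidden_width: int = 128
    num_classes: int = 10
    alpha: float = 0.1        # SGELU hyper-parameter from the text
    epochs: int = 50          # recorded only; no training happens here
    batch_size: int = 128
    optimizer: str = "adam"


def layer_shapes(cfg):
    """Return (fan_in, fan_out) for every dense layer, output layer included."""
    dims = [cfg.input_dim] + [cfg.hidden_width] * cfg.hidden_layers + [cfg.num_classes]
    return list(zip(dims[:-1], dims[1:]))


def sgelu(x, alpha):
    """SGELU, assuming the form alpha * x * erf(x / sqrt(2))."""
    return alpha * x * math.erf(x / math.sqrt(2.0))


def forward(x, weights, biases, alpha):
    """SGELU after every hidden layer, raw logits at the output layer."""
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if i < len(weights) - 1:
            x = [sgelu(v, alpha) for v in x]
    return x


cfg = MnistMlpConfig()
shapes = layer_shapes(cfg)
n_params = sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Assumed initialisation: zero-mean Gaussian scaled by 1/sqrt(fan_in), zero biases.
weights = [[[rng.gauss(0.0, 1.0 / math.sqrt(fi)) for _ in range(fi)]
            for _ in range(fo)] for fi, fo in shapes]
biases = [[0.0] * fo for _, fo in shapes]

fake_image = [rng.random() for _ in range(cfg.input_dim)]  # stand-in for a digit
logits = forward(fake_image, weights, biases, cfg.alpha)
print(f"dense layers: {len(shapes)}, parameters: {n_params}, logits: {len(logits)}")
```

Swapping `sgelu` for another activation inside `forward` gives the matching GELU or LiSHT baseline under the same assumptions.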