Review — Gaussian Error Linear Units (GELUs)

GELU Outperforms ReLU and ELU in CV, NLP, and Speech Tasks

  • The GELU nonlinearity weights inputs by their value, rather than gating inputs by their sign as ReLUs do.
  • Performance improvements are obtained across all considered computer vision, natural language processing, and speech tasks.


  1. Gaussian Error Linear Unit (GELU)
  2. Experimental Results

GELU (μ=0, σ=1) vs ReLU vs ELU
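  • ReLU deterministically multiplies the input by zero or one, while Dropout stochastically multiplies it by zero. GELU combines the two ideas: the input x is multiplied by a mask m ~ Bernoulli(Φ(x)), where Φ is the standard normal CDF, and the activation is the expected value of that product, x·Φ(x).
  • Since the cumulative distribution function of a Gaussian is often computed with the error function, the Gaussian Error Linear Unit (GELU) is defined as: $\mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x) = x \cdot \tfrac{1}{2}\left[1 + \mathrm{erf}\left(x/\sqrt{2}\right)\right]$
  • The above equation is approximated as: $0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$
  • or: $x\,\sigma(1.702\,x)$, where σ here denotes the logistic sigmoid,
  • if greater feedforward speed is worth the cost of exactness.
  • The CDF of a different Gaussian N(μ, σ²) could be used, but in this paper N(0, 1) is used.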
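
To make the exact form and its two approximations concrete, here is a minimal sketch in plain Python. It assumes the standard definition GELU(x) = x·Φ(x), with Φ the standard normal CDF, and the tanh and sigmoid approximations from the GELU paper. The function names are illustrative choices, not taken from the paper or any library.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5x(1 + tanh[sqrt(2/pi)(x + 0.044715 x^3)])."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def _sigmoid(z: float) -> float:
    """Logistic sigmoid, branched so exp() never overflows."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation: x * sigmoid(1.702 x). Faster, less exact."""
    return x * _sigmoid(1.702 * x)


def relu(x: float) -> float:
    """ReLU for contrast: gates the input by its sign."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print all four activations side by side on a few sample inputs.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(
            f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  "
            f"tanh={gelu_tanh(x):+.5f}  sigmoid={gelu_sigmoid(x):+.5f}  "
            f"relu={relu(x):+.5f}"
        )
```

Unlike ReLU, GELU returns a small negative value for moderately negative inputs, which reflects weighting by value rather than gating by sign. On the sample points above, the tanh form tracks the exact form more closely than the sigmoid form does.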

2.1. MNIST Classification

MNIST Classification Results. Left are the loss curves without Dropout, and right are curves with a Dropout rate of 0.5.
  • A fully connected network with 8 layers, each 128 neurons wide, is trained for 50 epochs with a batch size of 128.

2.2. MNIST Autoencoder

MNIST Autoencoding Results.
  • A deep autoencoder is trained on MNIST in a self-supervised setting.
  • The network has layers of width 1000, 500, 250, 30, 250, 500, 1000, in that order.
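
To get a feel for the size of this autoencoder, the sketch below counts its trainable parameters. It rests on assumptions not stated in the review: the input and output are flattened 28×28 MNIST images (784 values), and every layer is a dense layer with a bias term. The names `HIDDEN_WIDTHS`, `IMAGE_SIZE`, and `dense_param_count` are hypothetical.

```python
# Hidden layer widths as listed in the review.
HIDDEN_WIDTHS = [1000, 500, 250, 30, 250, 500, 1000]

# Assumption: flattened 28x28 MNIST image as both input and reconstruction.
IMAGE_SIZE = 28 * 28


def dense_param_count(widths: list[int]) -> int:
    """Sum weights (fan_in * fan_out) and biases (fan_out) over dense layers."""
    total = 0
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        total += fan_in * fan_out + fan_out
    return total


# Full stack: input -> hidden layers -> output.
ALL_WIDTHS = [IMAGE_SIZE] + HIDDEN_WIDTHS + [IMAGE_SIZE]

if __name__ == "__main__":
    print(dense_param_count(ALL_WIDTHS))
```

Under these assumptions the 30-unit bottleneck contributes very few parameters. Most of the weights sit in the first and last layers, which connect to the 784-dimensional image.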

2.3. TIMIT Frame Classification

  • Phone recognition is evaluated on the TIMIT dataset, which contains recordings of 680 speakers in a noiseless environment.
  • The system is a 5-layer, 2048-neuron wide classifier with 39 output phone labels and a Dropout rate of 0.5.

2.4. CIFAR-10/100 Classification

CIFAR-10 Results.
CIFAR-100 Results.
  • A shallower 9-layer convolutional neural network is trained on CIFAR-10. Each curve is the median of three runs.
  • On CIFAR-100, a Wide Residual Network (WRN) with 40 layers and a widening factor of 4 is trained.
  • The median convergence curves over three runs are shown above.
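
A median convergence curve can be computed by taking the median across runs at each epoch. The sketch below assumes each run is logged as a list of per-epoch values of equal length. The function name `median_curve` and the numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import median


def median_curve(runs: list[list[float]]) -> list[float]:
    """Return the per-epoch median across several equally long runs."""
    # zip(*runs) groups the values from all runs that share an epoch.
    return [median(epoch_values) for epoch_values in zip(*runs)]


if __name__ == "__main__":
    # Hypothetical loss values for three runs over four epochs.
    example_runs = [
        [2.30, 1.60, 1.10, 0.90],
        [2.28, 1.70, 1.05, 0.95],
        [2.35, 1.65, 1.20, 0.85],
    ]
    print(median_curve(example_runs))
```

With three runs, the median at each epoch is simply the middle value, so a single unusually good or bad run does not pull the curve.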



