Brief Review — Rectifier Nonlinearities Improve Neural Network Acoustic Models

Leaky ReLU, Converge Slightly Faster Than ReLU

Sik-Ho Tsang
3 min read · Nov 15, 2022

Rectifier Nonlinearities Improve Neural Network Acoustic Models,
Leaky ReLU, by Stanford University,
2013 ICML, Over 6000 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model, Activation Function, ReLU

  • Leaky ReLU outputs a small negative value, rather than zero, when the input is smaller than 0.
  • This is a paper from Andrew Ng's research group at Stanford.

Outline

  1. Leaky ReLU
  2. Results

1. Leaky ReLU

1.1. Tanh

Tanh (Figure from https://blog.csdn.net/qq_29831163/article/details/89887655)
  • With the hyperbolic tangent (tanh) activation shown above, each hidden unit computes h(i) = σ(w(i)ᵀ x),
  • where σ(·) is the tanh function, w(i) is the weight vector for the i-th hidden unit, and x is the input.

However, tanh can suffer from the vanishing gradient problem, since its gradient approaches 0 as the unit saturates.
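A minimal NumPy sketch (illustrative only, with made-up weights and inputs) of a tanh hidden unit, showing how the gradient 1 − tanh(z)² shrinks toward 0 as the pre-activation grows in magnitude:

```python
import numpy as np

# One hidden unit with tanh activation: h = tanh(w.T @ x)
rng = np.random.default_rng(0)
w = rng.normal(size=4)   # weight vector for the hidden unit (illustrative)
x = rng.normal(size=4)   # input vector

z = w @ x                # pre-activation w(i)ᵀ x
h = np.tanh(z)           # hidden unit output

# d(tanh)/dz = 1 - tanh(z)^2 approaches 0 as |z| grows,
# which is the vanishing gradient problem for saturated units.
for pre in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={pre:5.1f}  tanh={np.tanh(pre):.4f}  grad={1 - np.tanh(pre)**2:.6f}")
```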

1.2. ReLU

ReLU (Figure from https://blog.csdn.net/qq_29831163/article/details/89887655)
  • The Rectified Linear Unit (ReLU) is as shown above: h(i) = max(w(i)ᵀ x, 0).
  • When the pre-activation is above 0, the partial derivative is 1, so gradients do not vanish along active units.

However, we might expect learning to be slow whenever the unit is not active, since the gradient there is exactly 0.
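A short NumPy sketch of ReLU and its gradient (illustrative only), showing that inactive units receive no learning signal:

```python
import numpy as np

def relu(z):
    """ReLU: max(z, 0), applied element-wise."""
    return np.maximum(z, 0.0)

def relu_grad(z):
    """Derivative of ReLU w.r.t. its pre-activation: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 1. 1.]  -> zero gradient for inactive units
```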

1.3. Leaky ReLU

Leaky ReLU (Figure Modified from https://blog.csdn.net/qq_29831163/article/details/89887655)
  • Leaky ReLU allows for a small, non-zero gradient when the unit is saturated and not active: h(i) = w(i)ᵀ x if w(i)ᵀ x > 0, and 0.01·w(i)ᵀ x otherwise.

Thus, we might expect learning to be faster.
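A matching NumPy sketch of Leaky ReLU with the paper's 0.01 slope, showing that inactive units still get a small gradient:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z if z > 0, else alpha * z (the paper uses alpha = 0.01)."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Gradient: 1 for active units, alpha (small but non-zero) otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(z))       # [-0.02  -0.005  0.5  2. ]
print(leaky_relu_grad(z))  # [0.01 0.01 1.   1.  ]  -> inactive units still receive a gradient
```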

2. Results

Results for DNN systems in terms of frame-wise error metrics on the development set as well as word error rates (%) on the Hub5 2000 evaluation sets.
  • LVCSR experiments are performed on the 300 hour Switchboard conversational telephone speech corpus (LDC97S62).
  • DNNs with 2, 3, and 4 hidden layers are trained for all nonlinearity types.
  • The output layer is a standard softmax classifier, and cross entropy with no regularization serves as the loss function (a minimal sketch of this setup follows below).
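As a rough illustration of this setup (not the paper's original implementation), here is a minimal PyTorch sketch of such a feed-forward acoustic-model DNN with a softmax/cross-entropy output; the layer sizes, input dimension, and number of output states are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def make_dnn(input_dim, hidden_dim, num_states, num_hidden_layers=3, leaky=True):
    """Feed-forward DNN: several hidden layers with (Leaky) ReLU,
    followed by a linear layer whose logits feed a softmax classifier.
    All dimensions here are placeholders, not the paper's settings."""
    act = nn.LeakyReLU(0.01) if leaky else nn.ReLU()
    layers, dim = [], input_dim
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(dim, hidden_dim), act]
        dim = hidden_dim
    layers.append(nn.Linear(dim, num_states))  # logits; softmax is applied inside the loss
    return nn.Sequential(*layers)

model = make_dnn(input_dim=440, hidden_dim=1024, num_states=3000)
criterion = nn.CrossEntropyLoss()  # cross entropy, no extra regularization

# One training step on random data, for illustration only.
features = torch.randn(32, 440)           # batch of acoustic feature vectors
targets = torch.randint(0, 3000, (32,))   # forced-alignment state labels
loss = criterion(model(features), targets)
loss.backward()
```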

DNNs with ReLU and Leaky ReLU produce 2% absolute reductions in word error rates over Tanh ones.

Both the ReLU and Leaky ReLU networks perform similarly. During training, it is observed that Leaky ReLU DNNs converge slightly faster.

Leaky ReLU is later used in many other domains.

Reference

[2013 ICML] [Leaky ReLU]
Rectifier Nonlinearities Improve Neural Network Acoustic Models

