Brief Review — Rectifier Nonlinearities Improve Neural Network Acoustic Models
Leaky ReLU Converges Slightly Faster Than ReLU
Rectifier Nonlinearities Improve Neural Network Acoustic Models,
Leaky ReLU, by Stanford University,
2013 ICML, Over 6000 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model, Activation Function, ReLU
- Leaky ReLU is introduced, which outputs a small negative value, instead of 0, when the input is smaller than 0.
- This is a paper from Andrew Ng's research group.
- Leaky ReLU
1. Leaky ReLU
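- The hyperbolic tangent (tanh) hidden unit is as below:
$$h^{(i)} = \sigma(w^{(i)\top} x) = \tanh(w^{(i)\top} x)$$
- where h(i) is the activation of the i-th hidden unit, σ(·) is the tanh function, w(i) is the weight vector for the i-th hidden unit, and x is the input.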
However, tanh can suffer from the vanishing gradient problem: its derivative approaches 0 when the unit saturates at large positive or negative inputs.
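To make the saturation concrete, here is a minimal sketch that evaluates the tanh derivative, 1 - tanh(z)^2, at a few pre-activation values. The function name `tanh_grad` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math


def tanh_grad(z: float) -> float:
    """Derivative of tanh at pre-activation z, i.e. 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t


if __name__ == "__main__":
    # The derivative is largest at z = 0 and shrinks quickly as |z| grows.
    for z in (0.0, 2.0, 5.0, 10.0):
        print(f"z = {z:5.1f}  tanh'(z) = {tanh_grad(z):.2e}")
```

When several saturated layers are stacked, these small factors multiply during backpropagation, which is why lower layers can receive very little gradient.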
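- The Rectified Linear Unit (ReLU) is defined as below:
$$h^{(i)} = \max(w^{(i)\top} x, 0) = \begin{cases} w^{(i)\top} x & \text{if } w^{(i)\top} x > 0 \\ 0 & \text{otherwise} \end{cases}$$
- When the unit is active (its output is above 0), its partial derivative is 1, so the gradient does not vanish through active units.
However, the gradient is 0 whenever the unit is not active, so we might expect learning to be slow for such units.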
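The behavior of a single ReLU hidden unit can be sketched in plain Python as follows. The helper names (`dot`, `relu_unit`, `relu_unit_grad_w`) and the example vectors are hypothetical, chosen only to show that the gradient with respect to the weights is the input x when the unit is active and all zeros when it is not.

```python
from typing import List


def dot(w: List[float], x: List[float]) -> float:
    """Inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def relu_unit(w: List[float], x: List[float]) -> float:
    """ReLU hidden unit output: max(w^T x, 0)."""
    return max(dot(w, x), 0.0)


def relu_unit_grad_w(w: List[float], x: List[float]) -> List[float]:
    """Gradient of the unit output w.r.t. w: x if active, zeros otherwise."""
    active = dot(w, x) > 0
    return [xi if active else 0.0 for xi in x]


if __name__ == "__main__":
    x = [3.0, 1.0]
    for w in ([1.0, -2.0], [-1.0, -2.0]):  # first unit active, second inactive
        print(w, relu_unit(w, x), relu_unit_grad_w(w, x))
```

An inactive unit receives no weight update from this input, which is the slow-learning concern raised above.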
1.3. Leaky ReLU
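- Leaky ReLU allows for a small, non-zero gradient when the unit is saturated and not active:
$$h^{(i)} = \begin{cases} w^{(i)\top} x & \text{if } w^{(i)\top} x > 0 \\ 0.01\, w^{(i)\top} x & \text{otherwise} \end{cases}$$
Thus, we might expect learning to be faster than with ReLU, since inactive units still receive a gradient.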
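As a rough illustration, the sketch below compares the ReLU and Leaky ReLU gradients on the pre-activation z = w(i)ᵀx. The negative slope of 0.01 is the value used in the paper; the function names and the sample inputs are assumptions made for this example.

```python
def relu_grad(z: float) -> float:
    """Derivative of ReLU w.r.t. its pre-activation z."""
    return 1.0 if z > 0 else 0.0


def leaky_relu(z: float, slope: float = 0.01) -> float:
    """Leaky ReLU: z when z > 0, otherwise slope * z."""
    return z if z > 0 else slope * z


def leaky_relu_grad(z: float, slope: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 when z > 0, otherwise the small slope."""
    return 1.0 if z > 0 else slope


if __name__ == "__main__":
    for z in (2.0, -3.0):
        print(f"z = {z:+.1f}  leaky_relu = {leaky_relu(z):+.3f}  "
              f"relu_grad = {relu_grad(z)}  leaky_relu_grad = {leaky_relu_grad(z)}")
```

For a negative input the ReLU gradient is exactly 0, while the Leaky ReLU gradient stays at the small slope, so the unit can still be updated.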
- Large vocabulary continuous speech recognition (LVCSR) experiments are performed on the 300-hour Switchboard conversational telephone speech corpus (LDC97S62).
- DNNs with 2, 3, and 4 hidden layers are trained for all nonlinearity types.
- The output layer is a standard softmax classifier, and cross entropy with no regularization serves as the loss function.
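For readers unfamiliar with the output layer described above, here is a minimal sketch of a softmax classifier with a cross-entropy loss for one training example. It is a generic textbook formulation in plain Python, not the paper's training code; the function names and the toy logits are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Cross-entropy loss for one example: -log of the target class probability."""
    return -math.log(softmax(logits)[target])


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]  # toy output-layer pre-activations
    print(softmax(logits))
    print(cross_entropy(logits, target=0))
```

No regularization term is added to the loss here, matching the setup stated above.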
DNNs with ReLU and Leaky ReLU produce 2% absolute reductions in word error rate over the tanh DNNs.
The ReLU and Leaky ReLU networks perform similarly. During training, it is observed that Leaky ReLU DNNs converge slightly faster.
Leaky ReLU was later used in many other domains.
[2013 ICML] [Leaky ReLU]
Rectifier Nonlinearities Improve Neural Network Acoustic Models