Brief Review — GLU Variants Improve Transformer
GEGLU & SwiGLU, Better Activation Functions for Transformer
GLU Variants Improve Transformer,
ReGLU, GEGLU & SwiGLU, by Google
2020 arXiv v1, Over 70 Citations (Sik-Ho Tsang @ Medium)
NLP, NMT, LLM, Language Model, Transformer, GLU, T5
- Previously, GCNN proposed GLU, which consists of the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
- GLU variants, i.e. ReGLU, GEGLU & SwiGLU, are proposed to be used in the feed-forward sublayers of the Transformer.
- This is a short tech report of only a few pages.
- Proposed GLU Variants
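- In the Transformer, the position-wise feed-forward network (FFN) applies a rectified-linear (ReLU) activation function between two linear transformations:
  $$\mathrm{FFN}(x, W_1, W_2, b_1, b_2) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$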
- In T5, a version with no bias is used:
  $$\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1,\, 0)\,W_2$$
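As a rough illustration of the bias-free FFN described above, here is a minimal sketch in plain Python. The helper names (`matmul`, `relu`, `ffn_relu`) and the toy weights are assumptions made for this example only; they do not come from the paper or from the T5 code base.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    """Apply max(0, v) element-wise to a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


def ffn_relu(x, w1, w2):
    """Bias-free FFN: max(x W1, 0) W2 (illustrative, not the T5 code)."""
    return matmul(relu(matmul(x, w1)), w2)


# Toy example: one token, d_model = 2, d_ff = 3 (made-up numbers).
x = [[1.0, 2.0]]
w1 = [[1.0, 0.0, -3.0],
      [0.5, 1.0, 1.0]]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]

# x W1 = [2, 2, -1]; ReLU zeroes the last unit before the second projection.
print(ffn_relu(x, w1, w2))  # [[2.0, 2.0]]
```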
1.2. Gated Linear Units (GLU)
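- GLU is a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is sigmoid-activated. The GCNN authors also suggest omitting the activation, which they call a “bilinear” layer:
  $$\mathrm{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$
  $$\mathrm{Bilinear}(x, W, V, b, c) = (xW + b) \otimes (xV + c)$$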
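The gating idea can be sketched in a few lines of plain Python. This is only an illustration for a single input vector; the function names (`linear`, `glu`, `bilinear`) and the toy weights are assumptions for this example, not an API from the paper.

```python
import math


def linear(x, w, b):
    """Compute x W + b for a single input vector x (W stored as rows)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def glu(x, w, v, b, c):
    """GLU: sigmoid(x W + b) multiplied component-wise with (x V + c)."""
    gate = [sigmoid(g) for g in linear(x, w, b)]
    value = linear(x, v, c)
    return [g * u for g, u in zip(gate, value)]


def bilinear(x, w, v, b, c):
    """Bilinear layer: same as GLU but with the activation omitted."""
    return [g * u for g, u in zip(linear(x, w, b), linear(x, v, c))]


# Toy weights (made-up numbers): x W = [0, 1], x V = [1, 2].
x = [1.0, 2.0]
w = [[0.0, 1.0], [0.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
c = [0.0, 0.0]

glu_out = glu(x, w, v, b, c)            # first entry is sigmoid(0) * 1 = 0.5
bilinear_out = bilinear(x, w, v, b, c)  # [0 * 1, 1 * 2]
print(glu_out, bilinear_out)
```

The sigmoid branch acts as a soft gate in (0, 1) on the other projection, while the bilinear layer lets the first projection scale the second without any squashing.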
2. Proposed GLU Variants
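- Three GLU variants are proposed, each replacing the sigmoid with a different activation function:
  $$\mathrm{ReGLU}(x, W, V, b, c) = \max(0,\, xW + b) \otimes (xV + c)$$
  $$\mathrm{GEGLU}(x, W, V, b, c) = \mathrm{GELU}(xW + b) \otimes (xV + c)$$
  $$\mathrm{SwiGLU}(x, W, V, b, c, \beta) = \mathrm{Swish}_\beta(xW + b) \otimes (xV + c)$$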
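- These variants replace the first linear transformation and activation in the FFN. Again, the bias terms are omitted:
  $$\mathrm{FFN}_{\mathrm{ReGLU}}(x, W, V, W_2) = (\max(0,\, xW) \otimes xV)\,W_2$$
  $$\mathrm{FFN}_{\mathrm{GEGLU}}(x, W, V, W_2) = (\mathrm{GELU}(xW) \otimes xV)\,W_2$$
  $$\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) = (\mathrm{Swish}_1(xW) \otimes xV)\,W_2$$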
- The T5 base model is used as the baseline. The encoder and decoder each consist of 12 layers, with d_model=768. For the attention layers, h=12 and d_k=d_v=64. The FFN layers have hidden size d_ff=3072.
- Because the GLU-variant FFNs use three weight matrices instead of two, their hidden size is reduced to d_ff=2048, so as to maintain the same parameter and operation counts as the base model.
- Unlike T5, no dropout is used during pre-training, which was found to give better results.
- During fine-tuning, a dropout rate of 0.1 is applied to the layer outputs, feed-forward hidden layers and attention weights. The embedding matrices are kept fixed.
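The gated FFN variants and the parameter-matching argument above can be sketched in plain Python. This is a toy, single-vector illustration: the helper names (`matvec`, `ffn_gated`, `ffn_param_count`) and the small weights are assumptions for this example, and GELU is written in its exact erf form, which may differ from the approximation used in a given code base.

```python
import math


def matvec(x, w):
    """Compute x W for a single input vector x (W stored as rows)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]


def relu(v):
    return max(0.0, v)


def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def swish(v, beta=1.0):
    """Swish_beta(v) = v * sigmoid(beta * v)."""
    return v / (1.0 + math.exp(-beta * v))


def ffn_gated(x, w, v, w2, activation):
    """Bias-free gated FFN: (activation(x W) * x V) W2."""
    gate = [activation(g) for g in matvec(x, w)]
    hidden = [g * u for g, u in zip(gate, matvec(x, v))]
    return matvec(hidden, w2)


def ffn_reglu(x, w, v, w2):
    return ffn_gated(x, w, v, w2, relu)


def ffn_geglu(x, w, v, w2):
    return ffn_gated(x, w, v, w2, gelu)


def ffn_swiglu(x, w, v, w2):
    return ffn_gated(x, w, v, w2, swish)  # beta fixed to 1


def ffn_param_count(d_model, d_ff, n_matrices):
    """Weights in a bias-free FFN built from d_model x d_ff matrices."""
    return n_matrices * d_model * d_ff


# Toy weights (made-up numbers): x W = [1, -1], x V = [2, -1].
x = [1.0, -1.0]
w = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [0.0, 1.0]]
w2 = [[1.0], [1.0]]

reglu_out = ffn_reglu(x, w, v, w2)    # [2.0]: the negative gate unit is zeroed
geglu_out = ffn_geglu(x, w, v, w2)
swiglu_out = ffn_swiglu(x, w, v, w2)

# Two matrices at d_ff=3072 versus three matrices at d_ff=2048.
baseline_params = ffn_param_count(768, 3072, 2)
gated_params = ffn_param_count(768, 2048, 3)
print(baseline_params, gated_params)  # 4718592 4718592
```

With d_model=768, both configurations come to the same number of FFN weights per layer, which is why the hidden size is cut by a factor of 2/3 when the third matrix is added.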
The authors explicitly state that they offer no explanation as to why these architectures seem to work.
The GEGLU and SwiGLU variants produce the best perplexities.
Across the 3 downstream benchmarks, the new GLU variants perform best on most of the tasks.