Brief Review — GLU Variants Improve Transformer
- Previously, GCNN proposed GLU, which consists of the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
- GLU variants, i.e. ReGLU, GEGLU & SwiGLU, are proposed to be used in the feed-forward sublayers of the Transformer.
- This is a short technical report, only a few pages long.
1.1. Feed-Forward Network (FFN)
- In the Transformer, the FFN applies a rectified-linear (ReLU) activation between two linear transformations:

FFN(x, W1, W2, b1, b2) = max(0, xW1 + b1)W2 + b2

- In T5, a version with no bias is used (a minimal sketch follows below):

FFN_ReLU(x, W1, W2) = max(xW1, 0)W2
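Below is a minimal PyTorch sketch of this bias-free FFN; this is my own illustration, and the class and variable names are assumptions rather than anything from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNReLU(nn.Module):
    """Bias-free position-wise FFN: FFN_ReLU(x) = max(xW1, 0) W2."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # W1
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.relu(self.w1(x)))
```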
1.2. Gated Linear Units (GLU)
- GLU is a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is sigmoid-activated. The GCNN authors also suggest omitting the activation, which they call a “bilinear” layer (⊗ denotes the component-wise product):

GLU(x, W, V, b, c) = σ(xW + b) ⊗ (xV + c)
Bilinear(x, W, V, b, c) = (xW + b) ⊗ (xV + c)
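A minimal PyTorch sketch of these two layers, again my own illustration; the class name and the `bilinear` flag are assumptions, not from the paper:

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """GLU(x) = sigmoid(xW + b) * (xV + c), component-wise.
    With bilinear=True the sigmoid is dropped, giving the bilinear layer."""
    def __init__(self, d_in: int, d_out: int, bilinear: bool = False):
        super().__init__()
        self.w = nn.Linear(d_in, d_out)  # xW + b (gate branch)
        self.v = nn.Linear(d_in, d_out)  # xV + c (value branch)
        self.bilinear = bilinear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.w(x)
        if not self.bilinear:
            gate = torch.sigmoid(gate)
        return gate * self.v(x)
```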
2. Proposed GLU Variants
- Three GLU variants are proposed:

ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c)
GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)
SwiGLU(x, W, V, b, c) = Swish_β(xW + b) ⊗ (xV + c)

- They are used in the FFN, again with the bias terms omitted (see the sketch after these equations):

FFN_GLU(x, W, V, W2) = (σ(xW) ⊗ xV)W2
FFN_Bilinear(x, W, V, W2) = (xW ⊗ xV)W2
FFN_ReGLU(x, W, V, W2) = (max(0, xW) ⊗ xV)W2
FFN_GEGLU(x, W, V, W2) = (GELU(xW) ⊗ xV)W2
FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV)W2
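Below is a minimal PyTorch sketch of the GLU-variant FFN, written for FFN_SwiGLU; swapping the gate activation yields the other variants. This is my own illustration, and the names are not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNSwiGLU(nn.Module):
    """FFN_SwiGLU(x) = (Swish_1(xW) * xV) W2, all bias-free."""
    def __init__(self, d_model: int = 768, d_ff: int = 2048):
        super().__init__()
        self.w  = nn.Linear(d_model, d_ff, bias=False)   # W  (gate)
        self.v  = nn.Linear(d_model, d_ff, bias=False)   # V  (value)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # W2 (output)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1, matching Swish_1 above.
        # Use F.relu for ReGLU, F.gelu for GEGLU,
        # and torch.sigmoid for the plain GLU variant.
        return self.w2(F.silu(self.w(x)) * self.v(x))
```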
- T5 is used as the baseline. The encoder and decoder each consist of 12 layers, with dmodel=768. For the attention layers, h=12 and dk=dv=64. The FFN layers have hidden size dff=3072.
- As shown in the equations above, the GLU-variant FFN uses three weight matrices instead of two, so its hidden size is reduced to dff=2048 to maintain the same parameter and operation counts as the base model.
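As a quick sanity check of this count-matching (my own arithmetic, not from the paper): the two-matrix FFN has 2·dmodel·dff parameters while the three-matrix version has 3·dmodel·dff, so dff shrinks by a factor of 2/3 (3072 → 2048):

```python
d_model = 768
params_base = 2 * d_model * 3072  # W1, W2 in the standard FFN
params_glu  = 3 * d_model * 2048  # W, V, W2 in the GLU-variant FFN
assert params_base == params_glu == 4_718_592
```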
- Unlike T5, no Dropout is used during pre-training, which is shown to give better results.
- During fine-tuning, a Dropout rate of 0.1 is applied to the layer outputs, feed-forward hidden layers, and attention weights. The embedding matrices are kept fixed.
The authors explicitly say that they offer no explanation as to why these architectures seem to work.