Brief Review — GLU Variants Improve Transformer
- Previously, GCNN proposed GLU, which consists of the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
- GLU variants, i.e. ReGLU, GEGLU & SwiGLU, are proposed to be used in the feed-forward sublayers of the Transformer.
- This is a short technical report, only a few pages long.
1.1. Feed-Forward Network (FFN)
- In the Transformer, the FFN applies a rectified-linear (ReLU) activation between two linear transformations:

FFN(x, W1, W2, b1, b2) = max(0, xW1 + b1)W2 + b2

- In T5, a version with no bias is used (a minimal sketch follows below):

FFN_ReLU(x, W1, W2) = max(xW1, 0)W2
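Below is a minimal PyTorch sketch of this bias-free FFN; this is my own illustration, and the class and variable names are assumptions rather than anything from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNReLU(nn.Module):
    """Bias-free position-wise FFN: FFN_ReLU(x) = max(xW1, 0) W2."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # W1
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.relu(self.w1(x)))
```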
1.2. Gated Linear Units (GLU)
- GLU is a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is sigmoid-activated. The GCNN authors also suggest omitting the activation, which they call a “bilinear” layer (⊗ denotes the component-wise product):

GLU(x, W, V, b, c) = σ(xW + b) ⊗ (xV + c)
Bilinear(x, W, V, b, c) = (xW + b) ⊗ (xV + c)
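A minimal PyTorch sketch of these two layers, again my own illustration; the class name and the `bilinear` flag are assumptions, not from the paper:

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """GLU(x) = sigmoid(xW + b) * (xV + c), component-wise.
    With bilinear=True the sigmoid is dropped, giving the bilinear layer."""
    def __init__(self, d_in: int, d_out: int, bilinear: bool = False):
        super().__init__()
        self.w = nn.Linear(d_in, d_out)  # xW + b (gate branch)
        self.v = nn.Linear(d_in, d_out)  # xV + c (value branch)
        self.bilinear = bilinear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.w(x)
        if not self.bilinear:
            gate = torch.sigmoid(gate)
        return gate * self.v(x)
```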
2. Proposed GLU Variants
- Three GLU variants are proposed:

ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c)
GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)
SwiGLU(x, W, V, b, c) = Swish_β(xW + b) ⊗ (xV + c)

- They are used in the FFN, again with the bias terms omitted (see the sketch after these equations):

FFN_GLU(x, W, V, W2) = (σ(xW) ⊗ xV)W2
FFN_Bilinear(x, W, V, W2) = (xW ⊗ xV)W2
FFN_ReGLU(x, W, V, W2) = (max(0, xW) ⊗ xV)W2
FFN_GEGLU(x, W, V, W2) = (GELU(xW) ⊗ xV)W2
FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV)W2
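Below is a minimal PyTorch sketch of the GLU-variant FFN, written for FFN_SwiGLU; swapping the gate activation yields the other variants. This is my own illustration, and the names are not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNSwiGLU(nn.Module):
    """FFN_SwiGLU(x) = (Swish_1(xW) * xV) W2, all bias-free."""
    def __init__(self, d_model: int = 768, d_ff: int = 2048):
        super().__init__()
        self.w  = nn.Linear(d_model, d_ff, bias=False)   # W  (gate)
        self.v  = nn.Linear(d_model, d_ff, bias=False)   # V  (value)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # W2 (output)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1, matching Swish_1 above.
        # Use F.relu for ReGLU, F.gelu for GEGLU,
        # and torch.sigmoid for the plain GLU variant.
        return self.w2(F.silu(self.w(x)) * self.v(x))
```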
- T5 is used as the baseline. The encoder and decoder each consist of 12 layers, with dmodel=768. For the attention layers, h=12 and dk=dv=64. The FFN layers have hidden size dff=3072.
- As shown in the equations above, the GLU-variant FFN uses three weight matrices instead of two, so its hidden size is reduced to dff=2048 to maintain the same parameter and operation counts as the base model.
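As a quick sanity check of this count-matching (my own arithmetic, not from the paper): the two-matrix FFN has 2·dmodel·dff parameters while the three-matrix version has 3·dmodel·dff, so dff shrinks by a factor of 2/3 (3072 → 2048):

```python
d_model = 768
params_base = 2 * d_model * 3072  # W1, W2 in the standard FFN
params_glu  = 3 * d_model * 2048  # W, V, W2 in the GLU-variant FFN
assert params_base == params_glu == 4_718_592
```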
- Unlike T5, no Dropout is used during pre-training, which is shown to give better results.
- During fine-tuning, a Dropout rate of 0.1 is applied to the layer outputs, feed-forward hidden layers, and attention weights. The embedding matrices are kept fixed.
The authors explicitly say that they offer no explanation as to why these architectures seem to work.