Brief Review — GLU Variants Improve Transformer

GEGLU & SwiGLU, Better Activation Functions for Transformer

Sik-Ho Tsang
3 min read · Mar 12, 2023

GLU Variants Improve Transformer,
ReGLU, GEGLU & SwiGLU, by Google
2020 arXiv v1, Over 70 Citations (Sik-Ho Tsang @ Medium)
NLP, NMT, LLM, Language Model, Transformer, GLU, T5


  • Previously, GCNN proposed GLU, which consists of the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
  • GLU variants, i.e. ReGLU, GEGLU & SwiGLU, are proposed to be used in the feed-forward sublayers of the Transformer.
  • This is a short tech report, only a few pages long.

Outline

  1. Preliminaries
  2. Proposed GLU Variants
  3. Results

1. Preliminaries

1.1. Transformer

  • In the Transformer, a rectified-linear (ReLU) activation is applied between the two linear transformations of the position-wise feed-forward network (FFN).
  • In T5, a version with no bias terms is used.
  • Other approaches propose replacing ReLU with GELU or Swish (see the equations after this list).
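For reference, these are the FFN forms referred to above (the original biased FFN, and the T5-style bias-free versions with ReLU, GELU and Swish), restated in the paper's notation:

```
\mathrm{FFN}(x, W_1, W_2, b_1, b_2) = \max(0,\, xW_1 + b_1)\, W_2 + b_2
\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(0,\, xW_1)\, W_2
\mathrm{FFN}_{\mathrm{GELU}}(x, W_1, W_2) = \mathrm{GELU}(xW_1)\, W_2
\mathrm{FFN}_{\mathrm{Swish}}(x, W_1, W_2) = \mathrm{Swish}_1(xW_1)\, W_2
```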

1.2. Gated Linear Units (GLU)

  • GLU is a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is sigmoid-activated. The GCNN authors also suggest omitting the activation, which they call a "bilinear" layer (see the equations below).
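In the paper's notation, the GLU layer and its activation-free "bilinear" counterpart are:

```
\mathrm{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)
\mathrm{Bilinear}(x, W, V, b, c) = (xW + b) \otimes (xV + c)
```

Here, ⊗ denotes the component-wise product and σ the sigmoid function.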

2. Proposed GLU Variants

  • Three GLU variants are proposed: ReGLU, GEGLU and SwiGLU, which replace the sigmoid gate with ReLU, GELU and Swish respectively.
  • They are used in the FFN sublayers, again omitting the bias terms (see the equations after this list).
  • T5 is used as the baseline. The encoder and decoder each consist of 12 layers, with d_model = 768. For the attention layers, h = 12 and d_k = d_v = 64. The FFN layers have hidden size d_ff = 3072.
  • Since the GLU-variant FFNs use three weight matrices instead of two, the hidden size is reduced to d_ff = 2048 so as to keep the parameter and operation counts the same as the base model (2 × 3072 = 3 × 2048).
  • Unlike T5, no dropout is used during pre-training, which is shown to give better results.
  • During fine-tuning, a dropout rate of 0.1 is applied to the layer outputs, feed-forward hidden layers and attention weights. The embedding matrices are kept fixed.
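The three variants replace the sigmoid gate with ReLU, GELU and Swish, and the corresponding FFN layers (bias-free, with an extra output projection W_2) are:

```
\mathrm{ReGLU}(x, W, V, b, c) = \max(0,\, xW + b) \otimes (xV + c)
\mathrm{GEGLU}(x, W, V, b, c) = \mathrm{GELU}(xW + b) \otimes (xV + c)
\mathrm{SwiGLU}(x, W, V, b, c) = \mathrm{Swish}_\beta(xW + b) \otimes (xV + c)

\mathrm{FFN}_{\mathrm{GLU}}(x, W, V, W_2) = (\sigma(xW) \otimes xV)\, W_2
\mathrm{FFN}_{\mathrm{Bilinear}}(x, W, V, W_2) = (xW \otimes xV)\, W_2
\mathrm{FFN}_{\mathrm{ReGLU}}(x, W, V, W_2) = (\max(0,\, xW) \otimes xV)\, W_2
\mathrm{FFN}_{\mathrm{GEGLU}}(x, W, V, W_2) = (\mathrm{GELU}(xW) \otimes xV)\, W_2
\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) = (\mathrm{Swish}_1(xW) \otimes xV)\, W_2
```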

The authors explicitly state that they offer no explanation as to why these architectures seem to work.
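As a concrete illustration of the FFN variants above, here is a minimal PyTorch-style sketch. This is my own sketch, not the authors' code; the class name, defaults, and usage example are assumptions.

```python
# Minimal sketch (assumption, not the authors' implementation) of a
# GLU-variant FFN of the form (activation(x W) * x V) W2, with no biases.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUVariantFFN(nn.Module):
    """Position-wise FFN with a gated hidden layer, as in the GLU variants."""
    def __init__(self, d_model: int = 768, d_ff: int = 2048, activation=F.silu):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)    # gate projection W
        self.v = nn.Linear(d_model, d_ff, bias=False)    # linear projection V
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # output projection W2
        # F.relu -> ReGLU, F.gelu -> GEGLU, F.silu -> SwiGLU (Swish with beta = 1)
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.activation(self.w(x)) * self.v(x))

# Usage: a batch of 2 sequences of length 16 with d_model = 768.
x = torch.randn(2, 16, 768)
ffn_swiglu = GLUVariantFFN(activation=F.silu)
print(ffn_swiglu(x).shape)  # torch.Size([2, 16, 768])
```

Note that d_ff = 2048 here reflects the parameter-matching argument above: three d_model × 2048 matrices cost the same as the baseline's two d_model × 3072 matrices.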

3. Results

Heldout-set log-perplexity for Transformer models on the segment-filling task from T5.

The GEGLU and SwiGLU variants produce the best perplexities.

GLUE Language-Understanding Benchmark (dev).
SuperGLUE Language-Understanding Benchmark (dev).
SQuAD v1.1 (dev).

Across the three downstream benchmarks (GLUE, SuperGLUE and SQuAD), the new GLU variants perform best on most of the tasks.

