Review — gMLP: Pay Attention to MLPs

gMLP, for Both Computer Vision & Natural Language Processing

  • gMLP, based on MLPs with gating, is proposed to replace self-attention in the Transformer, and can perform as well as Transformers in key language and vision applications.


  1. gMLP Design Purpose
  2. Spatial Gating Unit (SGU)
  3. Image Classification Results
  4. NLP Results
  5. aMLP: gMLP with Attention Results

1. gMLP Design Purpose

  • gMLP consists of a stack of L blocks with identical size and structure.
  • Let X be the token representations with sequence length n and dimension d. Each block is defined as:

Z = σ(XU),  Z̃ = s(Z),  Y = Z̃V

(Shortcuts, normalizations and biases are omitted for brevity.)
  • where σ is an activation function such as GELU. U and V define linear projections along the channel dimension — the same as those in the FFNs of Transformers.
  • The key ingredient is s(), a layer which captures spatial interactions. (When s is an identity mapping, the above transformation degenerates to a regular FFN.)
  • The overall block layout is inspired by inverted bottlenecks in MobileNetV2, which define s() as a spatial depthwise convolution.
  • Unlike Transformers, the proposed model does not require position embeddings because such information will be captured in s().
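Putting the above together, one gMLP block can be sketched as follows. This is a minimal NumPy sketch for illustration; the function names and the generic spatial layer `s` passed as a parameter are my own, not from the paper's code, and shortcuts/normalizations are omitted as in the paper's summary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_block(X, U, V, s):
    """One gMLP block: Z = sigma(X U); Z_tilde = s(Z); Y = Z_tilde V."""
    Z = gelu(X @ U)        # channel projection, (n, d) -> (n, d_ffn)
    Z_tilde = s(Z)         # spatial interaction layer s()
    return Z_tilde @ V     # channel projection back, (n, d_ffn) -> (n, d)

# When s is the identity mapping, the block degenerates to a regular FFN:
n, d, d_ffn = 8, 16, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
U = rng.standard_normal((d, d_ffn))
V = rng.standard_normal((d_ffn, d))
Y = gmlp_block(X, U, V, s=lambda Z: Z)
```

With the identity `s`, the output is exactly `gelu(X @ U) @ V`, i.e. a plain FFN, which matches the degenerate case noted above.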

2. Spatial Gating Unit (SGU)

2.1. SGU

gMLP Block
  • To enable cross-token interactions, the layer s() must contain a contraction operation over the spatial dimension. The simplest option is a spatial linear projection:

fW,b(Z) = WZ + b

  • where W is an n×n matrix acting along the sequence dimension and b denotes token-specific biases.
  • Unlike self-attention, where the mixing weights are dynamically generated from the input Z, the spatial projection matrix W is independent of the input representations.
  • s() is formulated as the output of linear gating:

s(Z) = Z ⊙ fW,b(Z)

  • where ⊙ denotes element-wise multiplication. For training stability, it is critical to initialize W to near-zero values and b to ones, so that fW,b(Z) ≈ 1 and s() is close to an identity mapping at the beginning of training.
  • It is further found effective to split Z into two independent parts (Z1, Z2) along the channel dimension, one for the gating function and one for the multiplicative bypass:

s(Z) = Z1 ⊙ fW,b(Z2)
  • Also, normalizing the input to fW,b empirically improves stability of large NLP models.
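The SGU with the channel split, the normalized gating input, and the near-identity initialization can be sketched as below. This is a NumPy illustration under my own naming, not the paper's implementation; the layer norm here stands in for the input normalization to fW,b mentioned above.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sgu(Z, W, b):
    """Spatial Gating Unit: split Z channel-wise, then gate one half
    with a spatial (token-mixing) projection of the normalized other half.
    W is (n, n) and acts along the sequence dimension."""
    Z1, Z2 = np.split(Z, 2, axis=-1)   # (n, d_ffn/2) each
    f = W @ layer_norm(Z2) + b         # fW,b: spatial linear projection
    return Z1 * f                      # element-wise gating

n, d_ffn = 8, 32
Z = np.random.default_rng(0).standard_normal((n, d_ffn))
# Near-zero W and ones b: at initialization fW,b(Z2) is all ones, so the
# SGU reduces to an identity on Z1, which the paper reports is critical
# for training stability.
W = np.zeros((n, n))
b = np.ones((n, 1))
out = sgu(Z, W, b)
```

At this initialization the output equals Z1 exactly, so early training behaves like a regular FFN on half the channels before the spatial weights grow.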

2.2. Connections to Existing Layers

  • The overall formulation of SGU resembles Gated Linear Units (GLUs) [26, 27, 28] as well as earlier works including Highway Networks [29] and LSTM-RNNs.
  • SGU is also related to Squeeze-and-Excite (SE) blocks in SENet [30] in terms of element-wise multiplication. However, different from SE blocks, SGU does not contain cross-channel projections.

3. Image Classification Results

Architecture specifications of gMLP models for vision.
  • The input and output protocols follow ViT-B/16, where the raw image is converted into 16×16 patches at the stem. The same regularization recipe as DeiT is used.
  • Three gMLP variants are designed for image classification.
ImageNet-1K results without extra data.
  • While gMLPs are competitive with vanilla Transformers such as ViT, their performance is behind the best existing ConvNets, e.g. EfficientNet, and hybrid models.
Left: ImageNet accuracy vs model capacity. Right: Spatial projection weights in gMLP-B.
  • Right: Each spatial projection matrix effectively learns to perform convolution with a data-driven, irregular (non-square) kernel shape.

4. NLP Results

MLM validation perplexities of Transformer baselines and four versions of gMLPs.
Visualization of the spatial filters in gMLP learned on the MLM task.
Pretraining and dev-set finetuning results over increased model capacity.
Scaling properties with respect to perplexity and finetuning accuracies.

5. aMLP: gMLP with Attention Results

Hybrid spatial gating unit with a tiny self-attention module.
  • To isolate the effect of self-attention, a hybrid model is designed in which a tiny self-attention block is attached to the gating function of gMLP; it is named aMLP (“a” for attention).
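The hybrid gating can be sketched as below: a single-head attention output is added to the spatial projection inside the gate. This is an assumption-laden NumPy sketch, not the paper's code: the names are mine, the head size here is shrunk from the paper's 64-d for brevity, and projecting the attention values to d_ffn/2 channels is my choice to make the shapes line up with the split gating branch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tiny_attn(X, Wq, Wk, Wv):
    """Single-head self-attention with a small head size."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V                                  # (n, d_ffn/2)

def amlp_sgu(Z, X, W, b, Wq, Wk, Wv):
    """Hybrid SGU: the tiny attention output (computed from the block
    input X) is added to the spatial projection inside the gate."""
    Z1, Z2 = np.split(Z, 2, axis=-1)
    gate = W @ Z2 + b + tiny_attn(X, Wq, Wk, Wv)  # normalization omitted
    return Z1 * gate

n, d, d_ffn, d_attn = 8, 16, 32, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Z = rng.standard_normal((n, d_ffn))
W, b = np.zeros((n, n)), np.ones((n, 1))
Wq = rng.standard_normal((d, d_attn))
Wk = rng.standard_normal((d, d_attn))
Wv = rng.standard_normal((d, d_ffn // 2))
out = amlp_sgu(Z, X, W, b, Wq, Wk, Wv)
```

Note that unlike the pure SGU, the gate here is content-dependent even at initialization, since the attention branch reads the input directly.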
Transferability from MLM pretraining perplexity to finetuning accuracies on GLUE.
Comparing the scaling properties of Transformers, gMLPs and aMLPs (with 64-d, single-head attention).
Model specifications in the full BERT setup.
  • Using the BERT base and large models as the basis, gMLP and aMLP counterparts are constructed.
Pretraining perplexities and dev-set results for finetuning. “ours” indicates models trained using our setup.
  • On finetuning tasks where gMLPs underperform Transformers, the performance gap tends to narrow as the model capacity increases.


[2021 NeurIPS] [gMLP]
Pay Attention to MLPs
