Review — Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

sMLPNet, On Par With or Even Better Than Swin Transformer

  • An attention-free network called sMLPNet is proposed, building on existing MLP-based vision models.
  • Specifically, the MLP module in the token-mixing step is replaced with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies 1D MLP along the axial directions and the parameters are shared among rows or columns.
  • Through sparse connections and weight sharing, the sMLP module significantly reduces the number of model parameters and the computational complexity, thereby avoiding over-fitting.

Outline

  1. sMLPNet Overall Architecture
  2. Sparse MLP Module
  3. Ablation Studies
  4. SOTA Comparisons

1. sMLPNet Overall Architecture

(a) The overall multi-stage architecture of the sMLPNet; (b) The token mixing module.

1.1. Input

  • Similar to ViT, MLP-Mixer, and the recent Swin Transformer, an input RGB image with spatial resolution H×W is divided into non-overlapping patches by a patch partition module.
  • A small patch size of 4×4 is adopted at the first stage of the network. Each patch is first reshaped into a 48-dimensional vector, and then mapped by a linear layer to a C-dimensional embedding.
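
As a rough illustration, this stem can be written in PyTorch as below; the 224×224 input resolution and the embedding dimension C=96 are illustrative values, not taken from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into non-overlapping 4x4 patches and embed each one."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided conv is equivalent to reshaping each 4x4x3 patch into a
        # 48-d vector and applying a shared linear projection to C dimensions.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):          # x: (B, 3, H, W)
        return self.proj(x)        # (B, C, H/4, W/4)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                # torch.Size([1, 96, 56, 56])
```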

1.2. Backbone

  • The entire network comprises four stages. Except for the first stage, which starts with the linear embedding layer described above, each stage starts with a patch merging layer that halves the spatial resolution along each axis and doubles the channel dimension.
  • The patch merging layer is simply implemented by a linear layer that takes the concatenated features of each 2×2 group of neighboring patches as input and outputs the features of the merged patch (a minimal sketch is given after this list).
  • Then, the new image tokens are passed through a token-mixing module (Figure (b)) and a channel-mixing module. These two modules do not change the data dimensions.
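
A minimal sketch of such a patch merging layer, assuming channels-last tokens of shape (B, H, W, C); the class name and the ordering of the 2×2 neighbors are my own choices, not taken from the paper:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 neighborhood of patches and double the channel dimension."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                          # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                   # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                   # bottom-left
        x2 = x[:, 0::2, 1::2, :]                   # top-right
        x3 = x[:, 1::2, 1::2, :]                   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(x)                   # (B, H/2, W/2, 2C)

out = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
print(out.shape)                                   # torch.Size([1, 28, 28, 192])
```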

1.3. Token-Mixing Module

  • As shown in Figure (b), the token-mixing module combines a lightweight depth-wise convolution (DWConv) for local modeling with the proposed sMLP module for global modeling; the sMLP module is detailed in Section 2.

1.4. Channel-Mixing Module

  • The channel-mixing module is implemented as an MLP, also called a feed-forward network (FFN), in exactly the same way as in MLP-Mixer.
  • The FFN is composed of two linear layers separated by a GELU.
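
A minimal sketch of this channel-mixing FFN, with the expansion ratio α exposed as a parameter (α=3 is the default mentioned in Section 2.4); the class name is my own:

```python
import torch
import torch.nn as nn

class ChannelMixingFFN(nn.Module):
    """Per-token MLP: two linear layers separated by GELU (channel mixing)."""
    def __init__(self, dim, alpha=3):
        super().__init__()
        self.fc1 = nn.Linear(dim, alpha * dim)   # expand channels by alpha
        self.act = nn.GELU()
        self.fc2 = nn.Linear(alpha * dim, dim)   # project back to C channels

    def forward(self, x):                        # x: (..., C), channels last
        return self.fc2(self.act(self.fc1(x)))

y = ChannelMixingFFN(dim=96)(torch.randn(1, 56, 56, 96))
print(y.shape)                                   # torch.Size([1, 56, 56, 96])
```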

2. Sparse MLP Module

2.1. Conceptual Idea

(a) MLP. (b) sMLP.
  • (a) MLP: The token in dark orange interacts with all the other tokens in a single MLP layer.
  • (b) sMLP: The token in dark orange interacts directly only with the tokens in the same row and the same column, giving a cross-shaped interaction pattern.
  • (This cross-shaped interaction is similar to the cross-shaped attention in CSwin Transformer.)

2.2. Implementation

Structure of the proposed sMLP block.
  • Specifically, let Xin of size H×W×C denote the collection of input tokens.
  • In the horizontal mixing path, the data tensor is reshaped into HC×W, and a linear layer with weights W_W of size W×W is applied to each of the HC rows to mix information.
  • A similar operation is applied in the vertical mixing path, where the linear layer is characterized by weights W_H of size H×H. The third path is an identity path that passes the input tokens through unchanged.
  • Finally, the outputs from the three paths are fused together: they are concatenated along the channel dimension, and a linear layer projects the resulting 3C channels back to C.
  • (The cross-shaped interaction implementation is similar to the cross-shaped attention implementation in CSwin Transformer.)
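
Putting the pieces together, a minimal PyTorch sketch of the sMLP module under the description above (channels-last layout; concatenation followed by a point-wise linear layer as the fusion; names are my own, not the authors' code):

```python
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    """Mix tokens along rows and columns only, with weights shared across rows/columns."""
    def __init__(self, H, W, dim):
        super().__init__()
        self.mix_w = nn.Linear(W, W)           # W_W: shared across all H*C rows
        self.mix_h = nn.Linear(H, H)           # W_H: shared across all W*C columns
        self.fuse = nn.Linear(3 * dim, dim)    # fuse identity + horizontal + vertical

    def forward(self, x):                      # x: (B, H, W, C)
        # Horizontal mixing: apply the W x W linear layer along the width axis.
        x_w = self.mix_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)
        # Vertical mixing: apply the H x H linear layer along the height axis.
        x_h = self.mix_h(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Fuse identity, horizontal and vertical paths along the channel dimension.
        return self.fuse(torch.cat([x, x_w, x_h], dim=-1))

y = SparseMLP(H=56, W=56, dim=96)(torch.randn(1, 56, 56, 96))
print(y.shape)                                 # torch.Size([1, 56, 56, 96])
```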

2.3. Complexity Analysis

  • Ignoring bias terms, the complexity of one sMLP module is roughly W² + H² + 3C² parameters and HWC(H + W + 3C) FLOPs, where the W² and H² terms come from the two axial mixing layers and the 3C² term comes from the fusion layer.
  • In contrast, the token-mixing part in MLP-Mixer connects every token with every other token, so its cost grows quadratically with the number of tokens HW: on the order of (HW)² parameters and (HW)²·C FLOPs.
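
A quick back-of-the-envelope comparison using these leading-order formulas, with illustrative values H = W = 56 and C = 96 (chosen for illustration, not quoted from the paper):

```python
# Leading-order counts derived from the structure above (bias terms ignored).
H, W, C = 56, 56, 96

# sMLP: two axial mixing layers plus the 3C -> C fusion layer.
smlp_params = H**2 + W**2 + 3 * C**2
smlp_flops  = H * W * C * (H + W + 3 * C)

# A single dense token-mixing layer over all HW tokens (MLP-Mixer style).
mixer_params = (H * W) ** 2
mixer_flops  = (H * W) ** 2 * C

print(f"sMLP : {smlp_params/1e6:.2f}M params, {smlp_flops/1e9:.2f}G FLOPs")
print(f"dense: {mixer_params/1e6:.2f}M params, {mixer_flops/1e9:.2f}G FLOPs")
```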

2.4. Model Variants

  • Three variants of the model, called sMLPNet-T, sMLPNet-S, and sMLPNet-B, are built to match the model sizes of Swin-T, Swin-S, and Swin-B, respectively. The expansion ratio in the FFN for channel mixing is α=3 by default. The architecture hyper-parameters (depths and channel dimensions per stage) of these models are given in the paper's configuration table.

3. Ablation Studies

Ablation study on the effects of local and global modeling using the tiny model (α=2).
  • The DWConv operation is extremely lightweight. When removing it from sMLPNet (Global only), the model size only changes from 19.2M to 19.1M and the FLOPs only decrease by 0.1B. However, the image recognition accuracy significantly drops to 80.6%.
Ablation study on the effects of sMLP using sMLPNet-B (α=3) as the base model.
  • The authors progressively remove the sMLP module, starting from stage 1 and continuing through stage 4.
Comparison of different fusion methods.
  • Compared to the baseline, which has 48.6M parameters and 10.3B FLOPs, the two alternative fusion methods require far fewer parameters and FLOPs.
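
As a rough sketch of why the fusion choice affects cost, the snippet below contrasts the concatenation-plus-projection baseline with a simple element-wise sum; the sum variant is shown only as an illustrative lighter alternative, while the exact variants tested are those in the paper's table:

```python
import torch
import torch.nn as nn

def fuse_concat(x, x_w, x_h, proj):
    # Baseline fusion: concatenate the three paths along channels (3C) and
    # project back to C with an extra linear layer (adds ~3C^2 parameters).
    return proj(torch.cat([x, x_w, x_h], dim=-1))

def fuse_sum(x, x_w, x_h):
    # Illustrative lighter alternative: element-wise sum, no extra parameters.
    return x + x_w + x_h

x = x_w = x_h = torch.randn(1, 56, 56, 96)
proj = nn.Linear(3 * 96, 96)
print(fuse_concat(x, x_w, x_h, proj).shape, fuse_sum(x, x_w, x_h).shape)
```
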
Ablation study on the design of the branches in the sMLP module.
  • Sequential processing of the horizontal and vertical branches, instead of the parallel design, is also tried for the sparse MLP module.
Ablation study on the effects of multi-stage architecture.
  • Starting from a tiny sMLPNet model (* denotes α=2), all the sMLP blocks in stages 2, 3, and 4 are replaced with normal MLP blocks, while the sMLP blocks in stage 1 are replaced by DWConv, as MLP blocks are too heavy to be used in stage 1. The resulting network is referred to as the multi-stage MLP model.

4. SOTA Comparisons

Comparing the proposed sMLPNet with state-of-the-art vision models.
  • In particular, sMLPNet-T achieves 81.9% top-1 accuracy, which is the highest among the existing models with FLOPs fewer than 5B.
  • The performance of sMLPNet-B is also very impressive. It achieves the same top-1 accuracy as Swin-B, but the model size is 25% smaller (65.9M vs. 88M) and the FLOPs are nearly 10% fewer (14.0B vs. 15.4B).
  • Remarkably, even when the model size grows to nearly 66M, there is no sign of over-fitting, the main problem that plagues MLP-like methods.
