Review — S²-MLP: Spatial-Shift MLP Architecture for Vision

S²-MLP, Pure MLP Architecture, With Spatial Shift Operation

Sik-Ho Tsang
5 min readDec 21, 2022


S²-MLP: Spatial-Shift MLP Architecture for Vision,
S²-MLP, by Baidu Research,
2022 WACV, Over 60 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT

  • Prior MLP-Mixer uses pure MLP structures but it cannot achieve as outstanding performance as its CNN and ViT counterparts due to the use of token-mixing MLP.
  • In this paper, Spatial-Shift MLP (S²-MLP), a novel pure MLP architecture, is proposed, without the use of token-mixing MLP.
  • Instead, a spatial-shift operation is proposed for achieving the communication between patches. It has a local reception field and is spatial-agnostic. Meanwhile, it is parameter-free and efficient for computation.


  1. S²-MLP
  2. Results

1. S²-MLP

S²-MLP Model Architecture

1.1. Overall Architecture

  • An image, denoted by I of size W×H×3, is uniformly split into w×h patches, P.
  • For each patch Pi, it is unfolded into a vector pi and projected into an embedding vector ei through a fully-connected layer followed by a layer normalization:
  • Then, the embedding vectors go through N stacks of S²-MLP blocks of the same size and structure.
  • Each spatial-shift block contains four fully-connected layers, two layer-normalization layers, two GELU layers, two skip-connections, and the proposed spatial-shift module.
  • At the end, the feature maps output by last block are aggregated into a single feature vector through global average pooling, and fed into a fully-connected layer for predicting the label.

1.2. Spatial-Shift Module

  • For the input tensor, it is split into g=4 thinner tensors along the channel dimension.
  • For the first group of channels, T1, it is shifted along the wide dimension by +1. In parallel, the second group of channels, T2, is shifted along the wide dimension by −1. Similarly, T3 is shifted along the height dimension by +1 and T4 is shifted along the height dimension by −1:

After spatially shifting, each patch absorbs the visual content from its adjoining patches. The spatial-shift operation is parameter-free and makes the communication between different spatial locations feasible.

With N stacks of S²-MLP blocks, the global visual content will be gradually diffused to every patch.

1.3. Relationship with Depth-Wise Convolution

  • In fact, the spatial-shift operation is equal to a depthwise convolution with a fixed and group-specific kernel weights:

1.4. Model Variants

The hyper-parameters, the number of parameters and FLOPs.
  • Due to limited computing resources, authors cannot afford the expensive cost of investigating the large and huge models.
  • They have two settings only: wide and deep.

2. Results

2.1. ImageNet-1K

Results on ImageNet-1K without extra data.

S²-MLP has obtained a comparable accuracy with respect to ViT.

  • S²-MLP cannot achieve as high recognition accuracy as the state-of-the-art Transformer-based vision models such as CaiT, Swin-B and Nest-B.

Compared with MLP-Mixer and FF, S²-MLP-wide achieves a considerably higher accuracy.

Compared with ResMLP-36, S²-MLP-deep achieves higher accuracy.

2.2. CIFAR-10, CIFAR-100, & Car

The performance of transfer learning on CIFAR10 (C10), CIFAR100 (C100) and Car.

Using 224×224-scale images, S²-MLP-deep achieves better performance than ViT-B/16 and ViT-L/16 using 384×384-scale images.

Meanwhile, S²-MLP-deep achieves the comparable performance as DeiT-B with considerably fewer FLOPs.

On CIFAR100 and Car datasets, S²-MLP-deep considerably outperforms ResMLP-36.

2.3. Ablation Studies

Ablation Studies
  • Depthwise (Table 5): Interestingly, the proposed spatial-shift operation equivalent to 3×3 depthwise convolution, with pre-defined kernel achieves comparable accuracy as the 3×3 depthwise convolution with learned weights from the data.
  • Expansion Ratio (Table 6): The top-1 accuracy increases from 86.1% to 87.0% as r increases from 1 to 3. Meanwhile, the number of parameters increases from 29M to 57M accordingly. But the accuracy saturates and even turns worse when r surpasses 3.
  • Hidden Size (Table 7): The top-1 accuracy increases from 79.7% to 87.1% as the hidden size c increases from 192 and 768, and the number of parameters increases from 4.3M to 71M, and FLOPs increases from 0.9G to 20G. Meanwhile, the accuracy saturates when c surpasses 768.
  • Input Scale (Table 9): When W×H increases from 112×112 to 336×336, the top-1 accuracy improves from 80.6% to 88.2%, the number of parameters keeps unchanged since the network does not change, and the FLOPs also increases from 3.5G to 31G.
  • Patch Size (Table 10): When p increases from 16 to 32, it reduces FLOPs from 14B to 3.5B. But it also leads to that the top-1 accuracy drops from 87.1% to 81.0%.
Ablation Studies

Shifting Directions (Table 8): Without shifting, the network performs poorly due to a lack of communications between patches.

  • Meanwhile, comparing (e) with (f), the horizontal shifting is more useful than the vertical shifting.
  • Comparing (c) with (e)/(f), it is observed that shifting in two dimensions (both horizontal and vertical) will be helpful than shifting in a single dimension (horizontal or vertical).

Moreover, comparing (a) and (b), we conclude that shifting along four directions is enough.

Many works investigates into MLP, attempting to replace Transformer.



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.