Review — S²-MLP: Spatial-Shift MLP Architecture for Vision

S²-MLP, Pure MLP Architecture, With Spatial Shift Operation

5 min readDec 21, 2022

S²-MLP: Spatial-Shift MLP Architecture for Vision,
S²-MLP, by Baidu Research,
2022 WACV, Over 60 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT

Prior MLP-Mixer uses pure MLP structures but it cannot achieve as outstanding performance as its CNN and ViT counterparts due to the use of token-mixing MLP.
In this paper, Spatial-Shift MLP (S²-MLP), a novel pure MLP architecture, is proposed, without the use of token-mixing MLP.
Instead, a spatial-shift operation is proposed for achieving the communication between patches. It has a local reception field and is spatial-agnostic. Meanwhile, it is parameter-free and efficient for computation.

Outline

S²-MLP
Results

1. S²-MLP

1.1. Overall Architecture

An image, denoted by I of size W×H×3, is uniformly split into w×h patches, P.
For each patch Pi, it is unfolded into a vector pi and projected into an embedding vector ei through a fully-connected layer followed by a layer normalization:

Then, the embedding vectors go through N stacks of S²-MLP blocks of the same size and structure.
Each spatial-shift block contains four fully-connected layers, two layer-normalization layers, two GELU layers, two skip-connections, and the proposed spatial-shift module.
At the end, the feature maps output by last block are aggregated into a single feature vector through global average pooling, and fed into a fully-connected layer for predicting the label.

1.2. Spatial-Shift Module

For the input tensor, it is split into g=4 thinner tensors along the channel dimension.
For the first group of channels, T1, it is shifted along the wide dimension by +1. In parallel, the second group of channels, T2, is shifted along the wide dimension by −1. Similarly, T3 is shifted along the height dimension by +1 and T4 is shifted along the height dimension by −1:

After spatially shifting, each patch absorbs the visual content from its adjoining patches. The spatial-shift operation is parameter-free and makes the communication between different spatial locations feasible.
With N stacks of S²-MLP blocks, the global visual content will be gradually diffused to every patch.

1.3. Relationship with Depth-Wise Convolution

In fact, the spatial-shift operation is equal to a depthwise convolution with a fixed and group-specific kernel weights:

1.4. Model Variants

**The hyper-parameters, the number of parameters and FLOPs.**

Due to limited computing resources, authors cannot afford the expensive cost of investigating the large and huge models.
They have two settings only: wide and deep.

2. Results

2.1. ImageNet-1K

**Results on ImageNet-1K without extra data.**

S²-MLP has obtained a comparable accuracy with respect to ViT.

S²-MLP cannot achieve as high recognition accuracy as the state-of-the-art Transformer-based vision models such as CaiT, Swin-B and Nest-B.

Compared with MLP-Mixer and FF, S²-MLP-wide achieves a considerably higher accuracy.
Compared with ResMLP-36, S²-MLP-deep achieves higher accuracy.

2.2. CIFAR-10, CIFAR-100, & Car

**The performance of transfer learning on CIFAR10 (C10), CIFAR100 (C100) and Car.**

Using 224×224-scale images, S²-MLP-deep achieves better performance than ViT-B/16 and ViT-L/16 using 384×384-scale images.
Meanwhile, S²-MLP-deep achieves the comparable performance as DeiT-B with considerably fewer FLOPs.
On CIFAR100 and Car datasets, S²-MLP-deep considerably outperforms ResMLP-36.

2.3. Ablation Studies

Depthwise (Table 5): Interestingly, the proposed spatial-shift operation equivalent to 3×3 depthwise convolution, with pre-defined kernel achieves comparable accuracy as the 3×3 depthwise convolution with learned weights from the data.
Expansion Ratio (Table 6): The top-1 accuracy increases from 86.1% to 87.0% as r increases from 1 to 3. Meanwhile, the number of parameters increases from 29M to 57M accordingly. But the accuracy saturates and even turns worse when r surpasses 3.
Hidden Size (Table 7): The top-1 accuracy increases from 79.7% to 87.1% as the hidden size c increases from 192 and 768, and the number of parameters increases from 4.3M to 71M, and FLOPs increases from 0.9G to 20G. Meanwhile, the accuracy saturates when c surpasses 768.
Input Scale (Table 9): When W×H increases from 112×112 to 336×336, the top-1 accuracy improves from 80.6% to 88.2%, the number of parameters keeps unchanged since the network does not change, and the FLOPs also increases from 3.5G to 31G.
Patch Size (Table 10): When p increases from 16 to 32, it reduces FLOPs from 14B to 3.5B. But it also leads to that the top-1 accuracy drops from 87.1% to 81.0%.

Shifting Directions (Table 8): Without shifting, the network performs poorly due to a lack of communications between patches.

Meanwhile, comparing (e) with (f), the horizontal shifting is more useful than the vertical shifting.
Comparing (c) with (e)/(f), it is observed that shifting in two dimensions (both horizontal and vertical) will be helpful than shifting in a single dimension (horizontal or vertical).