Review — AS-MLP: An Axial Shifted MLP Architecture for Vision
Pure MLP With Axial Shift, Comparable With Swin Transformer
AS-MLP: An Axial Shifted MLP Architecture for Vision
AS-MLP, by ShanghaiTech University, Tencent, Shanghai Engineering Research Center of Intelligent Vision and Imaging & Shanghai Engineering Research Center of Energy Efficient and Custom AI IC
2022 ICLR, Over 40 Citations (Sik-Ho Tsang @ Medium)
Image Classification, MLP
- A pure MLP architecture is designed, with careful design of the receptive field size and dilation of the blocks.
- It is the first work to apply an MLP-based architecture to object detection.
Outline
- Axial Shifted MLP (AS-MLP)
- AS-MLP Block
- Complexity & Sampling Location Comparisons
- Experimental Results
1. Axial Shifted MLP (AS-MLP)
1.1. Overall Architecture for AS-MLP-T
- Given an RGB image I, AS-MLP first performs a patch partition operation that splits the image into patch tokens with a patch size of 4×4; the combination of all tokens has the size of 48×H/4×W/4 (since 4×4×3 = 48).
- AS-MLP has four stages in total and there are different numbers of AS-MLP blocks in different stages. The final output feature will be used for image classification.
- In Stage 1, a linear embedding is applied to each token, followed by AS-MLP blocks. The output has the dimension of C×H/4×W/4, where C is the number of channels.
- Stage 2 first performs patch merging on the features output from the previous stage, which groups neighboring 2×2 patches to obtain a feature of size 4C×H/8×W/8; a linear layer then reduces the channel dimension to give a feature of size 2C×H/8×W/8, followed by the cascaded AS-MLP blocks (see the shape sketch after this list).
- Stage 3 and Stage 4 have similar structures to Stage 2.
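To make the stage-wise tensor shapes concrete, here is a minimal NumPy sketch of the patch partition and patch merging steps described above (the function names and the toy input size are my own for illustration, not from the paper):

```python
import numpy as np

def patch_partition(img, patch=4):
    # (3, H, W) -> (3*patch*patch, H/patch, W/patch), i.e. 48 x H/4 x W/4 for an RGB input
    c, h, w = img.shape
    x = img.reshape(c, h // patch, patch, w // patch, patch)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * patch * patch, h // patch, w // patch)

def patch_merging(feat):
    # (C, H, W) -> (4C, H/2, W/2); a linear layer would then map the 4C channels down to 2C
    c, h, w = feat.shape
    x = feat.reshape(c, h // 2, 2, w // 2, 2)
    return x.transpose(0, 2, 4, 1, 3).reshape(4 * c, h // 2, w // 2)

img = np.random.randn(3, 224, 224)
tokens = patch_partition(img)           # (48, 56, 56), i.e. 48 x H/4 x W/4
stage1 = np.random.randn(96, 56, 56)    # after the linear embedding to C = 96 (AS-MLP-T)
merged = patch_merging(stage1)          # (384, 28, 28); a linear layer then gives (192, 28, 28)
```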
1.2. AS-MLP Variants
- There are four variants: AS-MLP-Tiny (AS-MLP-T), AS-MLP-Small (AS-MLP-S), AS-MLP-Base (AS-MLP-B), and AS-MLP (mobile); their configurations are summarized in the sketch after this list.
- AS-MLP-T: C=96, the number of blocks in 4 stages = {2, 2, 6, 2}.
- AS-MLP-S: C=96, the number of blocks in 4 stages = {2, 2, 18, 2}.
- AS-MLP-B: C=128, the number of blocks in 4 stages = {2, 2, 18, 2}.
- AS-MLP (mobile): C=64, the number of blocks in 4 stages = {2, 2, 2, 2}.
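As a quick reference, the variants above can be written as a small configuration table (the dictionary keys are my own naming, not from the paper):

```python
# Embedding dimension C and number of AS-MLP blocks per stage for each variant.
AS_MLP_VARIANTS = {
    "AS-MLP-T":      {"C": 96,  "blocks": (2, 2, 6, 2)},
    "AS-MLP-S":      {"C": 96,  "blocks": (2, 2, 18, 2)},
    "AS-MLP-B":      {"C": 128, "blocks": (2, 2, 18, 2)},
    "AS-MLP-mobile": {"C": 64,  "blocks": (2, 2, 2, 2)},
}
```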
2. AS-MLP Block
2.1. AS-MLP Block Structure
- It mainly consists of the Norm layer, Axial Shift operation, MLP, and residual connection.
- In the Axial Shift operation, the channel projection, vertical shift, and horizontal shift are utilized to extract features, where the channel projection maps the feature with a linear layer.
Vertical shift and horizontal shift are responsible for the feature translation along the spatial directions.
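A rough PyTorch-style sketch of one AS-MLP block as described above; the layer names, the MLP ratio, and the use of torch.roll (circular rather than zero padding) are my own simplifications, not the official implementation:

```python
import torch
import torch.nn as nn

class ASMLPBlockSketch(nn.Module):
    """Norm -> Axial Shift (channel projections + vertical/horizontal shifts) -> Norm -> MLP, with residuals."""
    def __init__(self, dim, shift_size=5, mlp_ratio=4):
        super().__init__()
        self.shift_size = shift_size
        self.norm1 = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, dim)     # channel projection before shifting
        self.proj_h = nn.Linear(dim, dim)      # projection after the horizontal shift
        self.proj_v = nn.Linear(dim, dim)      # projection after the vertical shift
        self.proj_out = nn.Linear(dim, dim)    # final channel projection (four projections in total)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def axial_shift(self, x, axis):
        # x: (B, H, W, C); split channels into `shift_size` groups and shift each by a different offset.
        # NOTE: torch.roll wraps around, while the paper uses zero padding at the borders.
        chunks = x.chunk(self.shift_size, dim=-1)
        offsets = range(-(self.shift_size // 2), self.shift_size // 2 + 1)
        return torch.cat([torch.roll(c, shifts=o, dims=axis) for c, o in zip(chunks, offsets)], dim=-1)

    def forward(self, x):                                  # x: (B, H, W, C)
        shortcut = x
        x = self.proj_in(self.norm1(x))
        h = self.proj_h(self.axial_shift(x, axis=2))       # horizontal (width) branch
        v = self.proj_v(self.axial_shift(x, axis=1))       # vertical (height) branch
        x = shortcut + self.proj_out(h + v)                # parallel connection of the two branches
        return x + self.mlp(self.norm2(x))
```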
2.2. Axial-Shift Example
- The input has the dimension of C×h×w. For convenience, h is omitted and C=3, w=5 are used in the figure (b).
- When the shift size is 3, the input feature is split into three parts and they are shifted by {-1, 0, 1} units along the horizontal direction, respectively.
- Zero padding is performed in the gray area.
- After that, the features in the dashed box will be taken out and used for the next channel projection.
In the process of both shifts, since different channel groups are shifted by different units, information from different spatial positions is combined together.
In the next channel projection operation, information from different spatial locations can fully flow and interact.
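A toy NumPy version of the horizontal-shift example above (C = 3, w = 5, shift size 3, zero padding); the variable names are mine, chosen to mirror the figure:

```python
import numpy as np

C, w, shift_size = 3, 5, 3
x = np.arange(C * w).reshape(C, w)                  # toy feature; h is omitted as in the example

pad = shift_size // 2
padded = np.pad(x, ((0, 0), (pad, pad)))            # zero padding in the "gray area"

offsets = range(-pad, pad + 1)                      # {-1, 0, 1} shift units for shift size 3
groups = np.array_split(np.arange(C), shift_size)   # split the channels into three parts

shifted = np.zeros_like(x)
for g, o in zip(groups, offsets):
    # take the width-w window displaced by o units (the "dashed box") for this channel group
    shifted[g] = padded[g, pad + o: pad + o + w]

print(shifted)   # each channel group now carries information from neighbouring columns
```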
3. Complexity & Sampling Location Comparisons
3.1. Complexity Comparisons
- In the Transformer-based architecture, the multi-head self-attention (MSA) is usually adopted.
- In Swin Transformer, Window MSA (W-MSA) is used, with window size of M.
- In AS-MLP, the feature from the previous layer is only Axially Shifted (AS), which does not require any multiplication or addition operations.
- The time cost of the Axial Shift is very low and almost independent of the shift size.
- Each Axial Shift operation in Figure (a) has only four channel projection operations, which have a computation complexity of 4hwC².
- The complexities of MSA, W-MSA and AS are compared in the sketch after this list.
- Therefore, the AS-MLP architecture has slightly less complexity than Swin Transformer.
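Written out, the comparison looks roughly as below; the MSA and W-MSA terms follow the Swin Transformer formulation, and the AS term simply counts the four channel projections mentioned above (treat this as a paraphrase rather than a verbatim copy of the paper's equations):

```latex
\begin{aligned}
\Omega(\mathrm{MSA})          &= 4hwC^2 + 2(hw)^2C \\
\Omega(\mathrm{W\mbox{-}MSA}) &= 4hwC^2 + 2M^2hwC  \\
\Omega(\mathrm{AS})           &= 4hwC^2
\end{aligned}
```

Since 4hwC² < 4hwC² + 2M²hwC, the AS part is cheaper than W-MSA for the same feature size.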
3.2. Sampling Locations
- Unlike MLP-Mixer, AS-MLP pays more attention to the local dependencies through axial shift of features and channel projection.
4. Experimental Results
4.1. ImageNet
AS-MLP outperforms other MLP-based architectures when keeping similar parameters and FLOPs.
- e.g., AS-MLP-S obtains higher top-1 accuracy (83.1%) than Mixer-B/16 (76.4%) and ViP-Medium/7 (82.7%) with fewer parameters.
Furthermore, it achieves competitive performance compared with Transformer-based architectures.
- e.g., AS-MLP-B (83.3%) vs. Swin-B (83.3%).
AS-MLP (mobile) significantly exceeds Swin Transformer in the mobile setting (76.05% vs. 75.11%).
4.2. Ablation Studies
- All ablations are conducted based on the AS-MLP-T.
- (Left) Different Configurations: Three findings as follows:
- ‘Zero padding’ is more suitable for the design of AS-MLP block than other padding methods.
- Increasing the dilation rate slightly reduces the performance of AS-MLP, which is consistent with CNN-based architecture. Dilation is usually used for semantic segmentation rather than image classification.
- When expanding the shift size, the accuracy first increases and then decreases. When the shift size is 9, the network pays too much attention to global dependencies and neglects the extraction of local features, which leads to lower accuracy.
Therefore, the configuration (shift size = 5, zero padding, dilation rate = 1) is used in all experiments, including object detection and semantic segmentation.
- (Right) Connection Type: Parallel connection consistently outperforms serial connection in terms of different shift sizes.
- Here five baselines are designed: i) Global-MLP; ii) Axial-MLP; iii) Window-MLP; iv) shift size (5, 1); v) shift size (1, 5).
- The first three baselines are designed from the perspective of how to use MLP for feature fusion at different positions, and the latter two are designed from the perspective of the axial shift in a single direction.
AS-MLP block with shift size (5, 5) outperforms other baselines.
4.3. Object Detection & Instance Segmentation
AS-MLP achieves comparable performance with Swin Transformer under similar resource constraints.
- To be specific, Cascade Mask R-CNN + Swin-B achieves 51.9 APb with 145M parameters and 982 GFLOPs, and Cascade Mask R-CNN + AS-MLP-B obtains 51.5 APb with 145M parameters and 961 GFLOPs.
4.4. Semantic Segmentation
- With slightly lower FLOPs, AS-MLP-T achieves better result than Swin-T (46.5 vs. 45.8 MS mIoU).
- For the large model, UPerNet + Swin-B has 49.7 MS mIoU with 121M parameters and 1188 GFLOPs, and UPerNet + AS-MLP-B has 49.5 MS mIoU with 121M parameters and 1166 GFLOPs.
This shows the effectiveness of the AS-MLP architecture on downstream tasks.
4.5. Attended Areas
- The first column shows the image from ImageNet, and the second column shows the activation heatmap of the last layer of Swin Transformer (Swin-B).
- The third, fourth, and fifth columns respectively indicate the response after the horizontal shift (AS-MLP (h)), the vertical shift (AS-MLP (v)) and the combination of both in the last layer of AS-MLP (AS-MLP-B).
AS-MLP can better focus on object regions compared to Swin Transformer.
By axially shifting channels of the feature map, AS-MLP is able to obtain the information flow from different axial directions, which captures the local dependencies.
Reference
[2022 ICLR] [AS-MLP]
AS-MLP: An Axial Shifted MLP Architecture for Vision