Review — AS-MLP: An Axial Shifted MLP Architecture for Vision

Pure MLP With Axial Shift, Comparable With Swin Transformer

  • A pure MLP architecture is proposed, in which the receptive field size and dilation of the blocks can be designed.
  • It is the first work to apply an MLP-based architecture to object detection.

Outline

  1. Axial Shifted MLP (AS-MLP)
  2. AS-MLP Block
  3. Complexity & Sampling Location Comparisons
  4. Experimental Results

1. Axial Shifted MLP (AS-MLP)

AS-MLP-T: A tiny version of the overall Axial Shifted MLP (AS-MLP) architecture

1.1. Overall Architecture for AS-MLP-T

  • Given an RGB image I of size 3×H×W, AS-MLP first partitions it into patch tokens with a patch size of 4×4, so the combination of all tokens has the size 48×H/4×W/4.
  • AS-MLP has four stages in total, with a different number of AS-MLP blocks in each stage. The final output feature is used for image classification.
  • In Stage 1, a linear embedding and the AS-MLP blocks are applied to each token. The output has the dimension C×H/4×W/4, where C is the number of channels.
  • Stage 2 first performs patch merging on the features output by the previous stage, grouping neighboring 2×2 patches to obtain a feature of size 4C×H/8×W/8. A linear layer then projects the feature to 2C×H/8×W/8, followed by the cascaded AS-MLP blocks.
  • Stage 3 and Stage 4 have similar structures to Stage 2.
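The shape bookkeeping above can be checked with a small NumPy sketch. This is only an illustration, not the authors' implementation: the function names `patch_partition` and `patch_merging`, the random weights, the 224×224 input size, and the order in which within-patch pixels are flattened are all assumptions made here.

```python
import numpy as np

def patch_partition(x, patch=4):
    """Split a (C, H, W) array into non-overlapping patch x patch tokens.

    Returns an array of shape (C * patch * patch, H // patch, W // patch).
    """
    c, h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    x = x.reshape(c, h // patch, patch, w // patch, patch)
    # Bring the within-patch offsets next to the channel axis, then flatten them.
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(c * patch * patch, h // patch, w // patch)

def patch_merging(feat, weight):
    """Group neighboring 2x2 tokens (C -> 4C), then project 4C -> 2C."""
    merged = patch_partition(feat, patch=2)           # (4C, h/2, w/2)
    # weight has shape (2C, 4C) and is applied at every spatial position.
    return np.einsum("oc,chw->ohw", weight, merged)   # (2C, h/2, w/2)

rng = np.random.default_rng(0)
C = 96                                                # AS-MLP-T channel width
image = rng.standard_normal((3, 224, 224))            # assumed input size

tokens = patch_partition(image)                       # (48, 56, 56)
embed = rng.standard_normal((C, 48)) * 0.02           # hypothetical linear embedding
stage1 = np.einsum("oc,chw->ohw", embed, tokens)      # (C, H/4, W/4)
stage2 = patch_merging(stage1, rng.standard_normal((2 * C, 4 * C)) * 0.02)

print(tokens.shape, stage1.shape, stage2.shape)
```

The AS-MLP blocks themselves are left out here, since they keep the feature shape unchanged within a stage.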

1.2. AS-MLP Variants

  • There are four variants: AS-MLP-Tiny (AS-MLP-T), AS-MLP-Small (AS-MLP-S), AS-MLP-Base (AS-MLP-B), and AS-MLP (mobile).
  • AS-MLP-T: C=96, the number of blocks in 4 stages = {2, 2, 6, 2}.
  • AS-MLP-S: C=96, the number of blocks in 4 stages = {2, 2, 18, 2}.
  • AS-MLP-B: C=128, the number of blocks in 4 stages = {2, 2, 18, 2}.
  • AS-MLP (mobile): C=64, the number of blocks in 4 stages = {2, 2, 2, 2}.
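The four variants can be written down as a small configuration table. The sketch below is illustrative only: `ASMLPConfig`, `VARIANTS`, and `stage_shapes` are hypothetical names, the 224×224 input size is an assumption, and the shapes assume that Stages 3 and 4 halve the resolution and double the channels in the same way Stage 2 does.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ASMLPConfig:
    channels: int   # C, the channel width of Stage 1
    depths: tuple   # number of AS-MLP blocks in each of the 4 stages

VARIANTS = {
    "AS-MLP-T": ASMLPConfig(96, (2, 2, 6, 2)),
    "AS-MLP-S": ASMLPConfig(96, (2, 2, 18, 2)),
    "AS-MLP-B": ASMLPConfig(128, (2, 2, 18, 2)),
    "AS-MLP (mobile)": ASMLPConfig(64, (2, 2, 2, 2)),
}

def stage_shapes(cfg, height=224, width=224):
    """Return the (channels, H, W) output shape of each stage."""
    shapes = []
    for i in range(len(cfg.depths)):
        stride = 4 * 2 ** i   # 4, 8, 16, 32
        shapes.append((cfg.channels * 2 ** i, height // stride, width // stride))
    return shapes

for name, cfg in VARIANTS.items():
    print(name, sum(cfg.depths), "blocks", stage_shapes(cfg))
```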

2. AS-MLP Block

(a) shows the structure of the AS-MLP block; (b) shows the horizontal shift, where the arrows indicate the steps, and the number in each box is the index of the feature.

2.1. AS-MLP Block Structure

  • It mainly consists of the Norm layer, Axial Shift operation, MLP, and residual connection.
  • In the Axial Shift operation, the channel projection, vertical shift, and horizontal shift are utilized to extract features, where the channel projection maps the feature with a linear layer.

2.2. Axial-Shift Example

  • The input has the dimension C×h×w. For convenience, h is omitted and C=3, w=5 are used in Figure (b).
  • When the shift size is 3, the input feature is split into three parts and they are shifted by {-1, 0, 1} units along the horizontal direction, respectively.
  • Zero padding is performed in the gray area.
  • After that, the features in the dashed box will be taken out and used for the next channel projection.
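The example above (C=3, w=5, shift size 3) can be reproduced with a minimal NumPy sketch. This is not the authors' code: `shift_with_zero_pad` and `axial_shift` are hypothetical helpers, splitting channels with `np.array_split` is an assumption about how the groups are formed, and the sign convention (a shift of +1 moves features toward larger indices) is chosen here for illustration.

```python
import numpy as np

def shift_with_zero_pad(x, offset, axis):
    """Shift x by `offset` positions along `axis`, filling vacated cells with zeros."""
    n = x.shape[axis]
    assert abs(offset) < n, "offset must be smaller than the axis length"
    if offset == 0:
        return x.copy()
    out = np.zeros_like(x)
    src = [slice(None)] * x.ndim
    dst = [slice(None)] * x.ndim
    if offset > 0:
        src[axis], dst[axis] = slice(0, n - offset), slice(offset, n)
    else:
        src[axis], dst[axis] = slice(-offset, n), slice(0, n + offset)
    out[tuple(dst)] = x[tuple(src)]
    return out

def axial_shift(x, shift_size=3, axis=-1):
    """Split the channels of x (C, h, w) into `shift_size` groups and shift
    each group by a different offset in {-shift_size//2, ..., +shift_size//2}."""
    pad = shift_size // 2
    groups = np.array_split(np.arange(x.shape[0]), shift_size)
    out = np.zeros_like(x)
    for offset, idx in zip(range(-pad, pad + 1), groups):
        out[idx] = shift_with_zero_pad(x[idx], offset, axis)
    return out

# C=3, h=1, w=5: each row below is one channel.
x = np.arange(1, 16).reshape(3, 1, 5)
print(axial_shift(x, shift_size=3, axis=-1)[:, 0, :])
# [[ 2  3  4  5  0]
#  [ 6  7  8  9 10]
#  [ 0 11 12 13 14]]
```

Passing `axis=1` instead gives the vertical shift. The output keeps the original h×w extent, which corresponds to taking out only the features inside the dashed box.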

3. Complexity & Sampling Location Comparisons

3.1. Complexity Comparisons

  • In the Transformer-based architecture, the multi-head self-attention (MSA) is usually adopted.
  • In Swin Transformer, Window MSA (W-MSA) is used, with window size of M.
  • In AS-MLP, the Axial Shift (AS) operation only shifts the feature from the previous layer, which requires no multiplications or additions.
  • The time cost of Axial Shift is very low and almost independent of the shift size.
  • Each Axial Shift operation in Figure (a) contains only four channel projections, giving a computational complexity of 4hwC².
  • The paper compares the complexities of MSA, W-MSA and AS.
  • W-MSA carries an extra attention term on top of its channel projections, so the AS-MLP architecture has slightly lower complexity than Swin Transformer.
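The complexity equations appeared as an image and did not survive in this text. As far as I recall, they take the form below, with the MSA and W-MSA expressions as given in the Swin Transformer paper (h×w feature map, C channels, window size M). They are worth checking against the original papers.

```latex
\begin{align}
\Omega(\text{MSA})   &= 4hwC^2 + 2(hw)^2 C \\
\Omega(\text{W-MSA}) &= 4hwC^2 + 2M^2 hwC \\
\Omega(\text{AS})    &= 4hwC^2
\end{align}
```

All three share the 4hwC² channel-projection term. MSA adds a term quadratic in hw, W-MSA adds a term linear in hw, and AS adds nothing.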

3.2. Sampling Locations

The different sampling locations of convolution, Swin Transformer, MLP-Mixer, and AS-MLP
  • Unlike MLP-Mixer, AS-MLP pays more attention to the local dependencies through axial shift of features and channel projection.

4. Experimental Results

4.1. ImageNet

The experimental results of different networks on ImageNet-1K
  • For example, AS-MLP-S obtains higher top-1 accuracy (83.1%) with fewer parameters than Mixer-B/16 (76.4%) and ViP-Medium/7 (82.7%).
  • AS-MLP-B matches Swin-B (83.3% vs. 83.3%).
The result comparisons of the mobile setting.

4.2. Ablation Studies

(Left) The impacts of the different configurations of the AS-MLP architecture; d.r. means dilation rate. (Right) The impacts of the different connection types; ‘→’ means serial and ‘+’ means parallel.
  • All ablations are conducted based on the AS-MLP-T.
  • (Left) Different Configurations: there are three findings:
  1. ‘Zero padding’ is more suitable for the design of AS-MLP block than other padding methods.
  2. Increasing the dilation rate slightly reduces the performance of AS-MLP, which is consistent with CNN-based architecture. Dilation is usually used for semantic segmentation rather than image classification.
  3. When expanding the shift size, the accuracy will increase first and then decrease. When shift size is 9, the network pays too much attention to the global dependencies, thus neglecting the extraction of local features, which leads to lower accuracy.
  • (Right) Connection Type: Parallel connection consistently outperforms serial connection across different shift sizes.
The impact of AS-MLP block
  • Here five baselines are designed: i) Global-MLP; ii) Axial-MLP; iii) Window-MLP; iv) shift size (5, 1); v) shift size (1, 5).
  • The first three baselines are designed from the perspective of how to use MLP for feature fusion at different positions, and the latter two are designed from the perspective of the axial shift in a single direction.

4.3. Object Detection & Instance Segmentation

The object detection and instance segmentation results of different backbones with 3x schedule on the COCO val2017 dataset
Visualization Results

4.4. Semantic Segmentation

The semantic segmentation results of different backbones on the ADE20K validation set
  • With slightly lower FLOPs, AS-MLP-T achieves a better result than Swin-T (46.5 vs. 45.8 MS mIoU).
  • For the large model, UPerNet + Swin-B has 49.7 MS mIoU with 121M parameters and 1188 GFLOPs, and UPerNet + AS-MLP-B has 49.5 MS mIoU with 121M parameters and 1166 GFLOPs.
Visualization Results

4.5. Attended Areas

The visualization of features from Swin Transformer and AS-MLP
  • The first column shows the image from ImageNet, and the second column shows the activation heatmap of the last layer of Swin Transformer (Swin-B).
  • The third, fourth, and fifth columns respectively indicate the response after the horizontal shift (AS-MLP (h)), the vertical shift (AS-MLP (v)) and the combination of both in the last layer of AS-MLP (AS-MLP-B).
