Review — CycleMLP: A MLP-like Architecture for Dense Prediction

Sik-Ho Tsang
6 min read · Jan 16, 2023


CycleMLP: A MLP-like Architecture for Dense Prediction,
CycleMLP, by The University of Hong Kong, SenseTime Research, and Shanghai AI Laboratory,
2022 ICLR, Over 80 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Object Detection, Semantic Segmentation

  • MLP-Mixer, ResMLP, and gMLP have architectures that are correlated to image size, and are thus infeasible for object detection and segmentation.
  • CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity with respect to image size by using local windows, whereas previous MLPs have O(N²) computation due to their fully spatial connections.

Outline

  1. Cycle FC
  2. CycleMLP
  3. Results

1. Cycle FC

Comparison of the Cycle Fully-Connected Layer (Cycle FC) with Channel FC and Spatial FC.

1.1. Motivations of Cycle FC

  • (a) Channel FC: aggregates features in the channel dimension with spatial size ‘1’. It can handle various input scales but cannot learn spatial context.
  • (b) Spatial FC (MLP-Mixer, ResMLP, & gMLP): has a global receptive field in the spatial dimension. However, its parameter size is fixed, and it has quadratic computational complexity with respect to image scale.
  • (c) Proposed Cycle Fully-Connected Layer (Cycle FC): has linear complexity, the same as Channel FC, and a larger receptive field than Channel FC.
  • (d)-(f) Three examples of different stepsizes: Orange blocks denote the sampled positions. F denotes the output position. For simplicity, the batch dimension is omitted and the feature's width is set to 1.

The motivation behind Cycle FC is to enlarge the receptive field of MLP-like models so that they can cope with downstream dense prediction tasks while maintaining computational efficiency.

1.2. Cycle FC

  • (d) Cycle FC: introduces a receptive field of (SH, SW), where SH and SW are the stepsizes along the height and width dimensions, respectively.
  • The basic Cycle FC operator can be formulated as below (a code sketch also follows this list):
  • where Wmlp of size Cin×Cout and b of size Cout are the parameters of Cycle FC. δi(c) and δj(c) are the spatial offsets along the two axes for the c-th channel, which are defined as below:
  • (d): illustrates the offsets along the two axes when SH=3, i.e., δj(c)=0 and δi(c)={-1, 0, 1, -1, 0, 1, …} for c=0, 1, 2, …, 8.
  • (e): shows that when SH=H, Cycle FC has a global receptive field.
  • (f): shows that when SH=1, there will be no offset along either axis and thus Cycle FC degrades to Channel FC.
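
To make the sampling rule concrete, here is a minimal PyTorch sketch of Cycle FC. It is a naive reimplementation for clarity, not the paper's optimized code; the exact offset schedule and the wrap-around boundary handling below are my simplifications, reconstructed from the examples above. Each output position (i, j) reads channel c of the input at (i + δi(c), j + δj(c)) and all channels share one Cin×Cout projection Wmlp plus bias b.

```python
import torch
import torch.nn as nn


class NaiveCycleFC(nn.Module):
    """Naive Cycle FC sketch. Input and output have shape (B, H, W, C)."""

    def __init__(self, c_in, c_out, s_h=1, s_w=1):
        super().__init__()
        self.s_h, self.s_w = s_h, s_w
        self.weight = nn.Parameter(torch.randn(c_in, c_out) * 0.02)  # Wmlp: Cin x Cout
        self.bias = nn.Parameter(torch.zeros(c_out))                 # b: Cout

    def offsets(self, c):
        # Cyclic offsets; with SH = 3, SW = 1 this yields
        # delta_i(c) = -1, 0, 1, -1, 0, 1, ... and delta_j(c) = 0, as in (d).
        delta_i = (c % self.s_h) - self.s_h // 2
        delta_j = ((c // self.s_h) % self.s_w) - self.s_w // 2
        return delta_i, delta_j

    def forward(self, x):
        b, h, w, c_in = x.shape
        gathered = torch.empty_like(x)
        for c in range(c_in):
            di, dj = self.offsets(c)
            # Position (i, j) reads channel c from (i + delta_i(c), j + delta_j(c));
            # torch.roll gives wrap-around padding at the borders (a simplification).
            gathered[..., c] = torch.roll(x[..., c], shifts=(-di, -dj), dims=(1, 2))
        # All channels share one Cin x Cout projection plus bias.
        return gathered @ self.weight + self.bias
```

Setting s_h = s_w = 1 makes every offset zero and reproduces the degradation to Channel FC described in (f), while s_h = H recovers the global receptive field of (e).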

1.3. Comparison with Channel FC & Spatial FC

Comparison of three types of FC operators.

The larger receptive field in turn brings improvements on dense prediction tasks such as semantic segmentation and object detection, as shown in the above table. Meanwhile, Cycle FC still maintains computational efficiency and flexibility with respect to input resolution: both the FLOPs and the number of parameters are linear in the spatial scale.
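
As a back-of-the-envelope check of that linearity claim, the rough counts below compare the three operators on an H×W×Cin feature map. Biases are ignored, and Spatial FC is simplified to a single HW×HW token-mixing matrix as a stand-in for MLP-Mixer-style spatial MLPs, so treat the numbers as illustrative rather than the paper's exact accounting.

```python
def fc_costs(h, w, c_in, c_out):
    """Rough parameter and multiply-add counts for the three FC operators
    on an H x W x C_in feature map (biases ignored)."""
    n = h * w
    return {
        "channel_fc": {"params": c_in * c_out, "macs": n * c_in * c_out},
        "spatial_fc": {"params": n * n,        "macs": n * n * c_in},      # tied to H*W
        "cycle_fc":   {"params": c_in * c_out, "macs": n * c_in * c_out},  # linear in H*W
    }


# Doubling the resolution quadruples Spatial FC's parameters and multiplies its
# MACs by 16, while Channel FC and Cycle FC keep the same parameter count and
# only scale their MACs linearly (4x).
print(fc_costs(56, 56, 64, 64))
print(fc_costs(112, 112, 64, 64))
```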

1.4. Comparison with MHSA in Transformer

  • Inspired by Cordonnier ICLR'20, a multi-head self-attention (MHSA) layer with Nh heads can be formulated as below, which behaves similarly to a convolution with kernel size K×K:
  • A relationship between Wmlp and Wmhsa can be formulated as follows:
  • The parameter size in Cycle FC is Cin×Cout while Wmhsa is of size K×K×Cin×Cout.

Cycle FC introduces an inductive bias that the weighting matrix in MHSA should be sparse.
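
One way to see this inductive bias is to unfold Cycle FC into the equivalent SH×SW convolution-style kernel: each input channel contributes at exactly one spatial position, so only Cin×Cout of the K×K×Cin×Cout entries are non-zero. The sketch below is illustrative only and reuses the simplified offset schedule from the NaiveCycleFC sketch above.

```python
import torch


def cyclefc_as_sparse_kernel(w_mlp, s_h, s_w):
    """Spread a Cin x Cout Cycle FC weight into a sparse S_H x S_W x Cin x Cout
    convolution-style kernel: one non-zero spatial position per input channel."""
    c_in, c_out = w_mlp.shape
    w_conv = torch.zeros(s_h, s_w, c_in, c_out)
    for c in range(c_in):
        ki = c % s_h              # kernel row for channel c (offset ki - s_h // 2)
        kj = (c // s_h) % s_w     # kernel column for channel c
        w_conv[ki, kj, c, :] = w_mlp[c]
    return w_conv


# A 7x1 Cycle FC with Cin = Cout = 64 carries 64*64 = 4096 parameters, versus
# 7*1*64*64 = 28672 for a dense 7x1 kernel (or the corresponding MHSA weighting).
w = cyclefc_as_sparse_kernel(torch.randn(64, 64), s_h=7, s_w=1)
print(w.shape, int((w != 0).sum()))   # torch.Size([7, 1, 64, 64]) 4096
```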

2. CycleMLP

Comparison of MLP blocks in details.

2.1. Overall Architecture

  • CycleMLP follows MViT and PVTv2 in adopting an overlapping patch embedding module with window size 7 and stride 4. These raw patches are further projected to a higher dimension (denoted as C) by a linear embedding layer. Several Cycle FC blocks are then applied sequentially.
  • The Cycle FC block consists of three parallel Cycle FCs with stepsizes SH×SW of 1×7, 7×1, and 1×1. This design is inspired by the factorization of convolution (Inception-v3) and criss-cross attention (CCNet).
  • Then comes a channel-MLP with two linear layers and a GELU non-linearity in between. A Layer Norm (LN) layer is applied before both the parallel Cycle FC layers and the channel-MLP module, and a residual connection (as in ResNet) is applied after each module; a structural sketch follows this list.
  • At each stage transition, the channel capacity of the processed tokens is expanded while the number of tokens is reduced. There are 4 stages in total.
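
Putting the pieces together, here is a structural sketch of one CycleMLP block under the assumptions above. It reuses the NaiveCycleFC sketch from Section 1.2, and the three branches are merged with a simple sum followed by a linear projection, which is my simplification; treat it as a reading aid rather than the official block.

```python
import torch.nn as nn


class CycleMLPBlock(nn.Module):
    """Structural sketch of a CycleMLP block (tokens kept as (B, H, W, C)).
    Relies on the NaiveCycleFC sketch defined earlier."""

    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.fc_h = NaiveCycleFC(dim, dim, s_h=7, s_w=1)   # 7x1 branch
        self.fc_w = NaiveCycleFC(dim, dim, s_h=1, s_w=7)   # 1x7 branch
        self.fc_c = NaiveCycleFC(dim, dim, s_h=1, s_w=1)   # 1x1 branch (Channel FC)
        self.fuse = nn.Linear(dim, dim)                    # simple merge of the branches
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(                  # two linear layers + GELU
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        y = self.norm1(x)                                  # LN before the parallel Cycle FCs
        x = x + self.fuse(self.fc_h(y) + self.fc_w(y) + self.fc_c(y))  # residual connection
        x = x + self.channel_mlp(self.norm2(x))            # LN + channel-MLP + residual
        return x
```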

2.2. Model Variants

Instantiations of the CycleMLP with varying complexity
  • Two model zoos are built following two widely used Transformer architectures, PVT and Swin, as shown above, where Si, Ci, Ei, and Li represent the stride of the transition, the token channel dimension, the expansion ratio, and the number of blocks, respectively, at Stage i (an illustrative configuration follows this list).
  • Models in the PVT style are named CycleMLP-B1 to CycleMLP-B5, and models in the Swin style are named CycleMLP-T, -S, and -B, representing tiny, small, and base model sizes.
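
For readability, a variant can be written down as a small per-stage configuration. The values below are placeholders only, not the paper's actual settings; consult the table above for the real CycleMLP-B1..B5 and -T/-S/-B numbers.

```python
# Hypothetical configuration illustrating how (S_i, C_i, E_i, L_i) define a variant;
# the numbers are placeholders, not the paper's exact settings.
example_variant = {
    "strides":    (4, 2, 2, 2),         # S_i: transition stride entering stage i
    "channels":   (64, 128, 256, 512),  # C_i: token channel dimension at stage i
    "expansions": (8, 8, 4, 4),         # E_i: channel-MLP expansion ratio at stage i
    "layers":     (2, 2, 4, 2),         # L_i: number of CycleMLP blocks at stage i
}

total_blocks = sum(example_variant["layers"])  # 10 blocks for this placeholder
```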

3. Results

3.1. ImageNet

ImageNet accuracy vs. model capacity.
Left: ImageNet-1K classification for MLP-like models. Right: Comparison with SOTA models on ImageNet-1K without extra data.

The accuracy-FLOPs tradeoff of CycleMLP consistently outperforms existing MLP-like models.

CycleMLP models achieve comparable performance to Swin Transformer.

  • GFNet has similar performance to CycleMLP on ImageNet-1K classification. However, GFNet's architecture is tied to the input resolution, which may hurt its performance on dense prediction tasks.

3.2. Ablation Study

Left: Ablation on three parallel branches, Right: Stepsize ablation.

Left: The top-1 accuracy drops significantly after removing one of the three parallel branches, especially when discarding the 1×7 or 7×1 branch.

Right: CycleMLP achieves the highest mIoU on ADE20K when stepsize is 7.

Resolution adaptability. Left: Absolute top-1 accuracy; Right: Accuracy difference relative to that tested on 224.

Compared with DeiT and GFNet, CycleMLP is more robust when the resolution varies. Furthermore, at higher resolutions, the performance drop of CycleMLP is smaller than that of GFNet.

3.3. Object Detection & Instance Segmentation

Object detection and instance segmentation on COCO val2017

CycleMLP-based RetinaNet consistently surpasses the CNN-based ResNet and ResNeXt and the Transformer-based PVT under similar parameter constraints.

Furthermore, using Mask R-CNN for instance segmentation shows similar trends.

The instance segmentation results of different backbones on the COCO val2017 dataset. The Mask R-CNN framework is employed.

Furthermore, CycleMLP achieves slightly better performance than Swin Transformer.

3.4. Semantic Segmentation

Left: Semantic segmentation on ADE20K val. All models are equipped with Semantic FPN. Right: Effective Receptive Field (ERF)
The semantic segmentation results of different backbones using UPerNet on the ADE20K validation set

CycleMLP outperforms ResNet and PVT significantly with similar parameters.

  • Moreover, compared to Swin Transformer, CycleMLP can obtain comparable or even better performance.
  • Although GFNet achieves similar performance as CycleMLP on ImageNet classification, CycleMLP notably outperforms GFNet on ADE20K.

3.5. Robustness

Robustness on ImageNet-C

Compared with both Transformers (e.g., DeiT and Swin) and existing MLP models (e.g., MLP-Mixer, ResMLP, gMLP), CycleMLP achieves stronger robustness.
