Review — ResT: An Efficient Transformer for Visual Recognition

ResT / ResTv1, Design Efficient Multi-Head Self-Attention (EMSA) Using Depth-Wise Convolution

Sik-Ho Tsang
5 min readOct 27, 2022

ResT: An Efficient Transformer for Visual Recognition,
ResT, ResTv1
, by Nanjing University,
2021 NeurIPS, Over 60 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT, Transformer

  • Standard Transformer tackles fixed-resolution images only. ResT is proposed, which has 3 advantages:
  1. A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution, and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads.
  2. Positional encoding is constructed as spatial attention, which is more flexible and can tackle with input images of arbitrary size without interpolation or fine-tune.
  3. Instead of the straightforward tokenization at the beginning of each stage, the patch embedding is designed as a stack of overlapping convolution operation with stride on the token map.


  1. ResT Model Architecture
  2. Efficient Multi-Head Self-Attention (EMSA), Patch Embedding, & Pixel Attention (PA) as Position Encoding
  3. Experimental Results

1. ResT Model Architecture

Examples of backbone blocks. Left: A standard ResNet Bottleneck Block. Middle: A Standard Transformer Block. Right: The proposed Efficient Transformer Block
ResT Model Architecture

ResT shares exactly the same pipeline as ResNet, a stem module applied followed by four stages. Each stage consists of three components, 1) one patch embedding module (or stem module), 2) one positional encoding module, and 3) a set of L efficient Transformer blocks.

  • At the beginning of each stage, the patch embedding module is adopted to reduce the resolution of the input token and expanding the channel dimension.
  • The positional encoding module is fused to restrain position information and strengthen the feature extracting ability of patch embedding.

2. Efficient Multi-Head Self-Attention (EMSA), Patch Embedding, & Pixel Attention (PA) as Position Encoding

2.1. Efficient Multi-Head Self-Attention (EMSA)

  • Original MSA has two shortcomings: (1) The computation scales quadratically with dm or n according to the input token. (2) Each head in MSA only responsible for a subset of embedding dimensions.
Efficient Multi-Head Self-Attention (EMSA)

To compress memory, the 2D input token is reshaped into 3D token, and then is fed to a depth-wise convolution (Conv) operation to reduce the height and width dimension by a factor s.

To restore this diversity ability, Instance Normalization (IN) is added for the dot product matrix (after Softmax).

  • Finally, the output values of each head are then concatenated and linearly projected to form the final output.
  • FFN is added after EMSA for feature transformation and non-linearity.

2.2. Patch Embedding

A simple but effective way, i.e, stacking three 3×3 standard convolution layers (all with padding 1) with stride 2, stride 1, and stride 2, respectively.

2.3. Pixel Attention (PA) as Position Encoding

  • In ViT, a set of learnable parameters are added into the input tokens to encode positions.
  • The length of positions is exactly the same as the input tokens length, which limits the application scenarios.

A simple yet effective spatial attention module calling Pixel Attention (PA) is use to encode positions. Specifically, PA applies a 3×3 depth-wise convolution (with padding 1) operation to get the pixel-wise weight and then scaled by a sigmoid function σ.

2.4. ResT Model Variants

ResT Model Variants for ImageNet-1k.
  • Lite, Small, Base, Large versions are developed.

3. Experimental Results

3.1. ImageNet-1k

Comparison with state-of-the-art backbones on ImageNet-1k benchmark

The proposed ResT achieves significant improvement by a large margin.

  • For example, for smaller models, ResT noticeably surpass the counterpart PVT architectures with similar complexities: +4.5% for ResT-Small (79.6%) over PVT-T (75.1%).
  • For larger models, ResT also significantly outperform the counterpart Swin architectures with similar complexities: +0.3% for ResT-Base (81.6%) over Swin-T (81.3%), and +0.3% for ResT-Large (83.6%) over Swin-S(83.3%) using 224×224 input.
  • Compared with RegNet, the ResT with similar model complexity also achieves better performance.

3.2. MS COCO

Object detection performance on the COCO val2017 split using the RetinaNet framework.
  • For smaller models, ResT-Small is +3.6 box AP higher (40.3 vs. 36.7) than PVT-T with a similar computation cost.
  • For larger models, our ResT-Base surpassing the PVT-S by +1.6 box AP.
Object detection and instance segmentation performance on the COCO val2017 split using Mask RCNN framework.
  • Rest-Small exceeds PVT-T by +2.9 box AP and +2.1 mask AP on the COCO val2017 split.
  • As for larger models, ResT-Base brings consistent +1.2 and +0.9 gains over PVT-S in terms of box AP and mask AP, with slightly larger model size.

3.3. Ablation Study

Left: Comparison of various stem modules on ResT-Lite. Right: Comparison of different reduction strategies of EMSA on ResT-Lite.
  • Left: The stem module in the proposed ResT is more effective than that in PVT and ResNet: +0.92% and +0.64% improvements in terms of Top-1 accuracy, respectively.
  • Right: As can be seen, average pooling achieves slightly worse results (-0.24%) compared with the original Depth-wise Conv2d, while the results of the Max Pooling strategy are the worst.
Left: Ablation study results on the important design elements of EMSA on ResT-Lite. Right: various positional encoding (PE) strategies on ResT-Lite.
  • Left: Without IN, the Top-1 accuracy is degraded by 0.9%. In addition, the performance drops 1.16% without the convolution operation and IN.
  • Right: PA mode significantly surpasses others, achieving 0.84% Top-1 accuracy improvement.

Later, ResTv2 is invented.


[2021 NeurIPS] [ResT]
ResT: An Efficient Transformer for Visual Recognition

1.1. Image Classification

19892021 [ResT] … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP]

My Other Previous Paper Readings



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.