Review — Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Pale-Shaped Window for Self-Attention

  • Conventional global self-attention has memory and computation costs that grow quadratically with the number of tokens, while works that constrain self-attention to local windows suffer from a small receptive field.
  • In this paper, Pale-Shaped self-Attention (PS-Attention) is proposed, which performs self-attention within a pale-shaped region, and the Pale Transformer backbone is built upon it to address both problems.

Outline

  1. Pale-Shaped Self-Attention
  2. Pale Transformer
  3. Experimental Results
  4. Ablation Studies

1. Pale-Shaped Self-Attention

Illustration of different self-attention mechanisms in Transformer backbones

1.1. Prior Arts

  • (a) Conventional global self-attention: used in ViT.
  • (b) Shifted window self-attention, used in Swin Transformer; shuffled window self-attention, used in Shuffle Transformer; messenger tokens, used in MSG-Transformer.
  • (c) Axial self-attention: used in Axial DeepLab.
  • (d) Cross-shaped window self-attention: used in CSWin Transformer.

1.2. Pale-Shaped Self-Attention (PS-Attention)

  • (e) Proposed pale-shaped self-attention: computes self-attention within a pale-shaped region (abbreviated as a pale).
  • One pale contains sr interlaced rows and sc interlaced columns of the h×w feature map, covering a region of (sr×w + sc×h − sr×sc) tokens.
  • The number of pales is N = h/sr = w/sc.
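As a sanity check on the token count above, a small script (my own sketch, not from the paper) can enumerate the tokens covered by one pale on an h×w grid and compare against the closed form:

```python
# Count the tokens covered by one pale: sr interlaced rows plus
# sc interlaced columns of an h x w feature map.
def pale_token_count(h, w, sr, sc):
    # Interlaced selection: every (h // sr)-th row and every (w // sc)-th column.
    rows = set(range(0, h, h // sr))  # sr rows in total
    cols = set(range(0, w, w // sc))  # sc columns in total
    # A token (i, j) belongs to the pale if its row OR its column is selected.
    tokens = {(i, j) for i in range(h) for j in range(w)
              if i in rows or j in cols}
    return len(tokens)

h, w, sr, sc = 56, 56, 7, 7
count = pale_token_count(h, w, sr, sc)
# Closed form from the text: sr*w + sc*h - sr*sc
assert count == sr * w + sc * h - sr * sc
print(count)  # 735
```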

1.3. Efficient Parallel Implementation

Efficient Parallel Implementation of Pale-Shaped Self-Attention (PS-Attention)
  • The input feature X of size h×w×c is divided into two independent parts Xr and Xc, each of size h×w×(c/2).
  • Then, self-attention is conducted within each row-wise and each column-wise token group, respectively.
  • Following CvT, three separable convolution layers ΦQ, ΦK, and ΦV are used to generate the query, key, and value for each branch:
    Yr = MSA(ΦQ(Xr), ΦK(Xr), ΦV(Xr))
    Yc = MSA(ΦQ(Xc), ΦK(Xc), ΦV(Xc))
    where MSA indicates the Multi-head Self-Attention.
  • Finally, the outputs of row-wise and column-wise attention are concatenated along the channel dimension, resulting in the final output Y of size h×w×c:
    Y = Concat(Yr, Yc)
  • Compared to the vanilla implementation of PS-Attention within the whole pale, such a parallel mechanism has a lower computation complexity.
  • The standard global self-attention has a computational complexity of:
    Ω(Global) = 4hwc² + 2(hw)²c
  • The proposed PS-Attention under the parallel implementation has a computational complexity of:
    Ω(PS) = 4hwc² + (sr×w + sc×h)×hwc
    which, for a fixed pale size, grows only linearly rather than quadratically with the number of tokens hw.
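To make the gap concrete, here is a back-of-the-envelope comparison (my own sketch, using the standard estimates 4hwc² + 2(hw)²c for global attention and 4hwc² + (sr×w + sc×h)×hwc for the parallel PS-Attention) on a stage-1-sized feature map:

```python
# Rough FLOP estimates (my sketch, not the paper's code).
# Global MSA: 4*h*w*c^2 (Q/K/V/output projections) + 2*(h*w)^2*c (attention matmuls).
def global_msa_flops(h, w, c):
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c

# Parallel PS-Attention: same projections, but attention is restricted to
# row-wise and column-wise groups, giving (sr*w + sc*h)*h*w*c for the matmuls.
def ps_attention_flops(h, w, c, sr, sc):
    return 4 * h * w * c**2 + (sr * w + sc * h) * h * w * c

h, w, c, sr, sc = 56, 56, 96, 7, 7
g = global_msa_flops(h, w, c)
p = ps_attention_flops(h, w, c, sr, sc)
print(f"global: {g/1e9:.2f} GFLOPs, PS: {p/1e9:.2f} GFLOPs, ratio: {g/p:.1f}x")
```

On this 56×56×96 example the attention term dominates the global cost, so PS-Attention is several times cheaper, and the gap widens at higher resolutions.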

1.4. Pale Transformer Block

Pale Transformer Block
  • The Pale Transformer block consists of three sequential parts: the conditional position encoding (CPE), as in CPVT, for dynamically generating the positional embedding; the proposed PS-Attention module for capturing contextual information; and the MLP module for feature projection.
  • The forward pass of the l-th block can be formulated as follows, where LN denotes Layer Normalization:
    X̂l = CPE(Xl−1) + Xl−1
    X̃l = PS-Attention(LN(X̂l)) + X̂l
    Xl = MLP(LN(X̃l)) + X̃l
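The residual composition of the block can be sketched in plain Python, with the three sub-modules stubbed out as placeholder callables (my sketch; `cpe`, `ps_attention`, `mlp`, and `ln` stand in for the real CPE, PS-Attention, MLP, and Layer Normalization layers):

```python
# Structural sketch of one Pale Transformer block (stubs, not real layers).
def pale_block(x, cpe, ln, ps_attention, mlp):
    x = x + cpe(x)               # conditional position encoding (CPE) + residual
    x = x + ps_attention(ln(x))  # PS-Attention with pre-norm residual
    x = x + mlp(ln(x))           # MLP with pre-norm residual
    return x

# Toy scalar stand-ins just to show the data flow.
identity = lambda v: v
out = pale_block(1.0, cpe=lambda v: 0.0, ln=identity,
                 ps_attention=identity, mlp=identity)
print(out)  # 1 + 0 = 1; 1 + 1 = 2; 2 + 2 = 4 -> 4.0
```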

2. Pale Transformer

2.1. Overall Architecture

The overall architecture of our Pale Transformer
  • The Pale Transformer consists of four hierarchical stages. Each stage contains a patch merging layer and multiple Pale Transformer blocks.
  • The patch merging layer spatially downsamples the input features by a certain ratio and doubles the channel dimension for a better representation capacity.
  • For fair comparison, the overlapping convolution for patch merging is used, the same as in CvT.
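The resulting stage-wise shape progression can be sketched as follows (my sketch, assuming a 224×224 input, a 4× downsample in the first patch-merging layer and 2× in the later ones, with channels doubling each stage; the base channel count 64 is an illustrative value, not necessarily the paper's):

```python
# Shape of the feature map entering each of the four stages.
def stage_shapes(img_size=224, base_dim=64, num_stages=4):
    # First patch merging downsamples 4x; later ones 2x; channels double.
    shapes = []
    res, dim = img_size // 4, base_dim
    for _ in range(num_stages):
        shapes.append((res, res, dim))
        res, dim = res // 2, dim * 2
    return shapes

print(stage_shapes())  # [(56, 56, 64), (28, 28, 128), (14, 14, 256), (7, 7, 512)]
```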

2.2. Model Variants

Detailed configurations of Pale Transformer Variants.
  • Three variants of the Pale Transformer are provided, named Pale-T (Tiny), Pale-S (Small), and Pale-B (Base), respectively.
  • All variants have the same depths of [2, 2, 16, 2] in the four stages.
  • In each stage of these variants, the pale size is set as sr = sc = 7, and the same MLP expansion ratio of Ri = 4 is used.
  • Thus, the main differences among Pale-T, Pale-S, and Pale-B lie in the embedding dimension of tokens and the head number for the PS-Attention in four stages, i.e., variants vary from narrow to wide.
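Since the depths are fixed at [2, 2, 16, 2], the variants differ mainly in width, and parameter count scales roughly quadratically with the embedding dimension. A rough illustration (my sketch; the widths below are hypothetical placeholders, not the paper's actual configurations):

```python
# Rough parameter estimate per stage: params ~ depth * (12 * dim^2),
# the usual transformer-block count (4*dim^2 attention + 8*dim^2 MLP at ratio 4).
def approx_params(depths, dims):
    return sum(d * 12 * c * c for d, c in zip(depths, dims))

depths = [2, 2, 16, 2]
# Hypothetical widths for a narrower vs. a wider variant (illustrative only).
narrow = [64, 128, 256, 512]
wide = [96, 192, 384, 768]
print(approx_params(depths, narrow) / 1e6, "M vs",
      approx_params(depths, wide) / 1e6, "M")
```

Scaling every stage width by 1.5× multiplies the (approximate) parameter count by 1.5² = 2.25×, which is how a narrow-to-wide family trades accuracy against model size.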

3. Experimental Results

3.1. ImageNet-1K

Comparisons of different backbones on ImageNet-1K validation set.
  • All the variants are trained from scratch for 300 epochs on 8 V100 GPUs with a total batch size of 1024.

3.2. ADE20K

Comparisons of different backbones with UPerNet as decoder on ADE20K for semantic segmentation
  • UPerNet is used as decoder. SS: Single-Scale, MS: Multi-Scale.

3.3. COCO

Comparisons on COCO val2017 with Mask R-CNN framework and 1x training schedule for object detection and instance segmentation
  • Besides, the proposed variants also achieve consistent improvements on instance segmentation, of +0.5, +0.5, and +0.3 mask mAP over the previous best backbone.

4. Ablation Studies

Ablation study for different choices of pale size
Ablation study for different attention modes

Reference

[2022 AAAI] [Pale Transformer]
Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

1.1. Image Classification

1989–2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer]

My Other Previous Paper Readings
