Review — SAN: Exploring Self-attention for Image Recognition

SAN, Self-Attention Networks

Sik-Ho Tsang
Aug 23, 2022

Exploring Self-attention for Image Recognition, by CUHK and Intel Labs
2020 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Self-Attention

  • Two forms of self-attention are considered.
  • One is pairwise self-attention, which generalizes standard dot-product attention and is fundamentally a set operator.
  • The other is patchwise self-attention, which is strictly more powerful than convolution.


  1. Pairwise Self-Attention
  2. Patchwise Self-Attention
  3. Self-Attention Network (SAN) Variants
  4. Experimental Results

1. Pairwise Self-Attention

1.1. Definitions

  • The pairwise self-attention has the following form:
      $$y_i = \sum_{j \in R(i)} \alpha(x_i, x_j) \odot \beta(x_j)$$
  • where ⊙ is the Hadamard product (i.e. element-wise multiplication), i is the spatial index of feature vector xi (i.e., its location in the feature map), and R(i) is the local footprint of the aggregation.
  • The attention weight α is decomposed as follows:
      $$\alpha(x_i, x_j) = \gamma(\delta(x_i, x_j))$$
  • where the relation function δ outputs a single vector that represents the features xi and xj. The function γ (dimension mapping) then maps this vector into a vector that can be combined with β(xj).
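  • Multiple forms for the relation function δ are explored:
      Summation: $\delta(x_i, x_j) = \varphi(x_i) + \psi(x_j)$
      Subtraction: $\delta(x_i, x_j) = \varphi(x_i) - \psi(x_j)$
      Concatenation: $\delta(x_i, x_j) = [\varphi(x_i), \psi(x_j)]$
      Hadamard product: $\delta(x_i, x_j) = \varphi(x_i) \odot \psi(x_j)$
      Dot product: $\delta(x_i, x_j) = \varphi(x_i)^\top \psi(x_j)$
  • Here φ and ψ are trainable transformations, such as linear mappings, with matching output dimensionality.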
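
As a rough illustration of the definitions above, the following NumPy sketch computes y_i as the sum over j in R(i) of α(x_i, x_j) ⊙ β(x_j), using the subtraction relation for δ. The function and weight names, the single linear layer used for γ, the softmax normalization over the footprint, and the clipping of the footprint at image borders are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pairwise_self_attention(x, W_phi, W_psi, W_beta, W_gamma, k=3):
    """x: (H, W, C) feature map, k: odd side length of the footprint R(i).

    W_phi, W_psi: (C, C_rel); W_beta: (C, C_out); W_gamma: (C_rel, C_out).
    Returns y with shape (H, W, C_out).
    """
    H, W, _ = x.shape
    r = k // 2
    phi, psi, beta = x @ W_phi, x @ W_psi, x @ W_beta
    y = np.zeros((H, W, W_beta.shape[1]))
    for i0 in range(H):
        for i1 in range(W):
            # Footprint R(i), clipped at the border (assumption).
            idx = [(a, b)
                   for a in range(max(0, i0 - r), min(H, i0 + r + 1))
                   for b in range(max(0, i1 - r), min(W, i1 + r + 1))]
            # Relation delta(x_i, x_j) = phi(x_i) - psi(x_j), one row per j.
            delta = np.stack([phi[i0, i1] - psi[a, b] for a, b in idx])
            # gamma is a single linear map here; weights are normalized
            # over the footprint, separately for each channel (assumption).
            alpha = softmax(delta @ W_gamma, axis=0)
            values = np.stack([beta[a, b] for a, b in idx])
            # Hadamard product followed by aggregation over j.
            y[i0, i1] = (alpha * values).sum(axis=0)
    return y
```

Each weight depends only on the pair (x_i, x_j), so reordering the entries of the footprint leaves the sum unchanged, which is the sense in which the operator acts on a set.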

1.2. Position Encoding

  • The horizontal and vertical coordinates along the feature map are first normalized to the range [-1, 1] in each dimension. These normalized two-dimensional coordinates are then passed through a trainable linear layer, which can map them to an appropriate range for each layer in the network.
  • For each pair (i, j) such that j ∈ R(i), the relative position information is encoded by calculating the difference pi − pj. The output of δ(xi, xj) is augmented by concatenating [pi − pj] prior to the mapping γ.
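
The position encoding described above might be sketched as follows. The grid size, the two-dimensional output of the linear layer, the random weights, and the stand-in vector for δ(x_i, x_j) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def normalized_coords(H, W):
    # Horizontal and vertical coordinates normalized to [-1, 1].
    ys = np.linspace(-1.0, 1.0, H)
    xs = np.linspace(-1.0, 1.0, W)
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([gx, gy], axis=-1)  # shape (H, W, 2)

rng = np.random.default_rng(0)
W_pos = rng.normal(size=(2, 2))  # trainable linear layer (random here)
b_pos = rng.normal(size=2)

coords = normalized_coords(4, 4)
p = coords @ W_pos + b_pos       # position encoding p_i per location

i, j = (1, 1), (1, 2)            # j is assumed to lie in R(i)
rel = p[i] - p[j]                # relative position p_i - p_j

delta_ij = rng.normal(size=8)    # stand-in for delta(x_i, x_j)
augmented = np.concatenate([delta_ij, rel])  # input to gamma
```

Because the layer is linear, the bias cancels in the difference, so p_i − p_j depends only on the offset between i and j and not on their absolute location.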

2. Patchwise Self-Attention

  • Patchwise self-attention has the following form:
      $$y_i = \sum_{j \in R(i)} \alpha(x_{R(i)})_j \odot \beta(x_j)$$
  • where xR(i) is the patch of feature vectors in the footprint R(i), and α(xR(i)) is a tensor of the same spatial dimensionality as the patch xR(i).
  • α(xR(i))j is the vector at location j in this tensor, corresponding spatially to the vector xj in xR(i).

Unlike pairwise self-attention, patchwise self-attention is no longer a set operation with respect to the features xj.

  • α is decomposed as follows:
      $$\alpha(x_{R(i)}) = \gamma(\delta(x_{R(i)}))$$
  • where γ performs the dimension mapping, as in the pairwise case.
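  • Multiple forms for the relation function δ are explored:
      Star-product: $\delta(x_{R(i)}) = [\varphi(x_i)^\top \psi(x_j)]_{\forall j \in R(i)}$
      Clique-product: $\delta(x_{R(i)}) = [\varphi(x_j)^\top \psi(x_k)]_{\forall j,k \in R(i)}$
      Concatenation: $\delta(x_{R(i)}) = [\varphi(x_i), [\psi(x_j)]_{\forall j \in R(i)}]$
  • where φ and ψ are trainable transformations.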
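
A minimal NumPy sketch of the patchwise form, using a concatenation-style relation, is given below. The names, the zero padding at the borders, the single linear layer (with bias `b_gamma`) standing in for γ, and the absence of any normalization of the weights are assumptions for illustration only.

```python
import numpy as np

def patchwise_self_attention(x, W_phi, W_psi, W_beta, W_gamma, b_gamma, k=3):
    """x: (H, W, C) feature map, k: odd side length of the footprint.

    W_phi, W_psi: (C, C_rel); W_beta: (C, C_out);
    W_gamma: (C_rel + k*k*C_rel, k*k*C_out); b_gamma: (k*k*C_out,).
    """
    H, W, _ = x.shape
    r, n = k // 2, k * k
    c_out = W_beta.shape[1]
    # Zero padding so that every footprint has exactly k*k entries.
    pad = lambda t: np.pad(t, ((r, r), (r, r), (0, 0)))
    phi = x @ W_phi
    psi_p, beta_p = pad(x @ W_psi), pad(x @ W_beta)
    y = np.zeros((H, W, c_out))
    for i0 in range(H):
        for i1 in range(W):
            psi_patch = psi_p[i0:i0 + k, i1:i1 + k].reshape(n, -1)
            beta_patch = beta_p[i0:i0 + k, i1:i1 + k].reshape(n, c_out)
            # delta: phi(x_i) concatenated with psi(x_j) for all j in R(i).
            delta = np.concatenate([phi[i0, i1], psi_patch.ravel()])
            # gamma maps the whole patch descriptor to one weight vector
            # per location j in the footprint.
            alpha = (delta @ W_gamma + b_gamma).reshape(n, c_out)
            y[i0, i1] = (alpha * beta_patch).sum(axis=0)
    return y
```

The weights at location j are computed from the whole ordered patch, so permuting the patch entries generally changes the output. If `W_gamma` is zero, the weights reduce to the constant `b_gamma` and the operator behaves like a per-channel convolution with a fixed kernel, which gives some intuition for why this form can express convolution as a special case.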

3. Self-Attention Network (SAN) Variants

3.1. Self-Attention Block

The proposed self-attention block. C is the channel dimensionality
  • The block follows the residual design that originated in ResNet.
  • Within the main path, the input feature tensor (channel dimensionality C) is passed through two processing streams.
  • The left stream evaluates the attention weights α by computing the function δ (via the mappings φ and ψ) and a subsequent mapping γ.
  • The right stream applies a linear transformation β.

3.2. SAN Variants

Self-attention networks for image recognition
  • The backbone of SAN has five stages, yielding a resolution reduction factor of 32.
  • Consecutive stages are bridged by transition layers that reduce spatial resolution and expand channel dimensionality. The transition comprises a batch normalization layer, a ReLU, 2×2 max pooling with stride 2, and a linear mapping.
  • The output of the last stage is processed by a classification layer that comprises global average pooling, a linear layer, and a softmax.
  • The local footprint R(i) controls the amount of context gathered by a self-attention operator. The footprint size is set to 7×7 for the last four stages of SAN. The footprint is set to 3×3 in the first stage due to the high resolution of that stage and the consequent memory consumption.
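
The resolution and footprint numbers above can be checked with a little arithmetic. The sketch below assumes a 224×224 input and one 2× downsampling before each of the five stages; both are assumptions consistent with the stated reduction factor of 32 rather than details given in this review.

```python
def stage_resolutions(input_size=224, num_stages=5):
    # Each stage is assumed to be preceded by a 2x spatial reduction.
    sizes, size = [], input_size
    for _ in range(num_stages):
        size //= 2
        sizes.append(size)
    return sizes

def attention_pairs(resolution, footprint):
    # Number of (i, j) pairs for which a weight vector must be computed.
    return resolution * resolution * footprint * footprint

resolutions = stage_resolutions()          # [112, 56, 28, 14, 7]
reduction = 224 // resolutions[-1]         # 32

first_stage_3x3 = attention_pairs(resolutions[0], 3)
first_stage_7x7 = attention_pairs(resolutions[0], 7)
```

Under these assumptions, a 7×7 footprint in the first stage would need 49/9, or roughly 5.4 times, as many weight vectors as the 3×3 footprint, which is in line with the memory argument for the smaller footprint there.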

Three variants are designed, SAN10, SAN15, and SAN19, in rough correspondence with ResNet26, ResNet38, and ResNet50, respectively. The number X in SANX refers to the number of self-attention blocks.

Comparison with convolution and scalar attention
  • The convolution does not adapt to the content of the image. Scalar attention produces scalar weights that do not vary along the channel dimension.

The proposed operators efficiently compute attention weights that adapt across both spatial dimensions and channels.
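
The difference between the three kinds of weights can be seen from their shapes alone. In the sketch below, random numbers stand in for learned or computed weights, and the convolution is simplified to a per-channel (depthwise) one; both are assumptions made only to keep the comparison short.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, c = 9, 4                          # footprint size |R(i)| and channels
values = rng.normal(size=(n, c))     # stand-in for beta(x_j), j in R(i)

# Convolution: weights are fixed parameters, the same for every image.
conv_w = rng.normal(size=(n, c))
y_conv = (conv_w * values).sum(axis=0)

# Scalar attention: content-dependent, but one weight per location j
# that is shared by all channels.
scalar_w = softmax(rng.normal(size=(n, 1)), axis=0)
y_scalar = (scalar_w * values).sum(axis=0)

# Vector attention (as in SAN): content-dependent, with a separate
# weight for every location j and every channel.
vector_w = softmax(rng.normal(size=(n, c)), axis=0)
y_vector = (vector_w * values).sum(axis=0)
```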

  • All models are trained from scratch for 100 epochs with minibatch size 256 on 8 GPUs.

4. Experimental Results

4.1. Comparison With ResNet

Comparison of self-attention networks and convolutional residual networks on ImageNet classification
  • The pairwise models match or outperform the convolutional baselines, with similar or lower parameter and FLOP budgets.

The patchwise models perform even better.

4.2. Relation Function

Controlled comparison of different relation functions on the val-split set

For pairwise self-attention, summation, subtraction, and Hadamard product achieve similar accuracy.

For patchwise self-attention, concatenation achieves slightly higher accuracy than star-product and clique-product.

4.3. Mapping Function

Controlled comparison of different mapping functions on the val-split set. L and R denote Linear and ReLU layers, respectively

For pairwise models, using two linear layers yields the highest accuracy.

For patchwise models, different settings yield similar accuracy.

4.4. Transformation Functions

Controlled evaluation of the use of distinct transformation functions

Setting φ=ψ=β, with r1=r2=4, yields comparable accuracy to φ=ψ≠β but at higher FLOP counts.

4.5. Footprint Size

Controlled assessment of the impact of footprint size

The accuracy initially increases with footprint size and then saturates.

4.6. Position Encoding

The importance of position encoding in pairwise self-attention

Without position encoding, top-1 accuracy drops by 5 percentage points.

4.7. Zero-Shot Generalization to Rotated Images

Robustness of trained networks to rotation and flipping of images at test time
  • ImageNet images from the val-original set are rotated and flipped in one of four ways.

Pairwise self-attention models are more robust to this kind of manipulation than convolutional networks.
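
A test-time manipulation of this kind could be generated as sketched below. The review does not list the four manipulations, so the choice here (three clockwise rotations and an upside-down flip) and the function name are assumptions.

```python
import numpy as np

def manipulations(img):
    """img: (H, W, C) array. Returns four rotated or flipped copies."""
    return {
        "rot90_cw": np.rot90(img, k=-1),   # negative k rotates clockwise
        "rot180": np.rot90(img, k=2),
        "rot270_cw": np.rot90(img, k=-3),
        "flip_ud": np.flipud(img),         # upside-down flip
    }

img = np.arange(6).reshape(2, 3, 1)
variants = manipulations(img)
```

Each variant would then be fed to the trained network without any fine-tuning, and accuracy compared against the unmodified images.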

4.8. Adversarial Attacks

Robustness of trained networks to adversarial attacks on the val-original set

The self-attention models are much more robust than convolutional networks.

In summary, the proposed self-attention operators efficiently adapt weights across both spatial dimensions and channels.


[2020 CVPR] [SAN]
Exploring Self-attention for Image Recognition
