Review — SAN: Exploring Self-attention for Image Recognition

SAN, Self-Attention Networks

5 min read · Aug 23, 2022

Exploring Self-attention for Image Recognition
SAN, by CUHK and Intel Labs
2020 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Self-Attention

Two forms of self-attention are considered.
One is pairwise self-attention, which generalizes standard dot-product attention and is fundamentally a set operator.
The other is patchwise self-attention, which is strictly more powerful than convolution.

Outline

1. Pairwise Self-Attention
2. Patchwise Self-Attention
3. Self-Attention Network (SAN) Variants
4. Experimental Results

1. Pairwise Self-Attention

1.1. Definitions

Pairwise self-attention has the following form:

$$y_i = \sum_{j \in R(i)} \alpha(x_i, x_j) \odot \beta(x_j)$$

where ⊙ is the Hadamard product (i.e. element-wise multiplication), i is the spatial index of feature vector xi (i.e., its location in the feature map), and R(i) is the local footprint of the aggregation.
The attention function α is decomposed as follows:

$$\alpha(x_i, x_j) = \gamma(\delta(x_i, x_j))$$

where the relation function δ outputs a single vector that represents the features xi and xj. The function γ (dimension mapping) then maps this vector into a vector that can be combined with β(xj).
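Multiple forms for the relation function δ are explored:

$$
\begin{aligned}
\text{Summation:} \quad & \delta(x_i, x_j) = \varphi(x_i) + \psi(x_j) \\
\text{Subtraction:} \quad & \delta(x_i, x_j) = \varphi(x_i) - \psi(x_j) \\
\text{Concatenation:} \quad & \delta(x_i, x_j) = [\varphi(x_i), \psi(x_j)] \\
\text{Hadamard product:} \quad & \delta(x_i, x_j) = \varphi(x_i) \odot \psi(x_j) \\
\text{Dot product:} \quad & \delta(x_i, x_j) = \varphi(x_i)^\top \psi(x_j)
\end{aligned}
$$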

φ and ψ are trainable transformations such as linear mappings, and have matching output dimensionality.
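
To make the definitions concrete, here is a minimal NumPy sketch of pairwise self-attention with the subtraction relation. It is an illustration under stated assumptions, not the authors' implementation: φ, ψ, β, and γ are each a single bias-free linear map (the weight matrices `W_phi`, `W_psi`, `W_beta`, `W_gamma` are hypothetical names), the footprint is clipped at the image border, and the weights are softmax-normalized over the footprint, which is a choice made for this sketch.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pairwise_self_attention(x, W_phi, W_psi, W_beta, W_gamma, k=3):
    """x: (H, W, C) feature map, k: odd footprint size. Returns (H, W, C_out)."""
    H, W, _ = x.shape
    r = k // 2
    phi, psi, beta = x @ W_phi, x @ W_psi, x @ W_beta
    y = np.zeros((H, W, W_beta.shape[1]))
    for a in range(H):
        for b in range(W):
            # Footprint R(i): a k x k window around i, clipped at the borders.
            rows = range(max(0, a - r), min(H, a + r + 1))
            cols = range(max(0, b - r), min(W, b + r + 1))
            idx = [(p, q) for p in rows for q in cols]
            # delta: subtraction relation, one vector per neighbour j.
            delta = np.stack([phi[a, b] - psi[p, q] for p, q in idx])
            # gamma: map each relation vector to one weight per output channel.
            alpha = softmax(delta @ W_gamma, axis=0)  # assumption: softmax over j
            values = np.stack([beta[p, q] for p, q in idx])
            # Hadamard product with beta(x_j), then sum over the footprint.
            y[a, b] = (alpha * values).sum(axis=0)
    return y
```

Note that `alpha` has one weight per neighbour and per channel, and that each weight depends only on the pair (xi, xj), so shuffling the neighbours inside the footprint would leave the output unchanged. This is the sense in which the operator acts on a set.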

1.2. Position Encoding

The horizontal and vertical coordinates along the feature map are first normalized to the range [-1, 1] in each dimension. These normalized two-dimensional coordinates are then passed through a trainable linear layer, which can map them to an appropriate range for each layer in the network.
For each pair (i, j) such that j ∈ R(i), the relative position information is encoded by calculating the difference pi − pj. The output of δ(xi, xj) is augmented by concatenating [pi − pj] prior to the mapping γ.
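
The position encoding described above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the weight `W_pos`, and the bias `b_pos` are hypothetical, and the trainable linear layer is represented by a fixed matrix and bias.

```python
import numpy as np

def normalized_coords(H, W):
    # Row and column coordinates, each normalized to [-1, 1].
    rows = np.linspace(-1.0, 1.0, H)
    cols = np.linspace(-1.0, 1.0, W)
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([rr, cc], axis=-1)  # (H, W, 2)

def position_features(H, W, W_pos, b_pos):
    # Linear layer on the 2-D coordinates; W_pos: (2, D), b_pos: (D,).
    return normalized_coords(H, W) @ W_pos + b_pos  # (H, W, D)

def augmented_relation(delta_ij, p, i, j):
    # Concatenate the relative position p_i - p_j to the output of delta.
    # i and j are (row, col) tuples indexing the position map p.
    return np.concatenate([delta_ij, p[i] - p[j]])
```

Because only the difference pi − pj is used, any bias in the linear layer cancels, and the encoding depends on the offset between the two locations rather than on their absolute positions.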

2. Patchwise Self-Attention

Patchwise self-attention has the following form:

$$y_i = \sum_{j \in R(i)} \alpha(x_{R(i)})_j \odot \beta(x_j)$$

where xR(i) is the patch of feature vectors in the footprint R(i), and α(xR(i)) is a tensor of the same spatial dimensionality as the patch xR(i).
α(xR(i))j is the vector at location j in this tensor, corresponding spatially to the vector xj in xR(i).

Unlike pairwise self-attention, patchwise self-attention is no longer a set operation with respect to the features xj.

α is decomposed as follows:

$$\alpha(x_{R(i)}) = \gamma(\delta(x_{R(i)}))$$

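where γ maps the vector produced by δ to a tensor of the appropriate dimensionality, holding one weight vector for each location j in the footprint.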
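Multiple forms for the relation function δ are explored:

$$
\begin{aligned}
\text{Star-product:} \quad & \delta(x_{R(i)}) = [\varphi(x_i)^\top \psi(x_j)]_{\forall j \in R(i)} \\
\text{Clique-product:} \quad & \delta(x_{R(i)}) = [\varphi(x_j)^\top \psi(x_k)]_{\forall j,k \in R(i)} \\
\text{Concatenation:} \quad & \delta(x_{R(i)}) = [\varphi(x_i), [\psi(x_j)]_{\forall j \in R(i)}]
\end{aligned}
$$

where φ and ψ are trainable transformations.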
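
A rough NumPy sketch of patchwise self-attention with the concatenation relation follows. As before, this is an illustration rather than the authors' code: φ, ψ, β, and γ are single bias-free linear maps with hypothetical names, borders are zero-padded so that every footprint has exactly k×k entries, and the softmax over the footprint is an assumption of this sketch.

```python
import numpy as np

def patchwise_self_attention(x, W_phi, W_psi, W_beta, W_gamma, k=3):
    """x: (H, W, C). W_gamma: (C_rel + k*k*C_rel, k*k*C_out). Returns (H, W, C_out)."""
    H, W, _ = x.shape
    r, n = k // 2, k * k
    c_out = W_beta.shape[1]
    pad = ((r, r), (r, r), (0, 0))      # zero-pad the two spatial axes only
    phi = x @ W_phi
    psi = np.pad(x @ W_psi, pad)
    beta = np.pad(x @ W_beta, pad)
    y = np.zeros((H, W, c_out))
    for a in range(H):
        for b in range(W):
            # Padded window [a, a+k) x [b, b+k) is the footprint centred on (a, b).
            psi_patch = psi[a:a + k, b:b + k].reshape(n, -1)
            beta_patch = beta[a:a + k, b:b + k].reshape(n, c_out)
            # delta: concatenate phi(x_i) with psi(x_j) for every j in R(i).
            delta = np.concatenate([phi[a, b], psi_patch.ravel()])
            # gamma: one linear map producing a weight vector per location j.
            logits = (delta @ W_gamma).reshape(n, c_out)
            logits = logits - logits.max(axis=0, keepdims=True)
            alpha = np.exp(logits)
            alpha = alpha / alpha.sum(axis=0, keepdims=True)  # assumption: softmax over j
            y[a, b] = (alpha * beta_patch).sum(axis=0)
    return y
```

Here each row of `W_gamma` is tied to a specific slot in the patch, so reordering the features inside the footprint changes the weights. That is why the operator is no longer a set operation.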

3. Self-Attention Network (SAN) Variants

3.1. Self-Attention Block

**The proposed self-attention block.** C is the channel dimensionality

The block follows the residual design that originated in ResNet.
Within the main path, the input feature tensor (channel dimensionality C) is passed through two processing streams.
The left stream evaluates the attention weights α by computing the function δ (via the mappings φ and ψ) and a subsequent mapping γ.
The right stream applies a linear transformation β.

3.2. SAN Variants

**Self-attention networks for image recognition**

The backbone of SAN has five stages, yielding a resolution reduction factor of 32.
Consecutive stages are bridged by transition layers that reduce spatial resolution and expand channel dimensionality. The transition comprises a batch normalization layer, a ReLU, 2×2 max pooling with stride 2, and a linear mapping.
The output of the last stage is processed by a classification layer that comprises global average pooling, a linear layer, and a softmax.
The local footprint R(i) controls the amount of context gathered by a self-attention operator. The footprint size is set to 7×7 for the last four stages of SAN. The footprint is set to 3×3 in the first stage due to the high resolution of that stage and the consequent memory consumption.
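
As a quick arithmetic check on the numbers above, the sketch below lists the spatial resolution and footprint per stage. It assumes a 224×224 input (the usual ImageNet crop, not stated in this review) and that each of the five stages is preceded by one stride-2 pooling step, which is what a total reduction factor of 32 implies. The function names are hypothetical.

```python
def stage_resolutions(input_size=224, num_stages=5):
    # Each transition halves the resolution (2x2 max pooling, stride 2).
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= 2
        sizes.append(size)
    return sizes

def footprint_sizes(num_stages=5):
    # 3x3 footprint in the first stage, 7x7 in the remaining stages.
    return [3] + [7] * (num_stages - 1)

for stage, (res, k) in enumerate(zip(stage_resolutions(), footprint_sizes()), start=1):
    print(f"stage {stage}: {res}x{res} feature map, {k}x{k} footprint")
```

Under these assumptions the first stage has 112×112 locations against 7×7 in the last, a factor of 256, which gives a sense of why the smaller footprint is used there.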

Three variants, SAN10, SAN15, and SAN19, are designed in rough correspondence with ResNet26, ResNet38, and ResNet50, respectively. The number X in SANX refers to the number of self-attention blocks.

**Comparison with convolution and scalar attention**

Convolution uses fixed weights that do not adapt to the content of the image. Scalar attention produces content-dependent weights, but they are scalars that do not vary along the channel dimension.

The proposed operators efficiently compute attention weights that adapt across both spatial dimensions and channels.
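
The difference between scalar and vector attention weights comes down to the shape of the weight array. The toy example below uses made-up numbers and a hypothetical helper `aggregate`. It only illustrates the aggregation step, not how the weights are computed.

```python
import numpy as np

def aggregate(weights, values):
    """weights: (n,) scalar attention or (n, C) vector attention; values: (n, C)."""
    weights = np.asarray(weights, dtype=float)
    if weights.ndim == 1:
        # Scalar attention: the same weight is shared by every channel.
        weights = weights[:, None]
    return (weights * values).sum(axis=0)

# Three neighbours with two channels each (made-up values).
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scalar_w = np.array([0.5, 0.25, 0.25])
vector_w = np.array([[0.5, 0.0], [0.25, 0.0], [0.25, 1.0]])

print(aggregate(scalar_w, values))  # [ 1.75 17.5 ]
print(aggregate(vector_w, values))  # [ 1.75 30.  ]
```

With vector weights, the second channel can attend entirely to the third neighbour while the first channel keeps a different mixture, which scalar weights cannot express.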

All models are trained from scratch for 100 epochs with minibatch size 256 on 8 GPUs.

4. Experimental Results

4.1. Comparison With ResNet

**Comparison of self-attention networks and convolutional residual networks on ImageNet classification**

The pairwise models match or outperform the convolutional baselines, with similar or lower parameter and FLOP budgets.

The patchwise models perform even better.

4.2. Relation Function

**Controlled comparison of different relation functions on the val-split set**

For pairwise self-attention, summation, subtraction, and Hadamard product achieve similar accuracy.
For patchwise self-attention, concatenation achieves slightly higher accuracy than star-product and clique-product.

4.3. Mapping Function

**Controlled comparison of different mapping functions on the val-split set. L and R denote Linear and ReLU layers, respectively**

For pairwise models, using two linear layers yields the highest accuracy.
For patchwise models, different settings yield similar accuracy.