Review — SAN: Exploring Self-attention for Image Recognition
SAN, Self-Attention Networks
Exploring Self-attention for Image Recognition
SAN, by CUHK and Intel Labs
2020 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Self-Attention
- Two forms of self-attention are considered.
- One is pairwise self-attention, which generalizes standard dot-product attention and is fundamentally a set operator.
- The other is patchwise self-attention, which is strictly more powerful than convolution.
Outline
- Pairwise Self-Attention
- Patchwise Self-Attention
- Self-Attention Network (SAN) Variants
- Experimental Results
1. Pairwise Self-Attention
1.1. Definitions
- Pairwise self-attention has the following form:
- yi = Σ_{j∈R(i)} α(xi, xj) ⊙ β(xj)
- where ⊙ is the Hadamard product (i.e. element-wise multiplication), i is the spatial index of the feature vector xi (i.e., its location in the feature map), and R(i) is the local footprint of the aggregation.
- And α is decomposed as follows:
- α(xi, xj) = γ(δ(xi, xj))
- where the relation function δ outputs a single vector that represents the features xi and xj. The function γ (dimension mapping) then maps this vector into a vector that can be combined with β(xj).
- Multiple forms for the relation function δ are explored:
- Summation: δ(xi, xj) = φ(xi) + ψ(xj)
- Subtraction: δ(xi, xj) = φ(xi) − ψ(xj)
- Concatenation: δ(xi, xj) = [φ(xi), ψ(xj)]
- Hadamard product: δ(xi, xj) = φ(xi) ⊙ ψ(xj)
- Dot product: δ(xi, xj) = φ(xi)ᵀψ(xj)
- where φ and ψ are trainable transformations such as linear mappings, and have matching output dimensionality. (A code sketch of the pairwise operator is given after this list.)
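As a rough illustration only (not the authors' reference implementation), the following PyTorch-style sketch expresses pairwise self-attention over a local footprint. The class name, the use of unfold to gather the footprint, the choice of the subtraction relation, and the omission of any weight normalization are assumptions made for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseSelfAttention2d(nn.Module):
    """Sketch of pairwise self-attention over a k x k footprint R(i).

    Assumptions: the subtraction relation delta(x_i, x_j) = phi(x_i) - psi(x_j),
    no position encoding, no weight normalization, and attention weights with
    the full channel dimensionality of beta(x_j).
    """
    def __init__(self, channels, footprint=7):
        super().__init__()
        self.k = footprint
        self.phi = nn.Conv2d(channels, channels, 1)    # trainable mapping phi
        self.psi = nn.Conv2d(channels, channels, 1)    # trainable mapping psi
        self.beta = nn.Conv2d(channels, channels, 1)   # trainable mapping beta
        self.gamma = nn.Sequential(                    # maps delta(.) to alpha(.)
            nn.ReLU(),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        pad = self.k // 2
        phi_x = self.phi(x)                                              # (b, c, h, w)
        # Gather the footprint R(i) around every position i.
        psi_j = F.unfold(self.psi(x), self.k, padding=pad).view(b, c, self.k * self.k, h, w)
        beta_j = F.unfold(self.beta(x), self.k, padding=pad).view(b, c, self.k * self.k, h, w)
        # delta(x_i, x_j) = phi(x_i) - psi(x_j) for every j in R(i).
        delta = phi_x.unsqueeze(2) - psi_j                               # (b, c, k*k, h, w)
        # alpha(x_i, x_j) = gamma(delta(x_i, x_j)), applied per (i, j) pair.
        alpha = self.gamma(delta.view(b, c, -1, w)).view(b, c, self.k * self.k, h, w)
        # y_i = sum over j of alpha(x_i, x_j) (Hadamard product) beta(x_j).
        return (alpha * beta_j).sum(dim=2)                               # (b, c, h, w)
```

The unfold-based gathering makes the set aggregation explicit but is memory-hungry; it is only meant to show the data flow, e.g. `PairwiseSelfAttention2d(64)(torch.randn(2, 64, 32, 32))` keeps the input shape.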
1.2. Position Encoding
- The horizontal and vertical coordinates along the feature map are first normalized to the range [-1, 1] in each dimension. These normalized two-dimensional coordinates are then passed through a trainable linear layer, which can map them to an appropriate range for each layer in the network.
- For each pair (i, j) such that j ∈ R(i), the relative position information is encoded by calculating the difference pi − pj. The output of δ(xi, xj) is augmented by concatenating [pi − pj] prior to the mapping γ, as sketched below.
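A minimal sketch of how these relative position features could be formed, assuming unfold-based footprint gathering and an arbitrary output width out_dim (both illustrative choices, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_position_features(h, w, footprint=7, out_dim=2):
    """Sketch: normalized coordinates -> trainable linear map -> p_i - p_j."""
    ys = torch.linspace(-1.0, 1.0, h)                    # vertical coords in [-1, 1]
    xs = torch.linspace(-1.0, 1.0, w)                    # horizontal coords in [-1, 1]
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=0)   # (2, h, w)
    proj = nn.Conv2d(2, out_dim, kernel_size=1)          # trainable linear mapping
    p = proj(grid.unsqueeze(0))                          # (1, out_dim, h, w)
    pad = footprint // 2
    p_j = F.unfold(p, footprint, padding=pad).view(1, out_dim, footprint * footprint, h, w)
    # p_i - p_j for every neighbour j in R(i); in the block this is concatenated
    # to delta(x_i, x_j) before the mapping gamma.
    return p.unsqueeze(2) - p_j                          # (1, out_dim, k*k, h, w)
```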
2. Patchwise Self-Attention
- Patchwise self-attention has the following form:
- yi = Σ_{j∈R(i)} α(xR(i))j ⊙ β(xj)
- where xR(i) is the patch of feature vectors in the footprint R(i). α(xR(i)) is a tensor of the same spatial dimensionality as the patch xR(i).
- α(xR(i))j is the vector at location j in this tensor, corresponding spatially to the vector xj in xR(i).
- Unlike pairwise self-attention, patchwise self-attention is no longer a set operation with respect to the features xj.
- α is decomposed as follows:
- α(xR(i)) = γ(δ(xR(i)))
- where γ again performs dimension mapping.
- Multiple forms for the relation function δ are explored:
- Star-product: δ(xR(i)) = [φ(xi)ᵀψ(xj)]_{∀j∈R(i)}
- Clique-product: δ(xR(i)) = [φ(xj)ᵀψ(xk)]_{∀j,k∈R(i)}
- Concatenation: δ(xR(i)) = [φ(xi), [ψ(xj)]_{∀j∈R(i)}]
- where φ and ψ are trainable transformations. (A code sketch of the patchwise operator is given after this list.)
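Under the same simplifying assumptions as the pairwise sketch above (unfold-based footprint gathering, 1×1 convolutions for the mappings, no normalization), a sketch of the patchwise operator with the concatenation relation could look like this; the bottleneck width inside γ is an arbitrary choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchwiseSelfAttention2d(nn.Module):
    """Sketch of patchwise self-attention over a k x k footprint R(i).

    Assumption: the concatenation relation
    delta(x_R(i)) = [phi(x_i), [psi(x_j) for all j in R(i)]] is used, and
    gamma maps it to one attention vector per footprint position j.
    """
    def __init__(self, channels, footprint=7):
        super().__init__()
        self.k = footprint
        self.kk = footprint * footprint
        self.phi = nn.Conv2d(channels, channels, 1)
        self.psi = nn.Conv2d(channels, channels, 1)
        self.beta = nn.Conv2d(channels, channels, 1)
        # gamma: relation vector of the whole patch -> weights for every j in R(i).
        # The bottleneck only keeps this sketch lightweight; widths are arbitrary.
        self.gamma = nn.Sequential(
            nn.Conv2d(channels * (self.kk + 1), channels, 1),
            nn.ReLU(),
            nn.Conv2d(channels, channels * self.kk, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        pad = self.k // 2
        phi_x = self.phi(x)                                            # (b, c, h, w)
        psi_patch = F.unfold(self.psi(x), self.k, padding=pad).view(b, c * self.kk, h, w)
        beta_patch = F.unfold(self.beta(x), self.k, padding=pad).view(b, c, self.kk, h, w)
        # delta(x_R(i)): phi(x_i) concatenated with psi(x_j) for all j in R(i).
        delta = torch.cat([phi_x, psi_patch], dim=1)                   # (b, c*(kk+1), h, w)
        # alpha(x_R(i))_j: attention vector for footprint position j.
        alpha = self.gamma(delta).view(b, c, self.kk, h, w)
        # y_i = sum over j of alpha(x_R(i))_j (Hadamard product) beta(x_j).
        return (alpha * beta_patch).sum(dim=2)                         # (b, c, h, w)
```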
3. Self-Attention Network (SAN) Variants
3.1. Self-Attention Block
- A residual block, which originated in ResNet, is used (a sketch of the block follows after this list).
- Within the main path, the input feature tensor (channel dimensionality C) is passed through two processing streams.
- The left stream evaluates the attention weights α by computing the function δ (via the mappings φ and ψ) and a subsequent mapping γ.
- The right stream applies a linear transformation β.
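A minimal sketch of such a block, reusing the PairwiseSelfAttention2d sketch from Section 1 as the aggregation step; the exact placement of normalization and activation layers and the channel reduction factor are assumptions rather than the paper's configuration:

```python
import torch.nn as nn

class SANBlock(nn.Module):
    """Sketch of a residual self-attention block (layer order is an assumption).

    One stream produces attention weights alpha (via phi/psi and gamma inside
    the aggregation module), the other applies the transformation beta; the
    aggregated result is expanded back to C channels and added to the input.
    """
    def __init__(self, channels, footprint=7, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.bn_in = nn.BatchNorm2d(channels)
        self.reduce = nn.Conv2d(channels, mid, 1)               # shrink before attention
        self.attend = PairwiseSelfAttention2d(mid, footprint)   # sketch from Section 1
        self.bn_out = nn.BatchNorm2d(mid)
        self.expand = nn.Conv2d(mid, channels, 1)               # back to C channels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.bn_in(x))
        y = self.attend(self.reduce(y))
        y = self.expand(self.relu(self.bn_out(y)))
        return x + y                                             # residual connection
```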
3.2. SAN Variants
- The backbone of SAN has five stages, yielding a resolution reduction factor of 32.
- Consecutive stages are bridged by transition layers that reduce spatial resolution and expand channel dimensionality. Each transition comprises a batch normalization layer, a ReLU, 2×2 max pooling with stride 2, and a linear mapping (see the sketch at the end of this subsection).
- The output of the last stage is processed by a classification layer that comprises global average pooling, a linear layer, and a softmax.
- The local footprint R(i) controls the amount of context gathered by a self-attention operator. The footprint size is set to 7×7 for the last four stages of SAN. The footprint is set to 3×3 in the first stage due to the high resolution of that stage and the consequent memory consumption.
- SAN10, SAN15, and SAN19 are designed, which are in rough correspondence with ResNet26, ResNet38, and ResNet50. The number X in SANX refers to the number of self-attention blocks.
- Regular convolution does not adapt to the content of the image: the same kernel weights are applied at every position. Scalar attention produces scalar weights that do not vary along the channel dimension.
- The proposed operators efficiently compute attention weights that adapt across both spatial dimensions and channels.
- All models are trained from scratch for 100 epochs with minibatch size 256 on 8 GPUs.
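Putting the pieces together, here is a toy sketch of the transition layer and a truncated backbone built from the block sketch above; the stage depths, channel widths, and class count are placeholders, not the actual SAN10/SAN15/SAN19 configurations:

```python
import torch.nn as nn

def transition(in_channels, out_channels):
    """Transition between stages: BN -> ReLU -> 2x2 max pool (stride 2) -> linear mapping."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),  # linear channel expansion
    )

def stage(channels, num_blocks, footprint=7):
    """A stage: a stack of self-attention blocks at fixed resolution."""
    return nn.Sequential(*[SANBlock(channels, footprint) for _ in range(num_blocks)])

# Hypothetical toy configuration (not a real SAN variant): two stages bridged
# by a transition, followed by the classification head described above.
toy_backbone = nn.Sequential(
    stage(64, num_blocks=2, footprint=3),   # first stage uses a 3x3 footprint
    transition(64, 256),
    stage(256, num_blocks=2, footprint=7),
    nn.AdaptiveAvgPool2d(1),                # global average pooling
    nn.Flatten(),
    nn.Linear(256, 1000),                   # linear classifier (softmax applied in the loss)
)
```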
4. Experimental Results
4.1. Comparison With ResNet
- The pairwise models match or outperform the convolutional baselines, with similar or lower parameter and FLOP budgets.
- The patchwise models perform even better.
4.2. Relation Function
- For pairwise self-attention, summation, subtraction, and Hadamard product achieve similar accuracy.
- For patchwise self-attention, concatenation achieves slightly higher accuracy than star-product and clique-product.
4.3. Mapping Function
- For pairwise models, using two linear layers yields the highest accuracy.
- For patchwise models, different settings yield similar accuracy.
4.4. Transformation Functions
- Setting φ=ψ=β with r1=r2=4 yields accuracy comparable to φ=ψ≠β, but at higher FLOP counts.
4.5. Footprint Size
- The accuracy initially increases with footprint size and then saturates.
4.6. Position Encoding
- Without position encoding, top-1 accuracy drops by 5 percentage points.
4.7. Zero-Shot Generalization to Rotated Images
- ImageNet images from the val-original set are rotated and flipped in one of four ways.
- Pairwise self-attention models are more robust to this kind of manipulation than convolutional networks.
4.8. Adversarial Attacks
- The self-attention models are much more robust to adversarial attacks than convolutional networks.
- Overall, self-attention operators are introduced that efficiently adapt weights across both spatial dimensions and channels.
Reference
[2020 CVPR] [SAN]
Exploring Self-attention for Image Recognition
1.1. Image Classification
2020 [SAN] … 2022 [ConvNeXt] [PVTv2]