# Review — SAN: Exploring Self-attention for Image Recognition

## SAN, Self-Attention Networks

---

**Exploring Self-attention for Image Recognition**, by CUHK and Intel Labs

SAN, 2020 CVPR, over 300 citations (Sik-Ho Tsang @ Medium)

Image Classification, Self-Attention

**Two forms** of self-attention are considered:

- One is **pairwise self-attention**, which generalizes standard dot-product attention and is fundamentally a set operator.
- The other is **patchwise self-attention**, which is strictly more powerful than convolution.

# Outline

1. **Pairwise Self-Attention**
2. **Patchwise Self-Attention**
3. **Self-Attention Network (SAN) Variants**
4. **Experimental Results**

# 1. Pairwise Self-Attention

## 1.1. Definitions

- The **pairwise self-attention** has the following form:

  *yi* = Σ_{*j* ∈ *R*(*i*)} *α*(*xi*, *xj*) ⊙ *β*(*xj*)

- where **⊙** is the **Hadamard product (i.e. element-wise multiplication)**, *i* is the spatial index of feature vector *xi* (i.e., its location in the feature map), and *R*(*i*) is the local footprint of the aggregation.
- And *α* is decomposed as follows:

  *α*(*xi*, *xj*) = *γ*(*δ*(*xi*, *xj*))

- where the relation function *δ* **outputs a single vector that represents the features** *xi* and *xj*. The function *γ* (dimension mapping) then **maps** this vector into a vector that can be combined **with** *β*(*xj*).
- **Multiple forms for the relation function** *δ* are explored, e.g. summation *φ*(*xi*) + *ψ*(*xj*), subtraction *φ*(*xi*) − *ψ*(*xj*), concatenation [*φ*(*xi*), *ψ*(*xj*)], Hadamard product *φ*(*xi*) ⊙ *ψ*(*xj*), and dot product *φ*(*xi*)ᵀ*ψ*(*xj*).
- *φ* and *ψ* are **trainable transformations** such as linear mappings, and have matching output dimensionality.
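As a rough illustration (not the authors' implementation), the pairwise operator with the subtraction relation can be sketched in NumPy over a 1-D feature map; using a softmax as a stand-in for the mapping *γ*, and a sliding window as the footprint *R*(*i*), are simplifying assumptions:

```python
import numpy as np

def pairwise_self_attention(x, W_phi, W_psi, W_beta, footprint=3):
    """Toy pairwise self-attention over a 1-D feature map.

    x: (N, C) feature vectors; footprint: odd local window size.
    Relation: subtraction, delta(x_i, x_j) = phi(x_i) - psi(x_j).
    A softmax over the footprint stands in for gamma (simplification).
    """
    N, C = x.shape
    half = footprint // 2
    phi, psi, beta = x @ W_phi, x @ W_psi, x @ W_beta
    y = np.zeros_like(beta)
    for i in range(N):
        js = list(range(max(0, i - half), min(N, i + half + 1)))
        # relation function: one vector per neighbor j in R(i)
        delta = np.stack([phi[i] - psi[j] for j in js])    # (|R(i)|, C')
        alpha = np.exp(delta) / np.exp(delta).sum(axis=0)  # softmax over footprint
        # aggregation: Hadamard product of weights and transformed features
        y[i] = sum(a * beta[j] for a, j in zip(alpha, js))
    return y
```

With identity transformations, each output vector is a per-channel convex combination of its neighbors' features, which makes the set-operator character of the pairwise form easy to see.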

## 1.2. Position Encoding

- **The horizontal and vertical coordinates** along the feature map are **first normalized to the range [-1, 1]** in each dimension. These normalized two-dimensional coordinates are **then passed through a trainable linear layer**, which can map them to an appropriate range for each layer in the network.
- For each pair (*i*, *j*) such that *j* ∈ *R*(*i*), **the relative position information is encoded by calculating the difference** *pi* − *pj*. The output of *δ*(*xi*, *xj*) is augmented by **concatenating** [*pi* − *pj*] prior to the mapping *γ*.
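The normalization and the pairwise difference can be sketched as follows (a hypothetical helper; `W_pos` stands in for the trainable linear layer):

```python
import numpy as np

def relative_position_features(h, w, W_pos):
    """Toy sketch of SAN-style position encoding (illustrative names).

    Coordinates are normalized to [-1, 1] per dimension, mapped by a
    linear layer W_pos (a plain matrix here), and the relative encoding
    for a pair (i, j) is the difference p_i - p_j.
    """
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    coords = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (h, w, 2)
    p = coords.reshape(-1, 2) @ W_pos                               # (h*w, d)
    # relative position for every ordered pair (i, j): p_i - p_j
    return p[:, None, :] - p[None, :, :]                            # (h*w, h*w, d)
```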

# 2. Patchwise Self-Attention

- **Patchwise self-attention** has the following form:

  *yi* = Σ_{*j* ∈ *R*(*i*)} *α*(*xR*(*i*))_*j* ⊙ *β*(*xj*)

- where *xR*(*i*) is a **patch of feature vectors in the footprint** *R*(*i*). *α*(*xR*(*i*)) is a **tensor of the same spatial dimensionality as the patch** *xR*(*i*). *α*(*xR*(*i*))_*j* is the **vector at location** *j* in this tensor, **corresponding spatially to the vector** *xj* in *xR*(*i*).
- Unlike pairwise self-attention, patchwise self-attention is no longer a set operation with respect to the features *xj*.
- *α* is decomposed as follows:

  *α*(*xR*(*i*)) = *γ*(*δ*(*xR*(*i*)))

- where *γ* is for dimension mapping. **Multiple forms for the relation function** *δ* are explored, e.g. star-product [*φ*(*xi*)ᵀ*ψ*(*xj*)]_{∀*j* ∈ *R*(*i*)}, clique-product [*φ*(*xj*)ᵀ*ψ*(*xk*)]_{∀*j*,*k* ∈ *R*(*i*)}, and concatenation [*φ*(*xi*), [*ψ*(*xj*)]_{∀*j* ∈ *R*(*i*)}].
- *φ* and *ψ* are **trainable transformations**.
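A minimal 1-D sketch of the patchwise form with the concatenation relation, assuming a plain matrix `W_gamma` as the mapping *γ* (illustrative names, not the paper's code):

```python
import numpy as np

def patchwise_self_attention(x, W_phi, W_psi, W_beta, W_gamma, footprint=3):
    """Toy 1-D patchwise self-attention, concatenation relation.

    delta(x_R(i)) = [phi(x_i), psi(x_j) for j in R(i)] is built from the
    whole patch, and gamma (here the plain matrix W_gamma) maps it to one
    weight vector per patch location, so the weights depend on the
    arrangement of the patch. Interior positions only, to stay short.
    """
    N, C = x.shape
    half = footprint // 2
    phi, psi, beta = x @ W_phi, x @ W_psi, x @ W_beta
    Cp = phi.shape[1]
    y = np.zeros((N - 2 * half, Cp))
    for i in range(half, N - half):
        js = list(range(i - half, i + half + 1))
        # concatenation relation over the whole footprint
        delta = np.concatenate([phi[i]] + [psi[j] for j in js])  # ((1+F)*C',)
        alpha = (delta @ W_gamma).reshape(footprint, Cp)         # one weight per location
        y[i - half] = sum(alpha[k] * beta[j] for k, j in enumerate(js))
    return y
```

Because *δ* is built from the whole patch, permuting the neighbors changes the attention weights, which is exactly why the operator is no longer a set operation.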

# 3. Self-Attention Network (SAN) Variants

## 3.1. Self-Attention Block

- The residual block, which originated in ResNet, is used.
- Within the main path, the input feature tensor (channel dimensionality *C*) is passed through two processing streams.
- The **left stream** evaluates the **attention weights** *α* by computing the **function** *δ* (via the **mappings** *φ* and *ψ*) and a subsequent **mapping** *γ*.
- The **right stream** applies a **linear transformation** *β*.

## 3.2. SAN Variants

- The backbone of SAN has **five stages**, yielding a resolution reduction factor of 32. **Consecutive stages** are **bridged by transition layers** that **reduce spatial resolution** and **expand channel dimensionality**. The transition comprises a batch normalization layer, a ReLU, 2×2 max pooling with stride 2, and a linear mapping.
- The output of the last stage is processed by a classification layer that comprises **global average pooling**, a **linear layer**, and a **softmax**.
- The **local footprint** *R*(*i*) controls the amount of context gathered by a self-attention operator. The footprint size is set to **7×7** for the **last four stages** of SAN, and to **3×3** in the **first stage** due to that stage's high resolution and the consequent memory consumption.
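The transition described above can be sketched in NumPy (inference-style per-channel standardization stands in for batch normalization; the function name is illustrative):

```python
import numpy as np

def transition(x, W):
    """Sketch of a SAN transition layer: batch norm (approximated by
    per-channel standardization), ReLU, 2x2 max pooling with stride 2,
    then a linear mapping W that expands the channel dimensionality.

    x: (H, W, C) feature map with even H and W; W: (C, C_out), C_out > C.
    """
    # per-channel standardization stands in for batch normalization
    x = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-5)
    x = np.maximum(x, 0.0)                                   # ReLU
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))  # 2x2 max pool, stride 2
    return x @ W                                             # channel expansion
```

Applied between stages, this halves each spatial dimension while widening the channels, matching the factor-of-32 reduction over five stages.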

**SAN10**, **SAN15**, and **SAN19** are designed, in rough correspondence with **ResNet26**, **ResNet38**, and **ResNet50**. The number X in SANX refers to the number of self-attention blocks.

- The **convolution** does **not adapt to the content** of the image. **Scalar attention** produces **scalar weights** that do not vary along the channel dimension.

The proposed operators efficiently compute attention weights that **adapt across both spatial dimensions and channels**.

- All models are **trained from scratch** for **100 epochs** with minibatch size 256 on 8 GPUs.

# 4. Experimental Results

## 4.1. Comparison With ResNet

- The **pairwise** models **match or outperform** the convolutional baselines, with similar or lower parameter and FLOP budgets.

The **patchwise** models perform **even better**.

## 4.2. Relation Function

For **pairwise** self-attention, summation, subtraction, and Hadamard product achieve **similar accuracy**.

For **patchwise** self-attention, **concatenation** achieves **slightly higher accuracy** than star-product and clique-product.

## 4.3. Mapping Function

For **pairwise** models, using **two linear layers** yields the **highest accuracy**.

For **patchwise** models, **different settings** yield **similar accuracy**.

## 4.4. Transformation Functions

Setting *φ* = *ψ* = *β*, with r1 = r2 = 4, yields **comparable accuracy** to *φ* = *ψ* ≠ *β*, but at **higher FLOP counts**.

## 4.5. Footprint Size

The **accuracy initially increases** with footprint size and **then saturates**.

## 4.6. Position Encoding

Without position encoding, top-1 **accuracy drops** by 5 percentage points.

## 4.7. Zero-Shot Generalization to Rotated Images

- ImageNet images from the val-original set are **rotated and flipped** in one of four ways.

Pairwise self-attention models are **more robust** to this kind of manipulation than convolutional networks.

## 4.8. Adversarial Attacks

The self-attention models are **much more robust** than convolutional networks.

In summary, self-attention operators are introduced that efficiently adapt attention weights across both spatial dimensions and channels.

## Reference

[2020 CVPR] [SAN] Exploring Self-attention for Image Recognition
