# Review — Coordinate Attention for Efficient Mobile Network Design

## Coordinate Attention (CA), Better Than SENet

Coordinate Attention for Efficient Mobile Network Design, Coordinate Attention (CA), by National University of Singapore, and SEA AI Lab, 2021 CVPR, Over 1000 Citations (Sik-Ho Tsang @ Medium)



- **Squeeze-and-Excitation attention**, in SENet, squeezes the information into a global descriptor for channel attention, but it generally **neglects the positional information**.
- In this paper, **coordinate attention (CA)** is proposed, which factorizes channel attention into **two 1D feature encoding** processes that **aggregate features along the two spatial directions**, respectively, so that CA **encodes a pair of direction-aware and position-sensitive attention maps** for attention.

# Outline

1. **Revisit SE Block**
2. **Coordinate Attention (CA)**
3. **Results**

# 1. Revisit SE Block

## 1.1. (a) SENet

- Given the **input** *X*, the **squeeze** step for the *c*-th channel is a **global average pooling (GAP)** process:

  z_c = (1/(H×W)) ∑_i ∑_j x_c(i, j)

- where *z_c* is the **output** associated with the *c*-th channel.
- The second step, **excitation**, aims to fully capture channel-wise dependencies:

  X̂ = X · σ(ẑ)

- where · is channel-wise multiplication, *σ* is the **sigmoid** function, and *ẑ* is the result of the **transformation**:

  ẑ = T₂(ReLU(T₁(z)))

- *T*₁ and *T*₂ are **two linear transformations** that can be learned to capture the importance of each channel, with ReLU used in between.
- In SENet, *T*₁ and *T*₂ are **FC layers with reduction and expansion** respectively, which is a concept similar to an autoencoder.
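The SE block above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the FC layers `W1`/`W2` are plain weight matrices, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(X, W1, W2):
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    W1: (C//r, C) reduction FC; W2: (C, C//r) expansion FC.
    """
    # Squeeze: global average pooling -> one descriptor z_c per channel
    z = X.mean(axis=(1, 2))                      # (C,)
    # Excitation: z_hat = T2(ReLU(T1(z))), then a sigmoid gate
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))    # (C,)
    # Channel-wise reweighting: X_hat = X * sigma(z_hat)
    return X * s[:, None, None]

# Toy usage with C=8 channels and reduction ratio r=2
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6, 6))
W1 = rng.standard_normal((4, 8)) * 0.1
W2 = rng.standard_normal((8, 4)) * 0.1
Y = se_block(X, W1, W2)
```

Note that the whole spatial map collapses into a single scalar per channel, which is exactly where the positional information is lost.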

## 1.2. (b) CBAM

- **CBAM** tries to exploit positional information by applying convolutions on channel-pooled feature maps, but **convolutions can only capture local relations** and fail to model the long-range dependencies that are essential for vision tasks.

# 2. Coordinate Attention (CA)

**Coordinate Attention (CA)** encodes both **channel relationships** and **long-range dependencies** with precise positional information in **two steps**: **coordinate information embedding** and **coordinate attention generation**.

## 2.1. Coordinate Information Embedding

- To encourage attention blocks to **capture long-range interactions spatially** with **precise positional information**, the **global pooling is factorized into a pair of 1D feature encoding operations.**
- Specifically, given the input *X*, **two spatial extents of pooling kernels, (*H*, 1) and (1, *W*)**, are used to **encode each channel along the horizontal coordinate and the vertical coordinate**, respectively.
- Thus, the output of the *c*-th channel at **height** *h* can be formulated as:

  z^h_c(h) = (1/W) ∑_{0≤i<W} x_c(h, i)

- Similarly, the output of the *c*-th channel at **width** *w* can be written as:

  z^w_c(w) = (1/H) ∑_{0≤j<H} x_c(j, w)

These two transformations also allow the attention block to capture long-range dependencies along one spatial direction and preserve precise positional information along the other spatial direction, which helps the networks more accurately locate the objects of interest.
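The factorized pooling step can be sketched as below (a NumPy illustration on a `(C, H, W)` map; the function name is mine, not the paper's):

```python
import numpy as np

def coordinate_pooling(X):
    """Factorize global pooling into two 1D encodings on a (C, H, W) map.

    Returns z_h of shape (C, H) -- the average over the width at each
    height -- and z_w of shape (C, W) -- the average over the height at
    each width. Each keeps positions along one axis intact.
    """
    z_h = X.mean(axis=2)   # (C, H): preserves vertical positions
    z_w = X.mean(axis=1)   # (C, W): preserves horizontal positions
    return z_h, z_w

# Toy check: the 1D descriptors average back to the ordinary GAP descriptor
X = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
z_h, z_w = coordinate_pooling(X)
```

Unlike GAP, which returns a single scalar per channel, each channel here keeps an H-length and a W-length profile, so the row/column of a salient response is still recoverable.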

## 2.2. Coordinate Attention Generation

- The above 2 outputs are **concatenated** and then sent to a shared **1×1 convolutional transformation function** *F*₁:

  f = δ(F₁([z^h, z^w]))

- where *δ* is a **non-linear activation** function. *f* has the **size of** *C*/*r* × (*H*+*W*), where *r* is the **reduction ratio**.
- Then *f* is **split** along the spatial dimension into **two separate tensors**: *f^h* of **size** *C*/*r* × *H* and *f^w* of **size** *C*/*r* × *W*.
- Another **two 1×1 conv**olutional transformations *F_h* and *F_w* are utilized to **separately transform** *f^h* and *f^w* into tensors with the same channel number as the input *X*, yielding:

  g^h = σ(F_h(f^h)),  g^w = σ(F_w(f^w))

- The outputs *g^h* and *g^w* are then **expanded** and used as attention weights, respectively. Finally, the **output** *Y* of the **coordinate attention block** can be written as:

  y_c(i, j) = x_c(i, j) × g^h_c(i) × g^w_c(j)

Hence, each element in the two attention maps reflects whether the object of interest exists in the corresponding row and column. This encoding process allows the coordinate attention to more accurately locate the exact position of the object of interest and hence helps the whole model to recognize better.
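Putting both steps together, a minimal NumPy sketch of the whole block follows. Assumptions worth flagging: the 1×1 convolutions F₁, F_h, F_w are represented as plain weight matrices, δ is taken to be ReLU (the paper uses a hard-swish-style non-linearity), and batch normalization and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(X, F1, Fh, Fw):
    """Coordinate attention on a (C, H, W) feature map.

    F1: (C//r, C) shared transform; Fh, Fw: (C, C//r) per-direction
    transforms (1x1 convs written as matrix multiplications).
    """
    C, H, W = X.shape
    # Step 1 -- coordinate information embedding: two 1D poolings
    z_h = X.mean(axis=2)                                   # (C, H)
    z_w = X.mean(axis=1)                                   # (C, W)
    # Step 2 -- coordinate attention generation
    f = np.maximum(F1 @ np.concatenate([z_h, z_w], axis=1), 0.0)  # (C//r, H+W)
    f_h, f_w = f[:, :H], f[:, H:]                          # split along spatial dim
    g_h = sigmoid(Fh @ f_h)                                # (C, H) height attention
    g_w = sigmoid(Fw @ f_w)                                # (C, W) width attention
    # y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j)
    return X * g_h[:, :, None] * g_w[:, None, :]

# Toy usage with C=8, r=2, H=5, W=7
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5, 7))
F1 = rng.standard_normal((4, 8)) * 0.1
Fh = rng.standard_normal((8, 4)) * 0.1
Fw = rng.standard_normal((8, 4)) * 0.1
Y = coordinate_attention(X, F1, Fh, Fw)
```

Since both gates lie in (0, 1), every activation is attenuated by a row weight and a column weight, which is how the block expresses "is the object in this row and this column" per channel.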

## 2.3. Plugin

- The proposed attention blocks can be **easily plugged** into the **inverted residual block in MobileNetV2** and the **sandglass block in MobileNeXt**.

# 3. Results

## 3.1. Ablation Study

- The model with attention along either direction alone has comparable performance to SENet.

- When both the **horizontal attention** and the **vertical attention** are incorporated, the **best** result is obtained.

- For **MobileNetV2**, **three typical weight multipliers**, {**1.0, 0.75, 0.5**}, are used.

- The models with the **proposed coordinate attention** yield the **best** results under each setting. **Similar results** are observed in **MobileNeXt**.

- When *r* is reduced to half of the original size, the model size increases, but **better performance** can be yielded.

## 3.2. SOTA Comparison

The **proposed coordinate attention** helps **locate the objects of interest better** than the SE attention (SENet) and CBAM.

Compared to the original **EfficientNet-b0 with SE attention (SENet)** included, and other methods that have comparable parameters and computations to EfficientNet-b0, the **EfficientNet-b0 using the coordinate attention** achieves the **best** result.

## 3.4. Object Detection

The proposed detection model using SSDLite, **MobileNetV2 + CA**, achieves the **best** results in terms of AP compared to other approaches with close parameters and computations.

## 3.5. Semantic Segmentation

**Left:** **DeepLabv3** equipped with the **coordinate attention** performs much **better** than the vanilla **MobileNetV2** and other attention methods.

**Right:** **Coordinate attention** can **improve** the segmentation results by a **large margin** with a comparable number of learnable parameters.