Review — Coordinate Attention for Efficient Mobile Network Design

Coordinate Attention (CA), Better Than SENet

Sik-Ho Tsang
5 min read · Apr 21, 2023
Performance of different attention methods on three classic vision tasks.

Coordinate Attention for Efficient Mobile Network Design,
Coordinate Attention (CA), by National University of Singapore, and SEA AI Lab, 2021 CVPR, Over 1000 Citations (Sik-Ho Tsang @ Medium)


  • Squeeze-and-Excitation (SE) attention, as in SENet, squeezes the feature map into a global descriptor for channel attention, but it generally neglects positional information.
  • In this paper, coordinate attention (CA) is proposed, which factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. CA thus encodes a pair of direction-aware and position-sensitive attention maps.

Outline

  1. Revisit SE Block
  2. Coordinate Attention (CA)
  3. Results

1. Revisit SE Block

Block Diagram of (a) SENet, and (b) CBAM

1.1. (a) SENet

  • Given the input X, the squeeze step for the c-th channel is the global average pooling (GAP) process:
    $$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$
  • where z_c is the output associated with the c-th channel.
  • The second step, excitation, aims to fully capture channel-wise dependencies:
    $$\hat{X} = X \cdot \sigma(\hat{z})$$
  • where · is channel-wise multiplication, σ is the sigmoid function, and ẑ is the result generated by a transformation:
    $$\hat{z} = T_2(\mathrm{ReLU}(T_1(z)))$$
  • Here T1 and T2 are two learnable linear transformations, with ReLU in between, that capture the importance of each channel.
  • In SENet, T1 and T2 are FC layers that first reduce and then expand the channel dimension, a bottleneck similar to an autoencoder.
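The squeeze and excitation steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the SENet implementation: the function name `se_block`, the weight names `w1`/`w2`, the omission of biases, and the toy sizes are all assumptions made here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def se_block(x, w1, w2):
    """SE attention on one feature map x of shape (C, H, W).

    w1: (C // r, C) weights standing in for T1 (reduction).
    w2: (C, C // r) weights standing in for T2 (expansion).
    Biases are omitted for brevity (an assumption of this sketch).
    """
    z = x.mean(axis=(1, 2))                # squeeze: GAP over H and W -> (C,)
    z_hat = w2 @ np.maximum(w1 @ z, 0.0)   # T2(ReLU(T1(z))) -> (C,)
    s = sigmoid(z_hat)                     # per-channel weights in (0, 1)
    return x * s[:, None, None]            # channel-wise rescaling of X

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 5, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = se_block(x, w1, w2)
print(y.shape)  # (8, 4, 5)
```

Note that every position in a channel is scaled by the same number, which is why SE attention carries no positional information.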

1.2. (b) CBAM

  • CBAM is another enhanced attention module. Besides GAP, global max pooling (GMP) is also used to find the important features.
  • (Please feel free to read SENet and CBAM if interested.)
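As a rough sketch of how GMP can sit next to GAP, the snippet below follows the channel-attention form described in the CBAM paper, where both pooled descriptors pass through a shared two-layer MLP and are summed before the sigmoid. CBAM's spatial attention module is left out, and the names `cbam_channel_attention`, `w1`, `w2` and the toy sizes are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cbam_channel_attention(x, w1, w2):
    """Channel attention only, for x of shape (C, H, W).

    w1: (C // r, C), w2: (C, C // r) are the shared MLP weights.
    Biases are omitted (an assumption of this sketch).
    """
    def shared_mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)

    avg = x.mean(axis=(1, 2))  # GAP descriptor, shape (C,)
    mx = x.max(axis=(1, 2))    # GMP descriptor, shape (C,)
    s = sigmoid(shared_mlp(avg) + shared_mlp(mx))
    return x * s[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 5, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = cbam_channel_attention(x, w1, w2)
print(y.shape)  # (8, 4, 5)
```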

2. Coordinate Attention (CA)

The proposed coordinate attention (CA) block
  • Coordinate Attention (CA) encodes both channel relationships and long-range dependencies with precise positional information in two steps:
  1. Coordinate information embedding, and
  2. Coordinate attention generation.

2.1. Coordinate Information Embedding

  • To encourage attention blocks to capture long-range interactions spatially with precise positional information, the global pooling is factorized into a pair of 1D feature encoding operations.
  • Specifically, given the input X, two spatial extents of pooling kernels (H, 1) or (1,W) are used to encode each channel along the horizontal coordinate and the vertical coordinate, respectively.
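  • Thus, the output of the c-th channel at height h can be formulated as:
    $$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$
  • Similarly, the output of the c-th channel at width w can be written as:
    $$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$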

These two transformations also allow the attention block to capture long-range dependencies along one spatial direction and preserve precise positional information along the other spatial direction, which helps the networks more accurately locate the objects of interest.
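The two directional poolings are easy to check on a small array. Below is a minimal NumPy sketch; the function name `coordinate_embedding` and the toy input are assumptions made for illustration.

```python
import numpy as np

def coordinate_embedding(x):
    """x: (C, H, W). Returns z_h of shape (C, H) and z_w of shape (C, W)."""
    z_h = x.mean(axis=2)  # average over width: one value per (channel, row)
    z_w = x.mean(axis=1)  # average over height: one value per (channel, column)
    return z_h, z_w

# Toy input: 2 channels, H = 3, W = 4.
x = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
z_h, z_w = coordinate_embedding(x)
print(z_h.shape, z_w.shape)  # (2, 3) (2, 4)
print(z_h[0])                # [1.5 5.5 9.5]
print(z_w[0])                # [4. 5. 6. 7.]
```

Averaging `z_h` over its remaining axis gives back the SE-style global descriptor, so the factorized pooling keeps everything GAP keeps while retaining one spatial coordinate per output.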

2.2. Coordinate Attention Generation

  • The above two outputs are concatenated and then sent to a shared 1×1 convolutional transformation function F1:
    $$f = \delta(F_1([z^h, z^w]))$$
  • where [·, ·] denotes concatenation along the spatial dimension and δ is a non-linear activation function.
  • f has the size of C/r×(H+W), where r is the reduction ratio.
  • Then f is split along the spatial dimension into two separate tensors, f^h of size C/r×H and f^w of size C/r×W.
  • Another two 1×1 convolutional transformations, Fh and Fw, separately transform f^h and f^w into tensors with the same number of channels as the input X, yielding:
    $$g^h = \sigma(F_h(f^h)), \qquad g^w = \sigma(F_w(f^w))$$
  • The outputs g^h and g^w are then expanded and used as attention weights. Finally, the output of the coordinate attention block Y can be written as:
    $$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

Hence, each element in the two attention maps reflects whether the object of interest exists in the corresponding row and column. This encoding process allows the coordinate attention to more accurately locate the exact position of the object of interest and hence helps the whole model to recognize better.
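Putting the embedding and generation steps together, the whole block can be sketched in NumPy as below. This is a simplified illustration rather than the authors' implementation: a 1×1 convolution is written as a matrix multiply over channels, ReLU stands in for the non-linearity δ, biases and normalization are omitted, and the names `coordinate_attention`, `f1`, `fh`, `fw` and the toy sizes are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coordinate_attention(x, f1, fh, fw):
    """x: (C, H, W); f1: (C // r, C); fh, fw: (C, C // r).

    Each 1x1 convolution acts independently per position, so it is
    written as a plain matmul over the channel axis.
    """
    C, H, W = x.shape
    z_h = x.mean(axis=2)                    # (C, H): pooled along width
    z_w = x.mean(axis=1)                    # (C, W): pooled along height
    z = np.concatenate([z_h, z_w], axis=1)  # (C, H + W)
    f = np.maximum(f1 @ z, 0.0)             # shared transform F1, (C // r, H + W)
    f_h, f_w = f[:, :H], f[:, H:]           # split back into the two directions
    g_h = sigmoid(fh @ f_h)                 # (C, H) attention over rows
    g_w = sigmoid(fw @ f_w)                 # (C, W) attention over columns
    # Broadcast g_h over width and g_w over height, then rescale x.
    return x * g_h[:, :, None] * g_w[:, None, :]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 5, 4
x = rng.standard_normal((C, H, W))
f1 = rng.standard_normal((C // r, C))
fh = rng.standard_normal((C, C // r))
fw = rng.standard_normal((C, C // r))
y = coordinate_attention(x, f1, fh, fw)
print(y.shape)  # (8, 4, 5)
```

Unlike the SE case, the scale applied at position (i, j) depends on both the row i and the column j, which is where the positional sensitivity comes from.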

2.3. Plugin

Network implementation for different network architectures. (a) Inverted residual block proposed in MobileNetV2; (b) Sandglass bottleneck block proposed in MobileNeXt.
  • The proposed attention blocks can be easily plugged into the inverted residual block in MobileNetV2 and the sandglass block in MobileNeXt.

3. Results

3.1. Ablation Study

Result comparisons under different experiment settings of the proposed coordinate attention.
  • The model with attention along either direction has comparable performance with SENet.

When both the horizontal attention and the vertical attention are incorporated, the best result is obtained.

Results with different weight multipliers, taking MobileNetV2 (Left) and MobileNeXt (Right) as baselines.
  • For MobileNetV2, three typical weight multipliers, including {1.0, 0.75, 0.5}, are used.

The models with the proposed coordinate attention yield the best results under each setting. Similar results are observed in MobileNeXt.

Different reduction ratios r.

When the reduction ratio r is halved, the model size increases but performance improves.
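To see why a smaller r makes the model larger, one can count the weights in the three 1×1 convolutions of a CA block. The sketch below ignores biases and normalization layers, and the channel count of 256 and the ratios 32 and 16 are illustrative values chosen here, not settings quoted from the tables. The function names are hypothetical.

```python
def ca_weight_count(channels, r):
    """Weights in F1, Fh and Fw of one CA block (biases and normalization ignored)."""
    mid = channels // r
    return channels * mid + 2 * mid * channels

def se_weight_count(channels, r):
    """Weights in the two FC layers of one SE block (biases ignored)."""
    mid = channels // r
    return 2 * channels * mid

for r in (32, 16):
    print(r, ca_weight_count(256, r), se_weight_count(256, r))
# 32 6144 4096
# 16 12288 8192
```

Under these assumptions, halving r doubles the attention weights per block, and CA costs about 1.5 times the weights of SE at the same r because of its extra output transform.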

3.2. SOTA Comparison

Visualization of feature maps produced by models with different attention methods in the last building block using Grad-CAM.

The proposed coordinate attention locates the objects of interest better than the SE attention (SENet) and CBAM.

Experimental results when taking the powerful EfficientNet-b0 as baseline.

Compared to the original EfficientNet-b0 with SE attention (SENet) included and other methods that have comparable parameters and computations to EfficientNet-b0, the EfficientNet-b0 using the coordinate attention achieves the best result.

3.3. Object Detection

Object detection results on the COCO validation set (Left), and Pascal VOC 2007 test set (Right).

The SSDLite detector with the MobileNetV2+CA backbone achieves the best AP among approaches with similar parameters and computations.

3.4. Semantic Segmentation

Semantic segmentation results on the Pascal VOC 2012 validation set (Left), and Cityscapes (Right).

Left: DeepLabv3 equipped with the coordinate attention performs much better than the vanilla MobileNetV2 and other attention methods.

Right: Coordinate attention can improve the segmentation results by a large margin with comparable number of learnable parameters.
