# Review — Coordinate Attention for Efficient Mobile Network Design

## Coordinate Attention (CA), Better Than SENet

Coordinate Attention for Efficient Mobile Network Design, Coordinate Attention (CA), by National University of Singapore, and SEA AI Lab, 2021 CVPR, Over 1000 Citations (Sik-Ho Tsang @ Medium)



- **Squeeze-and-Excitation attention**, in SENet, squeezes the information into a global descriptor for channel attention, but it generally **neglects the positional information**.
- In this paper, **coordinate attention (CA)** is proposed, which factorizes channel attention into **two 1D feature encoding** processes that **aggregate features along the two spatial directions**, respectively, so that CA **encodes a pair of direction-aware and position-sensitive attention maps** for attention.

# Outline

1. **Revisit SE Block**
2. **Coordinate Attention (CA)**
3. **Results**

# 1. Revisit SE Block

## 1.1. (a) SENet

- Given the **input** *X*, the **squeeze** step for the *c*-th channel is a **global average pooling (GAP)** process:

  z_c = (1/(H×W)) ∑_i ∑_j x_c(i, j)

- where *z_c* is the **output** associated with the *c*-th channel.
- The second step, **excitation**, aims to fully capture channel-wise dependencies:

  X̂ = X · σ(ẑ)

- where · is channel-wise multiplication, *σ* is the **sigmoid** function, and *ẑ* is the result of the **transformation**:

  ẑ = T₂(ReLU(T₁(z)))

- *T*₁ and *T*₂ are **two linear transformations** that can be learned to capture the importance of each channel, with ReLU used in between.
- In SENet, *T*₁ and *T*₂ are **FC layers with reduction and expansion** respectively, which is a concept similar to an autoencoder.
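The SE block above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the FC layers `W1`/`W2` are plain weight matrices, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(X, W1, W2):
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    W1: (C//r, C) reduction FC; W2: (C, C//r) expansion FC.
    """
    # Squeeze: global average pooling -> one descriptor z_c per channel
    z = X.mean(axis=(1, 2))                      # (C,)
    # Excitation: z_hat = T2(ReLU(T1(z))), then a sigmoid gate
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))    # (C,)
    # Channel-wise reweighting: X_hat = X * sigma(z_hat)
    return X * s[:, None, None]

# Toy usage with C=8 channels and reduction ratio r=2
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6, 6))
W1 = rng.standard_normal((4, 8)) * 0.1
W2 = rng.standard_normal((8, 4)) * 0.1
Y = se_block(X, W1, W2)
```

Note that the whole spatial map collapses into a single scalar per channel, which is exactly where the positional information is lost.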

## 1.2. (b) CBAM

- **CBAM** tries to exploit positional information by applying convolutions on channel-pooled feature maps, but **convolutions can only capture local relations** and fail to model the long-range dependencies that are essential for vision tasks.

# 2. Coordinate Attention (CA)

**Coordinate Attention (CA)** encodes both **channel relationships** and **long-range dependencies** with precise positional information in **two steps**: **coordinate information embedding** and **coordinate attention generation**.

## 2.1. Coordinate Information Embedding

- To encourage attention blocks to **capture long-range interactions spatially** with **precise positional information**, the **global pooling is factorized into a pair of 1D feature encoding operations.**
- Specifically, given the input *X*, **two spatial extents of pooling kernels, (*H*, 1) and (1, *W*)**, are used to **encode each channel along the horizontal coordinate and the vertical coordinate**, respectively.
- Thus, the output of the *c*-th channel at **height** *h* can be formulated as:

  z^h_c(h) = (1/W) ∑_{0≤i<W} x_c(h, i)

- Similarly, the output of the *c*-th channel at **width** *w* can be written as:

  z^w_c(w) = (1/H) ∑_{0≤j<H} x_c(j, w)

These two transformations also allow the attention block to capture long-range dependencies along one spatial direction and preserve precise positional information along the other spatial direction, which helps the networks more accurately locate the objects of interest.
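The factorized pooling step can be sketched as below (a NumPy illustration on a `(C, H, W)` map; the function name is mine, not the paper's):

```python
import numpy as np

def coordinate_pooling(X):
    """Factorize global pooling into two 1D encodings on a (C, H, W) map.

    Returns z_h of shape (C, H) -- the average over the width at each
    height -- and z_w of shape (C, W) -- the average over the height at
    each width. Each keeps positions along one axis intact.
    """
    z_h = X.mean(axis=2)   # (C, H): preserves vertical positions
    z_w = X.mean(axis=1)   # (C, W): preserves horizontal positions
    return z_h, z_w

# Toy check: the 1D descriptors average back to the ordinary GAP descriptor
X = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
z_h, z_w = coordinate_pooling(X)
```

Unlike GAP, which returns a single scalar per channel, each channel here keeps an H-length and a W-length profile, so the row/column of a salient response is still recoverable.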

## 2.2. Coordinate Attention Generation

- The above 2 outputs are **concatenated** and then sent to a shared **1×1 convolutional transformation function** *F*₁:

  f = δ(F₁([z^h, z^w]))

- where *δ* is a **non-linear activation** function. *f* has the **size of** *C*/*r* × (*H*+*W*), where *r* is the **reduction ratio**.
- Then *f* is **split** along the spatial dimension into **two separate tensors**: *f^h* of **size** *C*/*r* × *H* and *f^w* of **size** *C*/*r* × *W*.
- Another **two 1×1 conv**olutional transformations *F_h* and *F_w* are utilized to **separately transform** *f^h* and *f^w* into tensors with the same channel number as the input *X*, yielding:

  g^h = σ(F_h(f^h)),  g^w = σ(F_w(f^w))

- The outputs *g^h* and *g^w* are then **expanded** and used as attention weights, respectively. Finally, the **output** *Y* of the **coordinate attention block** can be written as:

  y_c(i, j) = x_c(i, j) × g^h_c(i) × g^w_c(j)

Hence, each element in the two attention maps reflects whether the object of interest exists in the corresponding row and column. This encoding process allows the coordinate attention to more accurately locate the exact position of the object of interest and hence helps the whole model to recognize better.
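Putting both steps together, a minimal NumPy sketch of the whole block follows. Assumptions worth flagging: the 1×1 convolutions F₁, F_h, F_w are represented as plain weight matrices, δ is taken to be ReLU (the paper uses a hard-swish-style non-linearity), and batch normalization and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(X, F1, Fh, Fw):
    """Coordinate attention on a (C, H, W) feature map.

    F1: (C//r, C) shared transform; Fh, Fw: (C, C//r) per-direction
    transforms (1x1 convs written as matrix multiplications).
    """
    C, H, W = X.shape
    # Step 1 -- coordinate information embedding: two 1D poolings
    z_h = X.mean(axis=2)                                   # (C, H)
    z_w = X.mean(axis=1)                                   # (C, W)
    # Step 2 -- coordinate attention generation
    f = np.maximum(F1 @ np.concatenate([z_h, z_w], axis=1), 0.0)  # (C//r, H+W)
    f_h, f_w = f[:, :H], f[:, H:]                          # split along spatial dim
    g_h = sigmoid(Fh @ f_h)                                # (C, H) height attention
    g_w = sigmoid(Fw @ f_w)                                # (C, W) width attention
    # y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j)
    return X * g_h[:, :, None] * g_w[:, None, :]

# Toy usage with C=8, r=2, H=5, W=7
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5, 7))
F1 = rng.standard_normal((4, 8)) * 0.1
Fh = rng.standard_normal((8, 4)) * 0.1
Fw = rng.standard_normal((8, 4)) * 0.1
Y = coordinate_attention(X, F1, Fh, Fw)
```

Since both gates lie in (0, 1), every activation is attenuated by a row weight and a column weight, which is how the block expresses "is the object in this row and this column" per channel.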

## 2.3. Plugin

- The proposed attention blocks can be **easily plugged** into the **inverted residual block in MobileNetV2** and the **sandglass block in MobileNeXt**.

# 3. Results

## 3.1. Ablation Study

- The model with attention along either direction alone has comparable performance to SENet.

- When both the **horizontal attention** and the **vertical attention** are incorporated, the **best** result is obtained.

- For **MobileNetV2**, **three typical weight multipliers**, {**1.0, 0.75, 0.5**}, are used.

- The models with the **proposed coordinate attention** yield the **best** results under each setting. **Similar results** are observed in **MobileNeXt**.

- When *r* is reduced to half of the original size, the model size increases, but **better performance** can be yielded.

## 3.2. SOTA Comparison

The **proposed coordinate attention** helps **locate the objects of interest better** than the SE attention (SENet) and CBAM.

Compared to the original **EfficientNet-b0 with SE attention (SENet)** included, and other methods that have comparable parameters and computations to EfficientNet-b0, the **EfficientNet-b0 using the coordinate attention** achieves the **best** result.

## 3.4. Object Detection

The proposed detection model using SSDLite, **MobileNetV2 + CA**, achieves the **best** results in terms of AP compared to other approaches with close parameters and computations.

## 3.5. Semantic Segmentation

**Left:** **DeepLabv3** equipped with the **coordinate attention** performs much **better** than the vanilla **MobileNetV2** and other attention methods.

**Right:** **Coordinate attention** can **improve** the segmentation results by a **large margin** with a comparable number of learnable parameters.