Reading: CBAM — Convolutional Block Attention Module (Image Classification)

CBAM Outperforms SENet on Top of ResNet, WRN, ResNeXt, and MobileNetV1

Sik-Ho Tsang

Oct 18, 2020

In this story, “CBAM: Convolutional Block Attention Module” (CBAM), is presented. In this paper:

Given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions, channel and spatial; the attention maps are then multiplied with the input feature map for adaptive feature refinement.
CBAM can be integrated into any CNN architecture seamlessly with negligible overhead and is end-to-end trainable along with the base CNN.
It can be seen as an extension of BAM in 2018 BMVC.

This is a paper in 2018 ECCV with over 1000 citations. (Sik-Ho Tsang @ Medium)

Outline

CBAM: General Architecture
Channel Attention Module
Spatial Attention Module
Ablation Study on ImageNet
SOTA Comparison

1. CBAM: General Architecture

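CBAM sequentially infers a 1D channel attention map Mc with the size of C×1×1 and a 2D spatial attention map Ms with the size of 1×H×W:

F’ = Mc(F) ⨂ F
F’’ = Ms(F’) ⨂ F’

where ⨂ denotes element-wise multiplication (with the attention values broadcast along the remaining dimensions), F is the input feature map of size C×H×W, and F’’ is the final refined output.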
The two modules can be placed in a parallel or sequential manner. It is found that the sequential arrangement gives a better result than the parallel arrangement.
For the sequential arrangement, experimental results show that the channel-first order is slightly better than the spatial-first order.
When CBAM is integrated into a ResBlock, it is applied to the convolution outputs of the block, before the shortcut addition.
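The two-step refinement and its placement in a residual block can be sketched in plain Python. This is only an illustration under stated assumptions: the attention maps `Mc` and `Ms` are passed in as given values (in CBAM they are inferred from F and F’ respectively), the function names `refine` and `res_block_output` are hypothetical, the toy numbers are made up, and the activation after the shortcut addition is omitted.

```python
def refine(F, Mc, Ms):
    """Return F'' = Ms * (Mc * F), broadcasting both attention maps.

    F  : C x H x W nested list (feature map)
    Mc : list of C floats (the C x 1 x 1 channel attention map)
    Ms : H x W nested list (the 1 x H x W spatial attention map)
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    # F' = Mc * F: each channel is scaled by its own scalar.
    F1 = [[[Mc[c] * F[c][i][j] for j in range(W)] for i in range(H)]
          for c in range(C)]
    # F'' = Ms * F': each location is scaled identically in every channel.
    return [[[Ms[i][j] * F1[c][i][j] for j in range(W)] for i in range(H)]
            for c in range(C)]


def res_block_output(x, conv_out, Mc, Ms):
    """Identity shortcut plus the CBAM-refined convolution output."""
    refined = refine(conv_out, Mc, Ms)
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[x[c][i][j] + refined[c][i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


# Hypothetical toy values: 2 channels, 2 x 2 spatial size.
conv_out = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
x = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
Mc = [0.5, 0.25]
Ms = [[1.0, 0.0], [0.5, 1.0]]
out = res_block_output(x, conv_out, Mc, Ms)
```

Note that a spatial weight of 0.0 suppresses that location in every channel, so the block output there falls back to the shortcut input alone.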

2. Channel Attention Module

Channel attention focuses on ‘what’ is meaningful given an input image.
To compute the channel attention efficiently, the spatial dimension of the input feature map is squeezed.
For aggregating spatial information, average-pooling has been commonly adopted. But it is argued that max-pooling gathers another important clue about distinctive object features to infer finer channel-wise attention.
Thus, both average-pooled and max-pooled features are used simultaneously.

Fcavg and Fcmax denote the average-pooled and max-pooled features, respectively. Both descriptors are then forwarded to a shared network to produce the channel attention map Mc.
The shared network is a multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to C/r×1×1, where r is the reduction ratio.
After the shared network is applied to each descriptor, the output feature vectors are merged using element-wise summation.
In short, Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(Fcavg)) + W1(W0(Fcmax))), where σ denotes the sigmoid function and W0 and W1 are the MLP weights, shared for both inputs, with a ReLU after W0. This Mc(F) is element-wise multiplied with F to form F’.
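The channel attention steps described above (spatial squeeze by average- and max-pooling, a shared one-hidden-layer MLP, element-wise summation, sigmoid) can be sketched as follows. This is a dependency-free illustration, not the authors' code: the helper names, the bias-free MLP, and the toy weights are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def shared_mlp(descriptor, W0, W1):
    """One hidden layer of size C/r with ReLU, then back to C outputs."""
    hidden = [max(0.0, h) for h in matvec(W0, descriptor)]
    return matvec(W1, hidden)


def channel_attention(F, W0, W1):
    """Return Mc(F) as a list of C values in (0, 1).

    F  : C x H x W nested list
    W0 : (C/r) x C weights, W1 : C x (C/r) weights (assumed bias-free)
    """
    # Squeeze the spatial dimensions in two ways.
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in F]
    mx = [max(max(row) for row in ch) for ch in F]
    # The same MLP weights process both descriptors; outputs are summed.
    merged = [a + m for a, m in zip(shared_mlp(avg, W0, W1),
                                    shared_mlp(mx, W0, W1))]
    return [sigmoid(z) for z in merged]


# Hypothetical toy setup: C = 4 channels, reduction ratio r = 2.
F = [[[0.0, 2.0], [2.0, 4.0]],
     [[-1.0, -1.0], [-1.0, -1.0]],
     [[1.0, 1.0], [1.0, 1.0]],
     [[3.0, 0.0], [0.0, 3.0]]]
W0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
Mc = channel_attention(F, W0, W1)
```

With these toy weights, the first channel has an average of 2 and a maximum of 4, so its attention value is σ(2 + 4); the second channel is zeroed by the ReLU and receives σ(0) = 0.5.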

3. Spatial Attention Module

The spatial attention focuses on ‘where’ is an informative part, which is complementary to the channel attention.
To compute the spatial attention, average-pooling and max-pooling operations are applied along the channel axis, and the results are concatenated to generate an efficient feature descriptor.
A convolution layer is then applied to the concatenated descriptor to generate a spatial attention map Ms(F) with the size of 1×H×W, which encodes where to emphasize or suppress.

Specifically, the channel information of a feature map is aggregated by two pooling operations, generating two 2D maps: Fsavg with the size of 1×H×W and Fsmax with the size of 1×H×W.
The spatial attention is then computed as Ms(F) = σ(f7×7([AvgPool(F); MaxPool(F)])) = σ(f7×7([Fsavg; Fsmax])), where σ denotes the sigmoid function and f7×7 represents a convolution operation with the filter size of 7×7.
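A minimal sketch of the spatial attention computation, written with plain nested lists so that every step is visible. The function name, the zero padding that keeps the H×W size, the absence of a bias term, and the toy kernel are assumptions for illustration; a practical implementation would use a tensor library.

```python
import math


def spatial_attention(F, kernel):
    """Return Ms(F) as an H x W nested list of values in (0, 1).

    F      : C x H x W nested list
    kernel : 2 x k x k weights (one k x k slice per pooled map), k odd
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Pool along the channel axis, then stack into a 2 x H x W descriptor.
    avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)]
           for i in range(H)]
    mx = [[max(F[c][i][j] for c in range(C)) for j in range(W)]
          for i in range(H)]
    pooled = [avg, mx]
    Ms = []
    for i in range(H):
        row = []
        for j in range(W):
            acc = 0.0
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        # Positions outside the map count as zero padding.
                        if 0 <= y < H and 0 <= x < W:
                            acc += kernel[p][di][dj] * pooled[p][y][x]
            row.append(1.0 / (1.0 + math.exp(-acc)))
        Ms.append(row)
    return Ms


# Hypothetical toy input: 2 channels, 3 x 3 spatial size.
F = [[[0.0] * 3 for _ in range(3)], [[2.0] * 3 for _ in range(3)]]
# 7 x 7 kernel that only passes the centre of the average-pooled map.
kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
kernel[0][3][3] = 1.0
Ms = spatial_attention(F, kernel)
```

Here the channel-wise average is 1 everywhere, so each entry of `Ms` equals σ(1); a learned 7×7 kernel would instead mix a wide neighbourhood of both pooled maps.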

4. Ablation Study on ImageNet

4.1. Max Pool or Avg Pool

**Comparison of different channel attention methods**

It is argued that max-pooled features, which encode the degree of the most salient part, can compensate for the average-pooled features, which encode global statistics softly.
Thus, both features are used simultaneously, and a shared network is applied to them.
The proposed channel attention module is an effective way to push performance beyond SE, as used in SENet, without additional learnable parameters.

4.2. Spatial and Channel Attention

**Comparison of different spatial attention methods**

Channel pooling produces better accuracy than learnable weighted channel pooling (implemented as a 1×1 convolution), indicating that explicitly modeled pooling leads to finer attention inference.
It is found that adopting a larger kernel size (k=7 rather than k=3) generates better accuracy in both cases. This implies that a broad view (i.e. a large receptive field) is needed for deciding spatially important regions.
In brief, the average- and max-pooled features across the channel axis, followed by a convolution with a kernel size of 7, are used as the spatial attention module.

4.3. Arrangement of the Channel and Spatial Attention

From a spatial viewpoint, the channel attention is globally applied, while the spatial attention works locally.
It is found that generating the attention maps sequentially yields finer attention than generating them in parallel. In addition, the channel-first order performs slightly better than the spatial-first order.
The final module design achieves a top-1 error of 22.66%, which is lower than that of SE.

5. SOTA Comparison

5.1. ImageNet

**Classification results on ImageNet-1K.**

ResNet, WideResNet (WRN), and ResNeXt with CBAM outperform all the baselines significantly.
This implies that CBAM is powerful, showing the efficacy of the new pooling method, which generates a richer descriptor, and of the spatial attention, which complements the channel attention effectively.
CBAM not only boosts the accuracy of baselines significantly but also favorably improves the performance of SE.

**Classification results on ImageNet-1K using the light-weight network, MobileNet**

The overall overhead of CBAM is quite small in terms of both parameters and computation, which makes CBAM well suited to the light-weight network MobileNetV1.
The above improvement shows the great potential of CBAM for applications on low-end devices.

5.2. Network Visualization with Grad-CAM

Grad-CAM is a recently proposed visualization method which uses gradients in order to calculate the importance of the spatial locations in convolutional layers.
The Grad-CAM results show the attended regions clearly.
We can clearly see that the Grad-CAM masks of the CBAM-integrated network cover the target object regions better than other methods.
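For readers unfamiliar with Grad-CAM, the core computation can be sketched as follows: the gradients of the class score with respect to a convolutional layer's activations are global-average-pooled into one weight per channel, and the weighted sum of the activation maps is passed through a ReLU. In this sketch the gradients are supplied as an input (an autograd framework would normally produce them), and the function name and toy values are hypothetical.

```python
def grad_cam(activations, gradients):
    """Return an H x W importance map.

    activations : K x H x W nested list from one convolutional layer
    gradients   : K x H x W nested list, d(class score) / d(activations)
    """
    K, H, W = len(activations), len(activations[0]), len(activations[0][0])
    # One importance weight per channel: the spatially averaged gradient.
    alphas = [sum(sum(row) for row in g) / (H * W) for g in gradients]
    # Weighted sum of activation maps; ReLU keeps only positive evidence.
    return [[max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(K)))
             for j in range(W)] for i in range(H)]


# Hypothetical toy values: 2 channels, 2 x 2 spatial size.
activations = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(activations, gradients)
```

In this toy case the first channel supports the class and the second opposes it, so only the bottom row of the map, where the first channel dominates, stays positive.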

5.3. MS COCO Object Detection

**Object detection mAP(%) on the MS COCO validation set**

Faster R-CNN is used as the detection method, with ImageNet pre-trained ResNet50 and ResNet101 as the baseline networks.
Significant improvements over the baselines are observed, demonstrating the generalization performance of CBAM on other recognition tasks.

5.4. VOC 2007 Object Detection

SSD and StairNet are used as the object detectors.
We can clearly see that CBAM improves the accuracy of all strong baselines with two backbone networks.

The accuracy improvement of CBAM comes with a negligible parameter overhead, indicating that enhancement is not due to a naive capacity increment but because of the effective feature refinement.
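To give a rough sense of why the parameter overhead is small, the count for one CBAM module can be worked out by hand. The sketch below assumes a bias-free shared MLP, a reduction ratio of r = 16, a 7×7 spatial kernel, and a 256-channel block; the function names and the comparison against a single bias-free 3×3 convolution are illustrative assumptions, not figures from the article.

```python
def cbam_params(C, r=16, k=7):
    """Parameter count of one CBAM module (assumed bias-free)."""
    # Shared MLP: W0 is (C/r) x C and W1 is C x (C/r). The weights are
    # reused for both pooled descriptors, so they are counted once.
    channel = 2 * C * (C // r)
    # One k x k filter over the 2-channel pooled descriptor.
    spatial = 2 * k * k
    return channel, spatial


def conv3x3_params(C_in, C_out):
    """Parameter count of a bias-free 3 x 3 convolution, for comparison."""
    return C_in * C_out * 3 * 3


channel, spatial = cbam_params(256)
total = channel + spatial
ratio = total / conv3x3_params(256, 256)
print(channel, spatial, total, round(ratio, 4))
```

Under these assumptions the channel branch has the same weight count as an SE-style two-layer bottleneck MLP, and the spatial branch adds only 98 weights, so the whole module is on the order of 1% of a single 256-to-256 3×3 convolution.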

Reference

[2018 ECCV] [CBAM]
CBAM: Convolutional Block Attention Module

Image Classification

[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [Cutout] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [Deep Roots] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [ENAS] [PNASNet] [ShuffleNet V2] [BAM] [CBAM] [AmoebaNet] [ESPNetv2]