Reading: CBAM — Convolutional Block Attention Module (Image Classification)

CBAM Outperforms SENet on top of MobileNetV1, ResNet, WRN & ResNeXt


In this story, “CBAM: Convolutional Block Attention Module” (CBAM) is presented. In this paper:

  • Given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied with the input feature map for adaptive feature refinement.
  • CBAM can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs.
  • It can be seen as an extension of BAM, published in 2018 BMVC.

This is a paper in 2018 ECCV with over 1000 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. CBAM: General Architecture
  2. Channel Attention Module
  3. Spatial Attention Module
  4. Ablation Study on ImageNet
  5. SOTA Comparison

1. CBAM: General Architecture

(Figure: The overview of CBAM)
  • CBAM sequentially infers a 1D channel attention map Mc with size of C×1×1 and a 2D spatial attention map Ms with the size of 1×H×W:
F' = Mc(F) ⨂ F
F'' = Ms(F') ⨂ F'
  • where ⨂ denotes element-wise multiplication, and F’’ is the final refined output.
  • The two modules can be placed in a parallel or sequential manner. It is found that the sequential arrangement gives better results than the parallel arrangement.
  • For the sequential arrangement, experimental results show that the channel-first order is slightly better than the spatial-first order.
  • An example of CBAM in a ResBlock is shown below, followed by a short code sketch:
(Figure: CBAM integrated with a ResBlock in ResNet)
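As a rough illustration of this wiring, here is a minimal Python sketch of the sequential refinement; channel_att and spatial_att are hypothetical callables standing in for the two modules detailed in Sections 2 and 3:

```python
# Minimal sketch of CBAM's sequential refinement (channel first, then spatial).
# `channel_att` and `spatial_att` are hypothetical callables that return the
# broadcastable attention maps Mc(F) (C x 1 x 1) and Ms(F') (1 x H x W).
def cbam_refine(feat, channel_att, spatial_att):
    refined = channel_att(feat) * feat        # F'  = Mc(F)  (x) F
    refined = spatial_att(refined) * refined  # F'' = Ms(F') (x) F'
    return refined
```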

2. Channel Attention Module

(Figure: Channel attention module)
  • Channel attention focuses on ‘what’ is meaningful given an input image.
  • To compute the channel attention efficiently, the spatial dimension of the input feature map is squeezed.
  • For aggregating spatial information, average-pooling has been commonly adopted, but it is argued that max-pooling gathers another important clue about distinctive object features, helping to infer finer channel-wise attention.
  • Thus, both average-pooled and max-pooled features are used simultaneously.
Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(MLP(Fcavg) + MLP(Fcmax))
  • Fcavg and Fcmax denote the average-pooled and max-pooled features respectively. Both descriptors are then forwarded to a shared network to produce the channel attention map Mc.
  • The shared network is a multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to C/r×1×1, where r is the reduction ratio.
  • After the shared network is applied to each descriptor, the output feature vectors are merged using element-wise summation.
  • σ denotes the sigmoid function. This Mc(F) is element-wise multiplied with F to form F'. A code sketch of this module is given below.
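
A minimal PyTorch sketch of the channel attention module, assuming r=16 and implementing the shared MLP with 1×1 convolutions (an implementation convenience, not prescribed by the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Shared one-hidden-layer MLP over the avg- and max-pooled descriptors,
    merged by element-wise summation and passed through a sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP; 1x1 convs keep the (N, C, 1, 1) shape without reshaping.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))  # MLP(Fcavg)
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))   # MLP(Fcmax)
        return torch.sigmoid(avg + mx)               # Mc(F), shape (N, C, 1, 1)
```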

3. Spatial Attention Module

(Figure: Spatial attention module)
  • The spatial attention focuses on ‘where’ the informative part is, which is complementary to the channel attention.
  • To compute the spatial attention, average-pooling and max-pooling operations are first applied along the channel axis, and the two outputs are concatenated to generate an efficient feature descriptor.
  • A convolution layer is then applied to generate a spatial attention map Ms(F) with the size of 1×H×W, which encodes where to emphasize or suppress.
Ms(F) = σ(f7×7([AvgPool(F); MaxPool(F)])) = σ(f7×7([Fsavg; Fsmax]))
  • Specifically, the channel information of a feature map is aggregated by two pooling operations, generating two 2D maps: Fsavg and Fsmax, each with the size of 1×H×W.
  • σ denotes the sigmoid function and f7×7 represents a convolution operation with a filter size of 7×7. A code sketch combining both attention modules follows.
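
A matching PyTorch sketch of the spatial attention module and the full channel-first CBAM block (reusing the ChannelAttention sketch above; names and defaults are illustrative):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise avg- and max-pooling, concatenation, then a 7x7 conv."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)   # Fsavg, shape (N, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)  # Fsmax, shape (N, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms(F)

class CBAM(nn.Module):
    """Channel-first sequential arrangement, the best variant in the paper."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention(kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.channel_att(x) * x     # F'  = Mc(F) (x) F
        return self.spatial_att(x) * x  # F'' = Ms(F') (x) F'
```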

4. Ablation Study on ImageNet

(Table: Comparison of different channel attention methods)
  • It is argued that max-pooled features, which encode the degree of the most salient part, can compensate for the average-pooled features, which encode global statistics softly.
  • Thus, both features are used simultaneously, and a shared network is applied to them.
  • The channel attention module is an effective way to push performance further than the SE module used in SENet, without additional learnable parameters.
(Table: Comparison of different spatial attention methods)
  • The channel pooling produces better accuracy than learnable weighted channel pooling (a 1×1 convolution), indicating that explicitly modeled pooling leads to finer attention inference.
  • It is found that adopting a larger kernel size (k=7) generates better accuracy in both cases. It implies that a broad view (i.e. large receptive field) is needed for deciding spatially important regions.
  • In brief, the spatial attention module uses the average- and max-pooled features across the channel axis with a convolution kernel size of 7.
(Table: Combining methods of channel and spatial attention)
  • From a spatial viewpoint, the channel attention is globally applied, while the spatial attention works locally.
  • It is found that generating the attention maps sequentially infers a finer attention map than generating them in parallel. In addition, the channel-first order performs slightly better than the spatial-first order.
  • With this final design, CBAM achieves a top-1 error of 22.66%, which is much lower than that of SE.

5. SOTA Comparison

(Table: ImageNet classification results)
  • ResNet, WideResNet (WRN), and ResNeXt with CBAM outperform all the baselines significantly.
  • This implies that CBAM is powerful, showing the efficacy of the new pooling method, which generates a richer descriptor, and of the spatial attention, which complements the channel attention effectively.
  • CBAM not only boosts the accuracy of the baselines significantly but also improves upon SE.
(Table: ImageNet classification results with MobileNet)
  • The overall overhead of CBAM is quite small in terms of both parameters and computation, making it quite suitable for the light-weight network MobileNetV1 (a rough parameter count is sketched after this list).
  • The above improvement shows the great potential of CBAM for applications on low-end devices.
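
As a back-of-envelope check of that overhead (hypothetical numbers: one CBAM block at C=512 channels with r=16):

```python
# Rough count of one CBAM block's extra parameters for C=512, r=16, k=7
# (bias-free layers, as in the sketches above).
C, r, k = 512, 16, 7
mlp_params = 2 * C * (C // r)    # two 1x1 conv layers of the shared MLP
conv_params = 2 * k * k          # 7x7 conv over the 2-channel pooled map
print(mlp_params + conv_params)  # 32866 -- tiny next to ResNet50's ~25.6M
```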
(Figure: Grad-CAM visualization results)
  • Grad-CAM is a recently proposed visualization method which uses gradients to calculate the importance of spatial locations in convolutional layers (a minimal sketch follows this list).
  • The Grad-CAM results show the attended regions clearly, as above.
  • We can clearly see that the Grad-CAM masks of the CBAM-integrated network cover the target object regions better than other methods.
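
For reference, a minimal sketch of the Grad-CAM computation (a hypothetical helper, not the paper's code): gradients of the class score, pooled per channel, weight the activations of a chosen convolutional layer:

```python
import torch

def grad_cam(model, layer, image, class_idx):
    """Grad-CAM heat map for `class_idx` at `layer` (hypothetical helper)."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(image)[0, class_idx]  # image: (1, 3, H, W); model returns logits
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # channel-wise importance
    cam = torch.relu((weights * acts[0]).sum(dim=1))   # weighted activation sum
    return cam / cam.max()                             # normalized (1, H', W') map
```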
(Table: MS COCO object detection results)
  • Faster R-CNN is used as the detection method, with ImageNet-pretrained ResNet50 and ResNet101 as the baseline networks.
  • Significant improvements over the baselines are obtained, demonstrating the generalization performance of CBAM on other recognition tasks.
(Table: Object detection results with SSD and StairNet)
  • SSD and StairNet are used as the object detectors.
  • We can clearly see that CBAM improves the accuracy of all the strong baselines with both backbone networks.

The accuracy improvement of CBAM comes with a negligible parameter overhead, indicating that the enhancement is not due to a naive capacity increment but to effective feature refinement.
