Reading: BAM — Bottleneck Attention Module (Image Classification)

In this story, BAM: Bottleneck Attention Module (BAM), by Korea Advanced Institute of Science and Technology (KAIST) and Adobe Research, is presented. In this story:

  • A new module, the Bottleneck Attention Module (BAM), is designed that can be integrated with any feed-forward CNN.
  • This module infers an attention map along two separate pathways, channel and spatial.
  • It is placed at each bottleneck of models where the downsampling of feature maps occurs.

This is a paper in 2018 BMVC with over 100 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Overview of BAM: Bottleneck Attention Module
  2. Details of BAM: Bottleneck Attention Module
  3. Ablation Study
  4. SOTA Comparison

1. Overview of BAM: Bottleneck Attention Module

The Placement of BAM: Bottleneck Attention Module

1.1. The Placement of BAM

  • On the CIFAR-100 and ImageNet classification tasks, the authors observe performance improvements over baseline networks by placing BAM.
  • Interestingly, the improvement can be observed when multiple BAMs, located at different bottlenecks, build a hierarchical attention, as shown above.

1.2. BAM as 3D Attention Map

  • Given an input feature map F of size C×H×W, BAM infers a 3D attention map M(F) of the same size C×H×W, and the refined feature map F′ is computed as:

    F′ = F + F ⊗ M(F)

  • where ⊗ denotes element-wise multiplication.
  • A residual learning scheme along with the attention mechanism is adopted to facilitate the gradient flow.
  • The channel attention Mc(F) of size C and the spatial attention Ms(F) of size H×W are computed in two separate branches, and the final attention map M(F) is obtained as:

    M(F) = σ(Mc(F) + Ms(F))

  • where σ is a sigmoid function. Both branch outputs are resized to C×H×W before addition.
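
As a quick illustration, here is a minimal shape check of the refinement above in PyTorch (not the authors' code; the attention map M here is just a random stand-in for the BAM output):

```python
import torch

# Toy shape check of F' = F + F ⊗ M(F), assuming an attention map M
# that has already been broadcast to the same C×H×W size as F.
C, H, W = 64, 32, 32
F = torch.randn(1, C, H, W)                  # input feature map
M = torch.sigmoid(torch.randn(1, C, H, W))   # random stand-in for BAM's output
F_refined = F + F * M                        # residual attention refinement
print(F_refined.shape)                       # torch.Size([1, 64, 32, 32])
```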

2. Details of BAM: Bottleneck Attention Module

Details of BAM: Bottleneck Attention Module

2.1. Channel Attention Branch

  • This branch is similar to the channel attention used in SENet.
  • First, take global average pooling on the feature map F and produce a channel vector Fc with the size of C×1×1. This vector softly encodes global information in each channel.
  • To estimate attention across channels from the channel vector Fc, a multi-layer perceptron (MLP) with one hidden layer is used.
  • To limit the parameter overhead, the hidden activation size is set to C/r×1×1, where r is the reduction ratio.
  • After the MLP, a batch normalization (BN) layer is added to adjust the scale with the spatial branch output.
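
Putting the steps above together, a minimal PyTorch sketch of such a channel branch could look like the following (module and variable names are my own, not from the official code; the ReLU between the two linear layers is an assumption):

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel branch sketch: global average pooling -> MLP (hidden size C/r) -> BN."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # hidden activation of size C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # back to C channels
        )
        self.bn = nn.BatchNorm1d(channels)               # scale adjustment w.r.t. spatial branch

    def forward(self, x):                  # x: (B, C, H, W)
        gap = x.mean(dim=(2, 3))           # global average pooling -> (B, C)
        return self.bn(self.mlp(gap))      # channel attention logits Mc(F): (B, C)
```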

2.2. Spatial Attention Branch

  • The spatial branch produces a spatial attention map Ms(F) of size H×W to emphasize or suppress features at different spatial locations.
  • It is important to have a large receptive field to effectively leverage contextual information.
  • Dilated convolution, as used in DeepLab and DilatedNet, is adopted to enlarge the receptive field with high efficiency.
  • The “bottleneck structure”, suggested by ResNet, is adopted in the spatial branch, which saves both the number of parameters and computational overhead.
  • Specifically, the feature F of the size C×H×W is projected into a reduced dimension C/r×H×W using 1×1 convolution to integrate and compress the feature map across the channel dimension. The same reduction ratio r with the channel branch is used for simplicity.
  • After the reduction, two 3×3 dilated convolutions are applied to utilize contextual information effectively.
  • Finally, the features are again reduced to a 1×H×W spatial attention map using 1×1 convolution.
  • For a scale adjustment, a batch normalization layer is applied at the end of the spatial branch.
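
A corresponding sketch of the spatial branch, following the steps just described, might look like this (again with my own naming; the exact padding and the placement of ReLUs are assumptions, chosen so that the spatial size is preserved):

```python
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial branch sketch: 1×1 reduction -> two dilated 3×3 convs -> 1×1 -> BN."""
    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        reduced = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),     # C -> C/r (channel compression)
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3,
                      padding=dilation, dilation=dilation),  # dilated 3×3, enlarges receptive field
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3,
                      padding=dilation, dilation=dilation),  # second dilated 3×3
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, 1, kernel_size=1),            # C/r -> 1 (spatial attention map)
            nn.BatchNorm2d(1),                               # scale adjustment
        )

    def forward(self, x):         # x: (B, C, H, W)
        return self.body(x)       # spatial attention logits Ms(F): (B, 1, H, W)
```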

2.3. Combination of Two Attention Branches

  • Since the two attention maps have different shapes, the attention maps are expanded to the size of C×H×W before combining them.
  • Among various combining methods, such as element-wise summation, multiplication, or max operation, element-wise summation is chosen for efficient gradient flow.
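
Combining the two sketches above, the fusion and the residual refinement from Section 1.2 could be written as follows (a sketch under the stated assumptions, reusing the hypothetical ChannelAttention and SpatialAttention modules; broadcasting plays the role of expanding both maps to C×H×W):

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """BAM sketch: M(F) = σ(Mc(F) + Ms(F)), then F' = F + F ⊗ M(F)."""
    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention(channels, reduction, dilation)

    def forward(self, x):                               # x: (B, C, H, W)
        mc = self.channel_att(x)[:, :, None, None]      # (B, C, 1, 1), broadcast over H×W
        ms = self.spatial_att(x)                        # (B, 1, H, W), broadcast over C
        m = torch.sigmoid(mc + ms)                      # element-wise summation, then sigmoid
        return x + x * m                                # residual attention refinement
```

Such a module would then be dropped in at each bottleneck of the backbone, i.e. right before each spatial downsampling stage, as described in Section 1.1.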

3. Ablation Study

Ablation Study on CIFAR-100
  • ResNet-50 is used as the baseline network.

3.1. (a) Dilation Rate

  • The performance improvement is observed with larger dilation values, though it is saturated at the dilation value of 4.

3.2. (a) Reduction Ratio

  • The reduction ratio is directly related to the number of channels in both attention branches, which makes it possible to control the capacity and overhead of the module. Interestingly, the reduction ratio of 16 achieves the best accuracy, even though the reduction ratios of 4 and 8 have higher capacity.
  • Thus, the dilation value is set to 4 and the reduction ratio to 16 in the following experiments.

3.3. (b) Branch Combination or Separation

  • Although each attention branch alone is effective in improving performance over the baseline, we observe a significant performance boost when both branches are used jointly.

3.4. (b) Branch Combination method

  • In terms of the information flow, the element-wise summation is an effective way to integrate and secure the information from the previous layers.
  • In the backward phase, the gradient is distributed equally to all of the inputs, leading to efficient training. Element-wise product, which can assign a large gradient to the small input, makes the network hard to converge, yielding the inferior performance. Element-wise maximum, which routes the gradient only to the higher input, provides a regularization effect to some extent, leading to unstable training since our module has few parameters.
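
The gradient-routing argument above can be checked with a tiny, self-contained PyTorch example (not from the paper; the scalar values are arbitrary, chosen only to contrast a small and a large input):

```python
import torch

# Compare how sum, product, and max distribute gradients to their inputs.
for name, op in [("sum",     lambda a, b: a + b),
                 ("product", lambda a, b: a * b),
                 ("max",     lambda a, b: torch.maximum(a, b))]:
    a = torch.tensor(0.1, requires_grad=True)   # "small" input
    b = torch.tensor(5.0, requires_grad=True)   # "large" input
    op(a, b).backward()
    print(f"{name}: grad_a={a.grad.item()}, grad_b={b.grad.item()}")

# sum:     grad_a=1.0, grad_b=1.0   -> gradient shared equally
# product: grad_a=5.0, grad_b=0.1   -> the small input receives the large gradient
# max:     grad_a=0.0, grad_b=1.0   -> gradient routed only to the larger input
```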

3.5. (c) Comparison with placing original convblocks

  • Auxiliary convolution blocks that have the same topology as the baseline convolution blocks are added, and then compared with BAM.
  • We can clearly see that plugging in BAM not only produces superior performance but also incurs less overhead than naively placing the extra layers.
  • The gain from BAM is therefore not merely due to the increased depth but comes from effective feature refinement.

3.6. Bottleneck: The efficient point to place BAM

Bottleneck vs. Inside each Convolution Block
  • Placing the module at the bottleneck is effective in terms of the overhead/accuracy trade-off.
  • It incurs much less overhead with better accuracy in most cases, except for PreResNet-110 (Pre-Activation ResNet).

4. SOTA Comparison

4.1. CIFAR-100

CIFAR-100 experiment results
  • While ResNet101 and ResNeXt29 16×64d networks achieve 20.00% and 17.25% error respectively, ResNet50 with BAM and ResNeXt29 8×64d with BAM achieve 20.00% and 16.71% error respectively using only half of the parameters.
  • BAM can efficiently raise the capacity of networks with a smaller number of network parameters.

4.2. ImageNet-1K

ImageNet classification results
  • ResNet, WideResNet (WRN), and ResNeXt are used as baseline networks for the ImageNet classification task.
  • The networks with BAM outperform all the baselines.
  • For compact networks designed for mobile and embedded systems, such as MobileNetV1 and SqueezeNet, BAM boosts the accuracy of all the models with little overhead, since no squeezing operation is adopted.

4.3. MS COCO & VOC 2007 Object Detection

Experiments on detection tasks: MS-COCO and VOC 2007

4.4. Comparison with Squeeze-and-Excitation (SE)

Comparison with Squeeze-and-Excitation
  • BAM outperforms SE (of SENet) in most cases with fewer parameters.
  • The BAM module requires slightly more GFLOPs but has far fewer parameters than SE.
