[Review] ADL: Attention-based Dropout Layer (Weakly Supervised Object Localization)

Outperforms SPG, ACoL, Hide-and-Seek & CAM

Sik-Ho Tsang
Dec 12, 2020


Weakly Supervised Object Localization (WSOL)

In this story, Attention-based Dropout Layer for Weakly Supervised Object Localization (ADL), by Yonsei University, is presented.

Weakly Supervised Object Localization (WSOL) aims to localize objects when only image-level labels, and no object bounding box labels, are available for training. In this paper:

  • Attention-based Dropout Layer (ADL) hides the most discriminative part from the model so that it captures the integral extent of the object, and highlights the informative region to improve the recognition power of the model.

This is a 2019 CVPR paper with over 60 citations. (Sik-Ho Tsang @ Medium)


  1. Attention-based Dropout Layer (ADL)
  2. Ablation Study
  3. Experimental Results

1. Attention-based Dropout Layer (ADL)

Attention-based Dropout Layer (ADL)

1.1. ADL Overview

  • A self-attention map is obtained by performing channel-wise average pooling on the input feature map.
  • Based on the self-attention map, two key components of ADL are produced, a drop mask and an importance map.
  • The drop mask is used to hide the most discriminative part during training. This induces the model to learn the less discriminative part as well. This drop mask is obtained by thresholding the self-attention map.
  • The importance map is used to highlight the informative region to improve the classification power of the model. Owing to the importance map, a more accurate self-attention map can be produced.
  • The importance map is computed by applying sigmoid activation to the self-attention map.
  • During training, either the drop mask or the importance map is stochastically selected at each iteration, and the selected one is applied to the input feature map by spatial-wise multiplication.
  • During testing, ADL is deactivated, so the feature map passes through unchanged.
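
The steps above can be sketched in a few lines. This is only a minimal, dependency-free illustration of the description in this section, not the authors' implementation: plain Python lists stand in for tensors, the function name adl_forward and its parameters are hypothetical, and the drop threshold is assumed to be γ times the maximum of the self-attention map.

```python
import math
import random


def adl_forward(features, drop_rate=0.75, gamma=0.9, training=True, rng=random):
    """Apply a simplified ADL to one feature map stored as features[c][i][j]."""
    if not training:
        # ADL is inactive at test time: the input passes through unchanged.
        return features

    channels = len(features)
    height, width = len(features[0]), len(features[0][0])

    # Self-attention map: channel-wise average pooling gives one H x W map.
    attention = [
        [sum(features[c][i][j] for c in range(channels)) / channels
         for j in range(width)]
        for i in range(height)
    ]

    if rng.random() < drop_rate:
        # Drop mask: 0 where the attention exceeds the threshold, 1 elsewhere.
        # Assumption: the threshold is gamma times the maximum attention value.
        threshold = gamma * max(max(row) for row in attention)
        selected = [[0.0 if a > threshold else 1.0 for a in row]
                    for row in attention]
    else:
        # Importance map: sigmoid of the self-attention map.
        selected = [[1.0 / (1.0 + math.exp(-a)) for a in row]
                    for row in attention]

    # Spatial-wise multiplication: the same H x W map scales every channel.
    return [
        [[features[c][i][j] * selected[i][j] for j in range(width)]
         for i in range(height)]
        for c in range(channels)
    ]
```

Calling adl_forward(features, drop_rate=1.0) always applies the drop mask, drop_rate=0.0 always applies the importance map, and training=False returns the input untouched. A real model would do the same thing on batched tensors inside the network.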

1.2. Hyperparameters

  • ADL has two main hyperparameters: drop rate and γ.
  • The drop mask Mdrop is obtained by setting each pixel to 0 if it is larger than the drop threshold, and to 1 otherwise.
  • The drop mask is therefore 0 over the most discriminative region and 1 elsewhere.
  • γ: controls the size of the region to be dropped.
  • The size of the region to be dropped increases as γ decreases.
  • Drop rate: indicates how frequently the drop mask is applied.
  • If the drop mask were applied at every iteration, the most discriminative part would never be observed during training, which is why the drop mask is applied only part of the time.
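
To make the two hyperparameters concrete, here is a small sketch on a toy 4×4 self-attention map. The helper names (drop_mask, dropped_fraction, uses_drop_mask) and the toy values are made up for illustration, and the threshold is again assumed to be γ times the maximum attention value, which matches the statement that a smaller γ drops a larger region.

```python
import random


def drop_mask(attention, gamma):
    """Return a 0/1 mask: 0 where attention exceeds gamma * max(attention)."""
    threshold = gamma * max(max(row) for row in attention)
    return [[0 if a > threshold else 1 for a in row] for row in attention]


def dropped_fraction(mask):
    """Fraction of spatial positions that the mask sets to 0."""
    cells = [m for row in mask for m in row]
    return cells.count(0) / len(cells)


def uses_drop_mask(drop_rate, rng):
    """Decide for one training iteration whether the drop mask is applied."""
    return rng.random() < drop_rate


# Toy self-attention map with a single strong peak (hypothetical values).
attention = [
    [0.1, 0.2, 0.3, 0.2],
    [0.2, 0.6, 0.8, 0.3],
    [0.3, 0.8, 1.0, 0.5],
    [0.2, 0.4, 0.5, 0.3],
]

# A smaller gamma lowers the threshold, so a larger region is dropped.
for gamma in (0.9, 0.7, 0.4):
    print(gamma, dropped_fraction(drop_mask(attention, gamma)))

# The drop rate only sets how often the drop mask is chosen over the importance map.
rng = random.Random(0)
applied = sum(uses_drop_mask(0.75, rng) for _ in range(10000))
print("drop mask used in", applied, "of 10000 iterations")
```

With a drop rate of 1.0, uses_drop_mask is true at every iteration, which is exactly the case where the most discriminative part is never seen during training.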

1.3. Comparison with SOTA WSOL Approaches

Comparison with SOTA WSOL Approaches
  • Hide-and-Seek (HaS) randomly erases parts of the input image, which is not an efficient way to remove the discriminative part of the object.
  • ACoL adds two auxiliary classifiers in parallel to the backbone feature extractor, which incurs a certain amount of overhead.
  • SPG does not erase the most discriminative part of the object. In addition, SPG requires substantial computing resources for improving the localization accuracy.

ADL can be easily plugged into multiple feature maps of existing classification models to improve localization accuracy, with no parameter overhead and nearly zero computation overhead.

2. Ablation Study

Drop mask and self-attention map at each layer of VGG-GAP
Upper: Accuracy according to drop rate. Middle: Baseline accuracy. Lower: Accuracy when each component has been deactivated.
  • Pre-trained VGG-GAP is used. ADLs are plugged in all the pooling layers and the conv5–3 layer.
  • Figure: The self-attention maps of lower-level layers (i.e., pool1 and pool2) contain class-agnostic general features.
  • Meanwhile, the self-attention maps of higher-level layers (i.e., pool4 and conv5–3) contain the class-specific features.
  • It is also observed that the drop masks from higher-level layers erase the most discriminative part more accurately than those from lower-level layers.
  • Upper: The best localization accuracy can be achieved when the drop rate is 75%.
  • When the drop mask is applied at every iteration (i.e., drop rate 100%), both the classification (Top-1 Clas) and localization (Top-1 Loc) accuracies are greatly reduced.
  • The classification accuracy increases as the drop rate decreases.
  • However, when the drop rate becomes too low (from 25% to 0%), the classification accuracy decreases again (from 68.99% to 67.78%). It is believed that this is caused by overfitting.
  • Lower: Using both the drop mask and the importance map gives better localization accuracy than using only one of them.
  • The best localization accuracy is achieved at the cost of some classification accuracy.
  • In addition, when the ADLs are applied to lower-level feature maps such as pool2 and pool1, the localization accuracy actually decreases. It is believed that this is because the lower-level feature maps include general features that are not related to the target class.
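
For readers new to the metrics, the two accuracies can be told apart with a short sketch. It assumes the usual WSOL convention that a Top-1 Loc prediction counts as correct only when the top-1 class is right and the predicted box overlaps the ground-truth box with IoU of at least 0.5. The exact evaluation protocol is not restated in this story, so the 0.5 threshold and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def top1_cls_correct(pred_label, true_label):
    """Top-1 Clas: only the predicted class has to match."""
    return pred_label == true_label


def top1_loc_correct(pred_label, true_label, pred_box, true_box, iou_threshold=0.5):
    """Top-1 Loc: the class must match AND the box must overlap enough."""
    return pred_label == true_label and iou(pred_box, true_box) >= iou_threshold
```

This also shows why the two numbers can move in different directions: a box that only covers the most discriminative part can fail the IoU test even when the class prediction is correct.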

3. Experimental Results

Qualitative evaluation results of VGG-GAP on CUB-200–2011 and ImageNet-1k. (Red: GT, Green: Predicted)
  • The heatmap and bounding box extracted from the CAM model highlight only the faces of the birds.
  • The model with ADL covers not only the face but the whole bird, from head to wing.
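
As a rough illustration of how a bounding box can be read off a heatmap, the sketch below thresholds the heatmap at a fraction of its maximum and takes the tight box around every position above the threshold. This is a simplification under stated assumptions: the 0.2 ratio and the helper name heatmap_to_box are hypothetical, and CAM-style pipelines typically keep only the largest connected region, which is skipped here.

```python
def heatmap_to_box(heatmap, ratio=0.2):
    """Return (x_min, y_min, x_max, y_max) around values above ratio * max, or None."""
    threshold = ratio * max(max(row) for row in heatmap)
    rows = [i for i, row in enumerate(heatmap) if any(v > threshold for v in row)]
    cols = [j for j in range(len(heatmap[0]))
            if any(row[j] > threshold for row in heatmap)]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# Toy heatmap that fires on one small part only (like a bird's face).
narrow_heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Toy heatmap that also responds to less discriminative parts.
wide_heatmap = [
    [0.0, 0.3, 0.4, 0.0],
    [0.0, 0.5, 1.0, 0.3],
    [0.0, 0.4, 0.6, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]

print(heatmap_to_box(narrow_heatmap))
print(heatmap_to_box(wide_heatmap))
```

The narrow heatmap yields a box around a single position, while the wider one yields a box spanning most of the object, which is the qualitative difference described in the bullets above.
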
Quantitative evaluation results on CUB-200–2011 and ImageNet-1k
  • ADL has no parameter overhead, and the computation overhead is nearly zero (e.g., 0.003% in ResNet50-SE), where SE is the SE Module in SENet.
  • To maximize the efficiency of WSOL, MobileNetV1 is employed as a backbone network. Since MobileNetV1 is a lightweight model, it is inappropriate to employ ACoL or SPG, which require huge additional computing resources. The accuracy gain of the proposed method is better than that of Hide-and-Seek (HaS).
  • When ResNet50-SE is employed as a backbone, the proposed method improves the localization accuracy by more than 15 percentage points over the state-of-the-art accuracy. The number of parameters of ResNet50-SE with ADL is much smaller than that of ACoL and SPG. A difference of 2–3 percentage points is quite impressive.
  • When VGG-GAP is used as a backbone, the accuracy of ADL is better than that of CAM, but slightly lower than that of ACoL.
  • When ResNet50-SE is used as a backbone, the localization accuracy of ADL is better than that of ACoL and comparable with that of SPG, even though the required computing resources are much lower.
  • When Inception-v3 is used as a backbone, comparable accuracy (0.11 percentage point difference) to SPG is achieved.
ADL learns not only the snowmobile, but also the snow and tree.
  • It is observed that the classifier extracts discriminative features from background that frequently appears with the target object.
  • In the case of the snowmobile class, the target object often co-occurs with snow. ADL learns not only the snowmobile, but also the snow and tree.
  • This explains why the accuracy gain differs between the two datasets: the bird images in CUB-200–2011 mainly contain sky and trees as background.


[2019 CVPR] [ADL]
Attention-based Dropout Layer for Weakly Supervised Object Localization

Weakly Supervised Object Localization (WSOL)

2014 [Backprop] 2016 [CAM] 2017 [Hide-and-Seek] 2018 [ACoL] [SPG] 2019 [ADL]
