[Paper] CAM: Learning Deep Features for Discriminative Localization (Weakly Supervised Object Localization)

Revisiting Global Average Pooling (GAP): Weakly Supervised Object Localization Alongside Image Classification, Outperforming Backprop

Class Activation Mapping (CAM)
  • By using GAP, the Class Activation Mapping (CAM) technique is proposed for Weakly Supervised Object Localization (WSOL).
  • Object localization and image classification are performed in a single forward-pass.


  1. Class Activation Mapping (CAM) Using Global Average Pooling (GAP)
  2. AlexNet, VGGNet, GoogLeNet Using GAP for CAM
  3. Experimental Results

1. Class Activation Mapping (CAM) Using Global Average Pooling (GAP)

1.1. Network Using GAP for CAM

Class Activation Mapping (CAM) Using Global Average Pooling (GAP)
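  • GAP outputs the spatial average of each feature map of the last convolutional layer, and a weighted sum of these averages produces the final output. Similarly, a weighted sum of the feature maps themselves gives the class activation maps.
  • For a given class c, the input to the softmax is S_c = Σ_k w_k^c Σ_{x,y} f_k(x, y), where f_k(x, y) is the activation of unit k in the last convolutional layer at spatial location (x, y) and w_k^c is the weight connecting unit k to class c.
  • M_c is defined as the class activation map for class c, where each spatial element is given by M_c(x, y) = Σ_k w_k^c f_k(x, y), so that S_c = Σ_{x,y} M_c(x, y).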
The CAMs of two classes from ILSVRC
  • The maps highlight the discriminative image regions used for image classification: the head of the animal for briard and the plates for barbell.
  • It is observed that the discriminative regions for different categories are different even for a given image.
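
The weighted sum described in this subsection can be sketched in a few lines of NumPy. This is only an illustration: the array names, the toy sizes, and the random values are assumptions, not the trained networks. GAP is implemented here as a spatial mean, so each class score equals the spatial mean of the corresponding CAM.

```python
import numpy as np

def class_activation_maps(features, weights):
    """Compute one class activation map per class.

    features: (K, H, W) activations of the last convolutional layer.
    weights:  (C, K) weights of the linear layer that follows GAP.
    Returns:  (C, H, W) array; entry [c] is the map for class c.
    """
    # Weighted sum over the K feature maps, done separately at each location.
    return np.einsum("ck,khw->chw", weights, features)

rng = np.random.default_rng(0)
features = rng.random((16, 14, 14))       # toy stand-in for conv features
weights = rng.standard_normal((5, 16))    # toy stand-in for learned weights

cams = class_activation_maps(features, weights)

# Class scores computed the usual way: GAP first, then the linear layer.
gap_vector = features.mean(axis=(1, 2))   # shape (K,)
scores = weights @ gap_vector             # shape (C,)

# The same scores are recovered by spatially averaging each CAM.
scores_from_cams = cams.mean(axis=(1, 2))
print(np.allclose(scores, scores_from_cams))
```

In practice the low-resolution map would be upsampled to the input image size before being overlaid as a heatmap; that step is omitted here.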

1.2. Weakly Supervised Object Localization (WSOL)

  • To generate a bounding box from the CAMs, a simple thresholding technique is used to segment the heatmap.
Examples of localization from GoogLeNet-GAP
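
A minimal sketch of turning a heatmap into a bounding box by thresholding. The function name `cam_to_bbox` and the relative threshold of 0.2 are assumptions made for this example, and the sketch simply boxes every above-threshold pixel instead of isolating a single connected region.

```python
import numpy as np

def cam_to_bbox(cam, rel_threshold=0.2):
    """Return (x_min, y_min, x_max, y_max) covering pixels above the threshold.

    rel_threshold is a fraction of the map's maximum (0.2 is an assumed value).
    Coordinates are inclusive pixel indices. Returns None if nothing passes.
    """
    mask = cam > rel_threshold * cam.max()
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy heatmap: a warm rectangular blob with a hotter core.
cam = np.zeros((14, 14))
cam[3:8, 5:11] = 1.0
cam[4:6, 6:9] = 3.0

bbox = cam_to_bbox(cam)
print(bbox)  # (5, 3, 10, 7)
```

Raising `rel_threshold` gives a tighter box around the hottest region; lowering it gives a looser one.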

1.3. GAP vs GMP

  • GAP: It is believed that the GAP loss encourages the network to identify the full extent of the object, whereas GMP encourages it to identify just one discriminative part.
  • When a map is averaged, its value can be maximized by finding all discriminative parts of an object, since every low activation reduces the output of that map.
  • GMP: On the other hand, for Global Max Pooling (GMP), low scores for all image regions except the most discriminative one do not affect the score.
  • In experiments, GMP achieves classification performance similar to GAP, but GAP outperforms GMP for localization.
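
The intuition behind GAP vs GMP can be checked on a toy feature map. The two 4×4 maps below are made-up values chosen only to illustrate the argument: both have the same peak, but one also responds to the rest of the object.

```python
import numpy as np

def gap(fmap):
    """Global average pooling of a single 2-D feature map."""
    return float(fmap.mean())

def gmp(fmap):
    """Global max pooling of a single 2-D feature map."""
    return float(fmap.max())

# The object covers a 2x2 region; one cell is the most discriminative part.
full_extent = np.zeros((4, 4))
full_extent[1:3, 1:3] = 0.8
full_extent[1, 1] = 1.0

# Same peak, but only the single most discriminative cell fires.
peak_only = np.zeros((4, 4))
peak_only[1, 1] = 1.0

# GMP cannot tell the two maps apart: it is 1.0 for both.
print(gmp(full_extent), gmp(peak_only))

# GAP is higher when the whole object is activated.
print(gap(full_extent) > gap(peak_only))
```

Under GMP there is no reward for covering the rest of the object, while under GAP every additional activated location raises the pooled value.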

2. AlexNet, VGGNet, GoogLeNet Using GAP for CAM

  • AlexNet, VGGNet, and GoogLeNet are modified to have GAP for CAM.
  • The fully-connected layers before the final output are removed and replaced by GAP followed by a fully connected softmax layer.
  • For AlexNet, the layers after conv5 (i.e., pool5 to prob) are removed resulting in a mapping resolution of 13×13.
  • For VGGNet, the layers after conv5–3 (i.e., pool5 to prob) are removed, resulting in a mapping resolution of 14×14.
  • For GoogLeNet, the layers after inception4e (i.e., pool4 to prob) are removed, resulting in a mapping resolution of 14×14.
  • To each of the above networks, a convolutional layer of size 3×3, stride 1, pad 1 with 1024 units is added, followed by a GAP layer and a softmax layer.
  • Each network is fine-tuned on ImageNet.
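
As a quick arithmetic check, a 3×3 convolution with stride 1 and pad 1 keeps the spatial size unchanged, so the added layer preserves the mapping resolutions listed above. The helper name `conv_output_size` is an assumption for this sketch; it applies the standard output-size formula for a convolution.

```python
def conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a convolution applied to an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1

# Mapping resolutions after removing the later layers of each network.
for name, n in [("AlexNet", 13), ("VGGNet", 14), ("GoogLeNet", 14)]:
    out = conv_output_size(n, kernel=3, stride=1, pad=1)
    print(f"{name}-GAP: {n}x{n} -> {out}x{out}")
```

With 1024 units in the added layer, GAP then reduces each network's output to a 1024-dimensional vector regardless of the spatial size.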

3. Experimental Results

3.1. Classification

Classification error on the ILSVRC validation set.
  • AlexNet is the most affected by the removal of the fully connected layers. To compensate, two convolutional layers are added just before GAP, giving the AlexNet*-GAP network, which performs comparably to AlexNet.
  • It is noted that GoogLeNet-GAP and GoogLeNet-GMP have similar performance on classification, as expected.

3.2. Weakly-Supervised Object Localization (WSOL)

Localization error on the ILSVRC validation set.
  • The result is remarkable because no annotated bounding boxes are used during training.
  • The CAM approach also significantly outperforms the Backprop approach.
  • GoogLeNet-GAP outperforms GoogLeNet-GMP by a reasonable margin, illustrating the importance of average pooling over max pooling for identifying the extent of objects.
Class activation maps from CNN-GAPs and the class-specific saliency map from the Backprop methods.
Localization error on the ILSVRC test set
  • Heuristic: two bounding boxes (one tight and one loose) are selected from the class activation maps of each of the top-1 and top-2 predicted classes, and one loose bounding box from the class activation map of the top-3 predicted class.
  • This heuristic is a trade-off between classification accuracy and localization accuracy.
  • GoogLeNet-GAP with heuristic achieves a top-5 error rate of 37.1% in a weakly-supervised setting, which is surprisingly close to the top-5 error rate of AlexNet (34.2%) in a fully-supervised setting.
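
The box-selection heuristic can be sketched as follows. It yields five (class, box) guesses in total, which matches the five guesses allowed by a top-5 metric. How the tight and loose boxes are actually produced is not described here, so `boxes_for_class` is a hypothetical callable and `toy_boxes` returns made-up coordinates.

```python
def select_boxes(ranked_classes, boxes_for_class):
    """Pick five (class_id, box) guesses from the top-3 predicted classes.

    ranked_classes:  class ids sorted by decreasing predicted score.
    boxes_for_class: hypothetical callable mapping a class id to a dict
                     with a "tight" box and a "loose" box for that class.
    """
    selected = []
    for rank, cls in enumerate(ranked_classes[:3]):
        boxes = boxes_for_class(cls)
        if rank < 2:
            # Top-1 and top-2 classes contribute both boxes.
            selected.append((cls, boxes["tight"]))
        # Every one of the top-3 classes contributes its loose box.
        selected.append((cls, boxes["loose"]))
    return selected

def toy_boxes(cls):
    # Made-up boxes, only to exercise the selection logic.
    return {"tight": (cls, cls, cls + 10, cls + 10),
            "loose": (cls - 2, cls - 2, cls + 12, cls + 12)}

ranked = [7, 3, 9, 1, 4]   # toy class ids, best first
candidates = select_boxes(ranked, toy_boxes)
print(len(candidates))     # 5
```

Spending two guesses on each of the two most likely classes favors localization, while still covering a third class favors classification.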

3.3. Deep Features for Generic Localization

Classification accuracy on representative scene and object datasets for different deep features.
  • A linear SVM is trained on the output of the GAP layer.
  • GoogLeNet-GAP and GoogLeNet significantly outperform AlexNet.
  • GoogLeNet-GAP features are competitive with the state-of-the-art as generic visual features.
  • As shown above, the most discriminative regions tend to be highlighted across all datasets.
  • The CAM approach is effective for generating localizable deep features for generic tasks.

3.4. Fine-grained Recognition

Fine-grained classification performance on CUB200
  • When bounding box annotations are used, the classification accuracy of GoogLeNet-GAP increases to 70.5%.
  • GoogLeNet-GAP is able to accurately localize the bird in 41.0% of the images under the 0.5 intersection over union (IoU) criterion, as compared to a chance performance of 5.5%.
CAMs and the inferred bounding boxes (in red) for selected images from four bird categories in CUB200.
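
For reference, the 0.5 IoU criterion can be computed as below. This is a generic sketch: the function names and the example boxes are assumptions, and boxes are given as (x_min, y_min, x_max, y_max) in continuous coordinates.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def is_localized(pred_box, gt_box, threshold=0.5):
    """True when the predicted box meets the IoU criterion."""
    return iou(pred_box, gt_box) >= threshold

pred = (0, 0, 10, 10)
gt = (5, 0, 15, 10)
# Overlap is 5 x 10 = 50 and union is 100 + 100 - 50 = 150, so IoU is 1/3.
print(iou(pred, gt), is_localized(pred, gt))
```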


[2016 CVPR] [CAM]
Learning Deep Features for Discriminative Localization

Weakly Supervised Object Localization (WSOL)

2014 [Backprop] 2016 [CAM]
