[Paper] Backprop: Visualising Image Classification Models and Saliency Maps (Weakly Supervised Object Localization)
Weakly Supervised Object Localization (WSOL) Using AlexNet
In this story, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (Backprop), by the Visual Geometry Group, University of Oxford, is briefly presented. As you may already know, this is a paper from the famous VGG research group. It is called Backprop here because later papers refer to it by that name.
Weakly supervised object localization (WSOL) aims to find the bounding box of the main object within the image using only the image-level label, without any bounding box annotation.
In this paper:
- Two visualisation methods are proposed, both based on the gradient of the class score with respect to the input image: class model visualisation and image-specific class saliency visualisation.
- For the saliency-based method, GraphCut is utilized for weakly supervised object localization (WSOL).
This is a paper in 2014 ICLR Workshop with over 2200 citations. (Sik-Ho Tsang @ Medium)
Outline
1. Gradient-Based Class Model Visualisation
2. Image-Specific Class Saliency Visualisation
3. Weakly Supervised Object Localization (WSOL)
1. Gradient-Based Class Model Visualisation
- An AlexNet-like CNN is used: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000, where convN denotes a convolutional layer with N filters and fullM denotes a fully-connected layer with M outputs.
- Let Sc(I) be the score of the class c, computed by the classification layer of the ConvNet for an image I. We would like to find an L2-regularised image such that the score Sc is high:
  $$\arg\max_{I} \; S_c(I) - \lambda \|I\|_2^2$$
- where λ is the regularization parameter. A locally optimal I can be found by back-propagation.
- The (unnormalised) class scores Sc before the soft-max are used, rather than the class posteriors returned by the soft-max layer, because a posterior can be increased simply by lowering the scores of the other classes.
- The optimization is performed with respect to the input image, using the zero image as initialization, and then the training set mean image is added to the result.
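The optimisation above can be sketched in a few lines. This is only a toy illustration, not the authors' code: the function name `class_model_visualisation`, the step size `lr`, the step count, and the linear stand-in score are all assumptions made for the example. In a real ConvNet, `score_grad` would be the gradient of Sc with respect to the image, obtained by back-propagation.

```python
from typing import Callable, List

Vector = List[float]  # a vectorised image


def class_model_visualisation(
    score_grad: Callable[[Vector], Vector],
    num_pixels: int,
    lam: float = 0.1,
    lr: float = 0.5,
    steps: int = 200,
) -> Vector:
    """Gradient ascent on S_c(I) - lam * ||I||_2^2, starting from the zero image."""
    image = [0.0] * num_pixels  # zero-image initialisation
    for _ in range(steps):
        grad_score = score_grad(image)  # dS_c/dI (back-propagation in a real net)
        for k in range(num_pixels):
            # The gradient of the penalty lam * ||I||^2 is 2 * lam * I.
            image[k] += lr * (grad_score[k] - 2.0 * lam * image[k])
    return image


# Hypothetical stand-in for a ConvNet: a linear score S_c(I) = w . I + b,
# whose gradient with respect to I is simply w.
toy_w = [1.0, -2.0, 0.5, 0.0]


def toy_score_grad(image: Vector) -> Vector:
    return list(toy_w)


result = class_model_visualisation(toy_score_grad, num_pixels=len(toy_w))
```

For this linear toy score the optimum is known in closed form, I = w / (2λ), so `result` should approach `[w / 0.2 for w in toy_w]`. The final step of adding the training set mean image is left out, since the toy example has no training set.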
2. Image-Specific Class Saliency Visualisation
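- Consider the linear score model for the class c:
  $$S_c(I) = w_c^{T} I + b_c$$
- where the image I is represented in vectorised (one-dimensional) form, and w_c and b_c are the weight vector and the bias of the model. It is easy to see that the magnitude of the elements of w_c defines the importance of the corresponding pixels of I for the class c.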
- In the case of deep ConvNets, the class score Sc(I) is a highly non-linear function of I. However, given an image I0, we can approximate Sc(I) with a linear function in the neighbourhood of I0 by computing the first-order Taylor expansion:
  $$S_c(I) \approx w^{T} I + b$$
- where w is the derivative of Sc with respect to the image I at the point (image) I0:
  $$w = \left. \frac{\partial S_c}{\partial I} \right|_{I_0}$$
- Another interpretation is that the magnitude of the derivative indicates which pixels need to be changed the least to affect the class score the most.
- One can expect that such pixels correspond to the object location in the image.
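- For a grey-scale image, the saliency map is Mij = |w_h(i,j)|, where h(i,j) is the index of the element of w corresponding to the image pixel in the i-th row and j-th column. For an RGB image, the paper takes the maximum magnitude of w across the colour channels at each pixel.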
- It is important to note that the saliency maps are extracted using a classification ConvNet trained on the image labels, so no additional annotation is required (such as object bounding boxes or segmentation masks).
- The computation of the image-specific saliency map for a single class is extremely quick, since it only requires a single back-propagation pass.
- In the examples shown in the paper, the class predictions are computed on 10 cropped and reflected sub-images; a saliency map is computed for each of the 10 sub-images, and the 10 maps are then averaged.
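The saliency computation described in this section can be sketched for a grey-scale image as follows. This is a toy illustration under stated assumptions: `toy_score` is a made-up non-linear score, and `numerical_gradient` uses central finite differences as a stand-in for the single back-propagation pass a real ConvNet would use. Neither comes from the paper.

```python
import math
from typing import Callable, List

Image = List[List[float]]  # grey-scale image as rows of pixel values


def numerical_gradient(score: Callable[[Image], float], image: Image,
                       eps: float = 1e-5) -> Image:
    """Central-difference estimate of dS_c/dI at `image` (stand-in for backprop)."""
    grad = [[0.0] * len(row) for row in image]
    for i, row in enumerate(image):
        for j in range(len(row)):
            original = image[i][j]
            image[i][j] = original + eps
            plus = score(image)
            image[i][j] = original - eps
            minus = score(image)
            image[i][j] = original  # restore the pixel
            grad[i][j] = (plus - minus) / (2.0 * eps)
    return grad


def saliency_map(score: Callable[[Image], float], image: Image) -> Image:
    """M_ij = |w_h(i,j)|, where w is the gradient of the score at `image`."""
    grad = numerical_gradient(score, image)
    return [[abs(g) for g in row] for row in grad]


# Hypothetical non-linear score that depends mostly on the centre pixel.
def toy_score(image: Image) -> float:
    return math.tanh(2.0 * image[1][1] + 0.1 * image[0][0])


image0 = [[0.2, 0.5, 0.1], [0.4, 0.3, 0.9], [0.7, 0.6, 0.8]]
saliency = saliency_map(toy_score, image0)
```

Here the largest entry of `saliency` is the centre pixel, which is the pixel the toy score is most sensitive to. Pixels the score ignores get a saliency of zero.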
3. Weakly Supervised Object Localization (WSOL)
3.1. Segmentation Using GraphCut
- Given an image and the corresponding class saliency map, we compute the object segmentation mask using the GraphCut colour segmentation.
Conceptually, given seeds, GraphCut segments the image based on colour. In this paper, the seeds are provided by the saliency map.
- (GraphCut is another big research topic. If interested, please read the paper about GraphCut: “Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images” in 2001 ICCV, which has over 5000 citations.)
- Foreground and background colour models were set to be the Gaussian Mixture Models. The foreground model was estimated from the pixels with the saliency higher than a threshold, set to the 95% quantile of the saliency distribution in the image; the background model was estimated from the pixels with the saliency smaller than the 30% quantile.
Once the image pixel labelling into foreground and background is computed, the object segmentation mask is set to the largest connected component of the foreground pixels.
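The seeding and post-processing steps around GraphCut can be sketched as below. GraphCut itself and the Gaussian Mixture Model fitting are not implemented here. The function names, the linear-interpolation quantile, and the choice of 4-connectivity are assumptions made for this example, since the text above does not specify them.

```python
from collections import deque
from typing import List

Grid = List[List[float]]


def quantile(values: List[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def graphcut_seeds(saliency: Grid, fg_q: float = 0.95,
                   bg_q: float = 0.30) -> List[List[int]]:
    """Label pixels: 1 = foreground seed, 0 = background seed, -1 = undecided."""
    flat = [v for row in saliency for v in row]
    fg_thr, bg_thr = quantile(flat, fg_q), quantile(flat, bg_q)
    return [[1 if v > fg_thr else 0 if v < bg_thr else -1 for v in row]
            for row in saliency]


def largest_component(mask: List[List[int]]) -> List[List[int]]:
    """Keep only the largest 4-connected component of the pixels equal to 1."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for si in range(h):
        for sj in range(w):
            if mask[si][sj] != 1 or seen[si][sj]:
                continue
            component, queue = [], deque([(si, sj)])
            seen[si][sj] = True
            while queue:  # breadth-first flood fill from the start pixel
                i, j = queue.popleft()
                component.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < h and 0 <= nj < w
                            and mask[ni][nj] == 1 and not seen[ni][nj]):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(component) > len(best):
                best = component
    out = [[0] * w for _ in range(h)]
    for i, j in best:
        out[i][j] = 1
    return out


# Hypothetical 4x5 saliency map with values 0..19.
toy_saliency = [[float(5 * i + j) for j in range(5)] for i in range(4)]
seeds = graphcut_seeds(toy_saliency)

# Hypothetical foreground/background labelling, as if returned by GraphCut.
toy_labels = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
object_mask = largest_component(toy_labels)
```

In a full pipeline, the foreground and background seed pixels would be used to fit the two colour models, GraphCut would label every pixel, and `largest_component` would then clean up the resulting foreground.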
3.2. ILSVRC-2013 Localisation Challenge
- The above object localisation method was entered into the ILSVRC-2013 localisation challenge.
- Because the challenge requires object bounding boxes to be reported, the bounding boxes are computed from the object segmentation masks.
- The procedure was repeated for each of the top-5 predicted classes.
- The method achieved 46.4% top-5 error on the test set of ILSVRC-2013.
- It should be noted that the method is weakly supervised (unlike the challenge winner with 29.9% error), and the object localisation task was not taken into account during training.
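The step of turning a segmentation mask into a bounding box, mentioned in the list above, can be sketched as the tightest axis-aligned box around the mask pixels. The function name `mask_to_bbox` and the (row_min, col_min, row_max, col_max) convention are assumptions for illustration. The text above does not say exactly how the box was derived.

```python
from typing import List, Optional, Tuple


def mask_to_bbox(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Tightest box (row_min, col_min, row_max, col_max), inclusive, around non-zero pixels."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    if not rows:
        return None  # empty mask: no object found
    return rows[0], cols[0], rows[-1], cols[-1]


# Hypothetical object segmentation mask (1 = object, 0 = background).
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(toy_mask)
```

For `toy_mask` this gives rows 1 to 2 and columns 1 to 3. In the challenge setting, one such box would be produced for each of the top-5 predicted classes.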
[2014 ICLR Workshop] [Backprop]
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps