[Review] Evaluating Weakly Supervised Object Localization Methods Right (Weakly Supervised Object Localization)

A New Evaluation Protocol Is Proposed for WSOL. It Is Found That CAM (CVPR’16) is NOT Worse Than ADL (CVPR’19), CutMix (ICCV’19), SPG (ECCV’18), ACoL (CVPR’18) and Hide-and-Seek (HaS, ICCV’17)

Sik-Ho Tsang
8 min read · Dec 19, 2020
Accuracy Measured Using New Protocol

In this story, Evaluating Weakly Supervised Object Localization Methods Right, by Yonsei University, LINE Plus Corp., NAVER Corp., and University of Tuebingen, is presented. In this paper:

  • A new evaluation protocol is proposed where full supervision is limited to only a small held-out set not overlapping with the test set.
  • The five most recent WSOL methods have not made a major improvement over the CAM baseline, as shown above.
  • Existing WSOL methods have not reached the few-shot learning baseline, in which the full supervision available at validation time is used for model training itself.

This is a paper in 2020 CVPR with over 24 citations. (Sik-Ho Tsang @ Medium)


  1. Problem Formulation of Weakly Supervised Object Localization (WSOL)
  2. Evaluation Protocol for WSOL
  3. Data Splits and Hyperparameter Search
  4. Experimental Results

1. Problem Formulation of Weakly Supervised Object Localization (WSOL)

  • Object localization is the task of identifying, for each pixel, whether it belongs to the object of interest.
  • Given an image X, only its image-level label Y is available. This task is referred to as weakly-supervised object localization (WSOL).

1.1. WSOL as Multiple-Instance Learning (MIL)

  • WSOL is interpreted as a patch classification task trained with multiple-instance learning (MIL).
  • The aim of WSOL is to produce a scoring function s(X) such that thresholding it at τ closely approximates binary label T.

1.2. When is WSOL ill-posed?

Ill-posed WSOL: An example.
  • If background cues are more strongly associated with the target labels T than some foreground cues, WSOL still cannot be perfectly solved.
  • In other words, if the posterior likelihood for the image-level label Y given a foreground cue Mfg is less than the posterior likelihood given background Mbg for some foreground and background cues, no WSOL method can make a correct prediction.
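In symbols (following the paper’s notation, where Mfg and Mbg denote foreground and background cues), the ill-posedness condition can be written as:

```latex
p(Y \mid M_{\mathrm{fg}}) < p(Y \mid M_{\mathrm{bg}})
```

Whenever this inequality holds for some foreground cue, any scoring rule based on this posterior ranks that foreground cue below a background cue, so no threshold τ recovers the correct mask.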

1.3. How have WSOL methods addressed the ill-posedness?

  • Some sought architecture modifications, or data augmentation.
  • However, some found the operating threshold τ via “observing a few qualitative results”.
  • Some have evaluated their models over the test set to select reasonable hyperparameter values.

2. Evaluation Protocol for WSOL

  • The MaxBoxAcc and PxAP metrics are proposed for bounding box and mask ground truths, respectively.
  • The localization accuracy metric entangles classification and localization performances, while the goal of WSOL is to localize objects, not to classify images correctly. To this end, only the score maps sij corresponding to the ground-truth classes are considered. This is commonly referred to as the GT-known metric.

2.1. Masks: PxAP

  • When masks are available for evaluation, we can measure the pixel-wise precision and recall.
  • Those metrics allow users to choose the preferred operating threshold τ that provides the best precision-recall trade-off for their downstream applications.
  • The pixel precision and recall at threshold τ are defined as:
  • PxPrec(τ) = |{sij ≥ τ} ∩ {Tij = 1}| / |{sij ≥ τ}| and PxRec(τ) = |{sij ≥ τ} ∩ {Tij = 1}| / |{Tij = 1}|.
  • For threshold independence, the pixel average precision (PxAP) is used, i.e., the area under the pixel precision-recall curve.
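As a minimal sketch of how PxAP can be computed (numpy-based; the threshold grid and names are my own, not the official wsolevaluation code):

```python
import numpy as np

def pxap(score_map, gt_mask, num_thresholds=101):
    """Pixel average precision: area under the pixel precision-recall curve.

    score_map: (H, W) float scores in [0, 1]; gt_mask: (H, W) binary mask.
    """
    taus = np.linspace(1.0, 0.0, num_thresholds)  # descending: recall grows
    ap, prev_rec = 0.0, 0.0
    for tau in taus:
        pred = score_map >= tau
        tp = np.logical_and(pred, gt_mask).sum()
        prec = tp / max(pred.sum(), 1)      # PxPrec(tau)
        rec = tp / max(gt_mask.sum(), 1)    # PxRec(tau)
        ap += prec * (rec - prev_rec)       # rectangle-rule area under PR curve
        prev_rec = rec
    return float(ap)
```

A score map that exactly matches the ground-truth mask gives PxAP = 1; note that no single operating threshold is ever chosen.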

2.2. Bounding Boxes: MaxBoxAcc

  • Pixel-wise masks are expensive to collect. Many datasets only provide box annotations.
  • Given the ground truth box B, the box accuracy at score map threshold τ and IoU threshold δ is defined as:
  • BoxAcc(τ, δ) = (1/N) Σn 1[ IoU( box(s(X(n)), τ), B(n) ) ≥ δ ],
  • where box(s(X(n)), τ) is the tightest box around the largest-area connected component of the thresholded mask {(i, j) | s(X(n))ij ≥ τ}.
  • In datasets where more than one bounding box is provided (e.g. ImageNet), an image counts as correct when the box prediction overlaps with at least one of the ground truth boxes with IoU ≥ δ.
  • When δ is 0.5, the metric is identical to the commonly used GT-known localization accuracy or CorLoc.
  • For score map threshold independence, the box accuracy at the optimal threshold τ is reported, i.e., the maximal box accuracy MaxBoxAcc(δ) = maxτ BoxAcc(τ, δ).
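The full pipeline — threshold the score map, take the tightest box around the largest connected component, then sweep τ — can be sketched as follows (a simplified numpy/pure-Python version; the official wsolevaluation code differs in details such as the threshold grid):

```python
import numpy as np

def largest_component_box(binary_mask):
    """Tightest box (x0, y0, x1, y1) around the largest connected component."""
    h, w = binary_mask.shape
    visited = np.zeros((h, w), dtype=bool)
    best = None
    for i in range(h):
        for j in range(w):
            if binary_mask[i, j] and not visited[i, j]:
                # Iterative flood fill over 4-connected neighbours.
                stack, pixels = [(i, j)], []
                visited[i, j] = True
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary_mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                if best is None or len(pixels) > len(best):
                    best = pixels
    if best is None:
        return None
    ys = [p[0] for p in best]; xs = [p[1] for p in best]
    return (min(xs), min(ys), max(xs), max(ys))

def iou(a, b):
    """IoU of two boxes (x0, y0, x1, y1), inclusive pixel coordinates."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / (area(a) + area(b) - inter)

def max_box_acc(score_maps, gt_boxes, delta=0.5, num_thresholds=20):
    """MaxBoxAcc(delta) = max over tau of BoxAcc(tau, delta).

    gt_boxes: per image, a list of ground-truth boxes; a prediction counts
    if it overlaps at least one of them with IoU >= delta.
    """
    taus = np.linspace(0.0, 1.0, num_thresholds, endpoint=False)
    best = 0.0
    for tau in taus:
        hits = 0
        for s, boxes in zip(score_maps, gt_boxes):
            pred = largest_component_box(s >= tau)
            if pred is not None and any(iou(pred, b) >= delta for b in boxes):
                hits += 1
        best = max(best, hits / len(score_maps))
    return best
```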

2.3. Better Box Evaluation: MaxBoxAccV2

  • MaxBoxAcc measures the performance at a fixed IoU threshold (δ = 0.5), only considering a specific level of fineness of localization outputs.
  • Authors suggest averaging the performance across δ ∈ {0.3, 0.5, 0.7} to address diverse demands for localization fineness.
  • This new metric is referred to as MaxBoxAccV2. For future WSOL research, using the MaxBoxAccV2 metric is encouraged.
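To my understanding of the evaluation code, MaxBoxAccV2 takes the best threshold τ per IoU level δ and then averages across the three δ values. Given a precomputed BoxAcc(τ, δ) table (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical BoxAcc(tau, delta) table: rows are IoU thresholds delta,
# columns are score-map thresholds tau (values here are illustrative only).
deltas = [0.3, 0.5, 0.7]
box_acc = np.array([
    [0.55, 0.62, 0.58, 0.41],   # delta = 0.3
    [0.48, 0.57, 0.51, 0.33],   # delta = 0.5
    [0.30, 0.38, 0.35, 0.20],   # delta = 0.7
])

max_per_delta = box_acc.max(axis=1)     # MaxBoxAcc at each delta
max_box_acc_v2 = max_per_delta.mean()   # average across {0.3, 0.5, 0.7}
```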

3. Data Splits and Hyperparameter Search

3.1. Data Split

Dataset statistics. “~” indicates that the number of images per class varies across classes; the average value is shown.
  • Authors proposed three disjoint splits for every dataset: train-weaksup, train-fullsup, and test.
  • The train-weaksup contains images with weak supervision (the image-level labels).
  • The train-fullsup contains images with full supervision (either bounding box or binary mask).
  • The test split contains images with full supervision. It must be used only for the final performance report. Checking the test results multiple times with different model configurations violates the protocol, e.g. running ablation experiments on the test set, because this overfits the test set.

3.1.1. ImageNet

  • The 1.2M “train” and 10K “validation” images for 1000 classes are treated as our train-weaksup and test, respectively.
  • For train-fullsup, the ImageNetV2 [38] is used since the annotated bounding boxes on those images are available.

3.1.2. CUB

  • 5994 “train” and 5794 “test” images for 200 classes, are treated as our train-weaksup and test, respectively.
  • For train-fullsup, 1000 extra images (5 images per class) from Flickr are collected, on which authors have annotated bounding boxes.
  • For ImageNet and CUB, the box accuracy metric (MaxBoxAcc) is used.

3.1.3. OpenImages

  • A new WSOL benchmark based on the OpenImages instance segmentation subset is used.
  • Being new, this WSOL benchmark is one that models have not yet overfitted to.
  • 100 classes are sub-sampled, and 29819, 2500, and 5000 images are randomly selected from the original “train”, “validation”, and “test” splits as the train-weaksup, train-fullsup, and test splits, respectively.
  • The pixel average precision PxAP is used.

3.2. Hyperparameter Search

  • Random search hyperparameter optimization is used. It is simple, effective, and parallelizable.
  • For each WSOL method, 30 hyperparameters are sampled to train models on train-weaksup and validate on train-fullsup. The best hyperparameter combination is then selected.
  • Since running 30 training sessions is costly for ImageNet (1.2M training images), 10% of the images in each class are used for fitting models during the search.
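The search loop itself is straightforward; a sketch (the hyperparameter names and ranges are illustrative, not the paper’s exact search space):

```python
import random

def sample_hyperparameters():
    """Draw one random hyperparameter combination (illustrative ranges)."""
    return {
        "learning_rate": 10 ** random.uniform(-5, -1),   # log-uniform
        "score_map_drop_rate": random.uniform(0.0, 1.0),
    }

def random_search(train_fn, validate_fn, num_trials=30):
    """Train on train-weaksup, validate on train-fullsup; keep the best."""
    best_hp, best_score = None, float("-inf")
    for _ in range(num_trials):
        hp = sample_hyperparameters()
        model = train_fn(hp)         # fit on train-weaksup
        score = validate_fn(model)   # MaxBoxAcc / PxAP on train-fullsup
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score
```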

4. Experimental Results

4.1. SOTA

4.2. Few-Shot Learning (FSL) Baseline

  • The full supervision in train-fullsup used for validating WSOL hyperparameters can instead be used for training a model itself. Since only a few fully labeled samples per class are available, this setting is referred to as the few-shot learning (FSL) baseline.
  • As a simple baseline, we consider a foreground saliency mask predictor (TPAMI’10) [29]. The last layer of a fully convolutional network (FCN) is altered into a 1×1 convolutional layer with H×W score map output. Each pixel is trained with the binary cross-entropy loss against the target mask.
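A minimal numpy sketch of such a head: a 1×1 convolution (i.e., a per-pixel linear map over channels) trained with per-pixel binary cross-entropy. The feature extractor, shapes, and learning rate here are stand-ins, not the paper’s setup:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 14, 14
features = rng.normal(size=(C, H, W))               # stand-in FCN feature map
target = (rng.random((H, W)) > 0.5).astype(float)   # stand-in binary GT mask

w = rng.normal(scale=0.1, size=(C,))   # 1x1 conv weights over C channels
b = 0.0

def forward(feat, w, b):
    logits = np.tensordot(w, feat, axes=([0], [0])) + b   # (H, W) score map
    return 1.0 / (1.0 + np.exp(-logits))                  # sigmoid

def bce(p, t):
    return float(-(t * np.log(p + 1e-8) + (1 - t) * np.log(1 - p + 1e-8)).mean())

loss_before = bce(forward(features, w, b), target)
for _ in range(200):                        # plain gradient descent on BCE
    p = forward(features, w, b)
    g = (p - target) / (H * W)              # dBCE/dlogits for mean reduction
    w -= 0.5 * np.tensordot(g, features, axes=([0, 1], [1, 2]))
    b -= 0.5 * g.sum()
loss_after = bce(forward(features, w, b), target)
```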

4.3. Center-Gaussian Baseline

  • It generates an isotropic Gaussian score map centered at the image center.
  • The standard deviation is set to 1, but note that it does not affect the MaxBoxAcc and PxAP measures. This provides a no-learning baseline for every localization method.
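A sketch of this no-learning baseline (the function name is mine):

```python
import numpy as np

def center_gaussian_score_map(h, w, sigma=1.0):
    """Isotropic Gaussian score map centered on the image.

    sigma does not change MaxBoxAcc / PxAP: both metrics sweep over all
    score thresholds, and the Gaussian's level sets are the same nested
    regions for any sigma (a monotone rescaling of the scores).
    """
    ys = np.arange(h) - (h - 1) / 2.0
    xs = np.arange(w) - (w - 1) / 2.0
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
```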

4.4. Comparison of WSOL Methods

Re-evaluating WSOL
  • In the paper, the best checkpoint is used for WSOL evaluation, but after acceptance by CVPR 2020, the authors recommend using the final checkpoint instead, since the peak performance reflects noise rather than real performance.
  • Recent WSOL methods have not led to major improvements compared to CAM.
  • On ImageNet, methods after CAM are generally struggling: only CutMix has seen a boost of +0.3pp on average.
  • On CUB, ADL has attained a +2.0pp gain on average, but ADL fails to work well on other benchmarks.
  • On the new WSOL benchmark, OpenImages, no method has improved over CAM, except for CutMix (+0.4pp on average).

4.5. Score Calibration and Thresholding

Selecting τ
  • Measuring performance at a fixed threshold τ can lead to a false sense of improvement.
Performance at varying operating thresholds. ImageNet: BoxAcc(τ) versus τ. OpenImages: PxPrec(τ) versus PxRec(τ). Both use ResNet.
  • Under the threshold-independent performance measures (MaxBoxAcc and PxAP) shown in the above figure, we observe that:
  1. The methods have different optimal τ* on ImageNet.
  2. The methods do not exhibit significantly different MaxBoxAcc or PxAP performances.
  • This provides an explanation of the lack of improvement observed in 4.4.

4.6. Hyperparameter Analysis

Results of the 30 hyperparameter trials
  • It is suggested to use the vanilla CAM when absolutely no full supervision is available. ACoL and ADL tend to have greater variances across benchmarks (σ = 11.9 and 9.8 on CUB).
  • It is conjectured that the drop threshold for adversarial erasing is a sensitive hyperparameter.
  • ACoL and SPG suffer from many training failures, especially on CUB (43% and 37%, respectively).

In conclusion, vanilla CAM is stable and robust to hyperparameters. Complicated design choices introduced by later methods only seem to lower the overall performances rather than providing new avenues for performance boost.

4.7. WSOL Versus Few-Shot Learning

WSOL versus few-shot learning
  • As shown in 4.4, the simple FSL method performs better than the vanilla CAM at 10, 5, and 5 fully labeled samples per class for ImageNet, CUB, and OpenImages, respectively.
  • The mean FSL accuracy on CUB is 92.0%, which is far better than that of the maximal WSOL performance of 70.8%.
  • FSL is compared against CAM at different sizes of train-fullsup, as shown in the above figure.
  • FSL baselines surpass the CAM results with only 1–2 fully labeled samples per class for CUB and OpenImages.

Thus, given a few fully labeled samples, it is perhaps better to train a model with them than to search hyperparameters. Only when there is absolutely no full supervision (0 fully labeled samples) is CAM meaningful (better than the no-learning center-Gaussian baseline).

Authors proposed some future research directions:

  1. Resolve the ill-posedness via e.g. adding more background-class images.
  2. Define the new task, semi-weakly-supervised object localization, where methods incorporating both weak and full supervision are studied.


[2020 CVPR] [Evaluating WSOL Right]
Evaluating Weakly Supervised Object Localization Methods Right

Weakly Supervised Object Localization (WSOL)

2014 [Backprop] 2016 [CAM] 2017 [Grad-CAM] [Hide-and-Seek] 2018 [ACoL] [SPG] 2019 [CutMix] [ADL] 2020 [Evaluating WSOL Right]
