Brief Review — CenterNet2: Probabilistic Two-Stage Detection
CenterNet2, Enhancing CenterNet Using Probabilistic Two-Stage Detector
Probabilistic two-stage detection
CenterNet2, by UT Austin, and Intel Labs
2021 CVPR, Over 280 Citations (Sik-Ho Tsang @ Medium)
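- In a two-stage detection pipeline, the first stage should infer a proper object-vs-background likelihood, but a standard region proposal network (RPN) cannot infer this likelihood sufficiently well.
- In this paper, a probabilistic two-stage detector is built from any state-of-the-art one-stage detector. The resulting detectors are faster and more accurate than both their one- and two-stage precursors.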
- By enhancing CenterNet using the proposed probabilistic two-stage detector, CenterNet2 is formed.
Outline
- Probabilistic Two-Stage Detector
- CenterNet2
- Results
1. Probabilistic Two-Stage Detector
- An object detector aims to predict the location bi and class-specific likelihood score si for any object i for a predefined set of classes C. The object location bi is most often described by two box corners or center+size.
For each image, the goal in this paper is to produce a set of K detections as bounding boxes b1, …, bK, each with an associated class distribution sk(c) = P(Ck = c) over classes c ∈ C ∪ {bg}, where bg denotes background.
There are two parts: A class-agnostic object likelihood P(Ok) (first stage) and a conditional categorical classification P(Ck|Ok) (second stage).
- Ok = 1 indicates a positive detection in the first stage, while Ok = 0 corresponds to background.
- Any negative first-stage detection Ok = 0 leads to a background Ck = bg classification: P(Ck = bg | Ok = 0) = 1.
- The joint class distribution of the two-stage model is then: P(Ck) = P(Ck | Ok = 1) P(Ok = 1) + P(Ck | Ok = 0) P(Ok = 0). (Eq. 1)
- The training objective is to use maximum likelihood estimation.
- For annotated objects, it maximizes: log P(Ck) = log P(Ck | Ok = 1) + log P(Ok = 1), (Eq. 2)
- which reduces to independent maximum-likelihood objectives for the first and second stage, respectively.
- For the background class, the maximum-likelihood objective does not factorize: log P(Ck = bg) = log(P(Ck = bg | Ok = 1) P(Ok = 1) + P(Ok = 0)).
- This objective ties the first- and second-stage probability estimates in their loss and gradient computation, which would slow down training prohibitively.
- Two lower bounds on this objective are derived, and they can be optimized jointly.
- The first lower bound uses Jensen’s inequality: log P(Ck = bg) ≥ P(Ok = 1) log P(Ck = bg | Ok = 1). (Eq. 3)
- This lower bound maximizes the log-likelihood of background of the second stage for any high-scoring object in the first stage.
- The second bound involves just the first-stage objective: log P(Ck = bg) ≥ log P(Ok = 0). (Eq. 4)
- Optimizing both bounds jointly is found to work better.
- Ideally, the tightest bound is obtained by using the maximum of Eq. (3) and Eq. (4). This lower bound is within log 2 of the actual objective.
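The background objective and its two lower bounds can be checked numerically. The sketch below is not from the paper's code. It writes p for P(Ok = 1) and q for P(Ck = bg | Ok = 1), evaluates the exact objective log(q·p + (1 − p)), the Jensen bound p·log q, and the first-stage bound log(1 − p) on a grid, and records the largest gap between the objective and the tighter bound. All function names are hypothetical.

```python
import math


def background_objective(p_obj, p_bg_given_obj):
    """Exact background log-likelihood: log(P(bg | O=1) P(O=1) + P(O=0))."""
    return math.log(p_bg_given_obj * p_obj + (1.0 - p_obj))


def jensen_bound(p_obj, p_bg_given_obj):
    """First lower bound (Jensen's inequality): P(O=1) * log P(bg | O=1)."""
    return p_obj * math.log(p_bg_given_obj)


def first_stage_bound(p_obj):
    """Second lower bound (first stage only): log P(O=0)."""
    return math.log(1.0 - p_obj)


def largest_gap(steps=99):
    """Scan probabilities strictly inside (0, 1) and return the largest gap
    between the exact objective and the maximum of the two bounds."""
    worst = 0.0
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            p = i / (steps + 1)  # first-stage objectness P(O=1)
            q = j / (steps + 1)  # second-stage P(bg | O=1)
            exact = background_objective(p, q)
            b_jensen = jensen_bound(p, q)
            b_first = first_stage_bound(p)
            # Both quantities must be valid lower bounds on the objective.
            assert b_jensen <= exact + 1e-12
            assert b_first <= exact + 1e-12
            worst = max(worst, exact - max(b_jensen, b_first))
    return worst


worst_gap = largest_gap()
print(f"largest gap on the grid: {worst_gap:.4f} (log 2 = {math.log(2):.4f})")
```

On this grid the gap stays below log 2, consistent with the claim above. A grid scan is only a sanity check, not a proof.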
With lower bound Eq. (4) and the positive objective Eq. (2), first-stage training reduces to a maximum-likelihood estimate with positive labels at annotated objects and negative labels for all other locations.
It is equivalent to training a binary one-stage detector, or an RPN with a strict negative definition that encourages likelihood estimation and not recall.
In the proposed probabilistic formulation, the classification score is multiplied by the class-agnostic detection score. This requires a strong first-stage detector that not only maximizes proposal recall, but also predicts a reliable object likelihood for each proposal. In the experiments, strong one-stage detectors are used to estimate this likelihood.
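As a minimal sketch of this score multiplication, the snippet below combines a first-stage objectness score with a second-stage class distribution for a single proposal. The function name, the dictionary layout, and the example numbers are assumptions for illustration only. They are not taken from the CenterNet2 code.

```python
def final_class_scores(objectness, class_probs):
    """Combine P(O=1) from the first stage with P(C=c | O=1) from the second.

    objectness:  first-stage class-agnostic likelihood P(O=1), in [0, 1].
    class_probs: dict mapping class name to P(C=c | O=1); the key "bg"
                 holds the second-stage background probability.
    """
    # Foreground classes: P(C=c) = P(C=c | O=1) * P(O=1).
    scores = {c: objectness * p for c, p in class_probs.items() if c != "bg"}
    # Background: P(bg) = P(bg | O=1) * P(O=1) + P(O=0), since P(bg | O=0) = 1.
    scores["bg"] = objectness * class_probs.get("bg", 0.0) + (1.0 - objectness)
    return scores


# Illustrative values only.
scores = final_class_scores(0.8, {"cat": 0.7, "dog": 0.2, "bg": 0.1})
print(scores)
```

Because the second-stage probabilities sum to 1, the combined scores also sum to 1. A low first-stage objectness therefore scales down every foreground score.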
2. CenterNet2
RetinaNet, CenterNet, GFL, and ATSS are tried as the first-stage detector. The probabilistic two-stage detector built on CenterNet performs best and is called CenterNet2.
- A two-stage detector typically uses FPN levels P2-P6, while most one-stage detectors use FPN levels P3-P7. To make them compatible, the authors use FPN levels P3-P7 for both one- and two-stage detectors. This modification slightly improves the baselines.
- The positive IoU threshold in the second stage is increased from 0.5 to 0.6 for Faster R-CNN.
- A maximum of 256 proposal boxes is used in the second stage for probabilistic two-stage detectors, and the default 1K boxes are used for RPN-based models.
- NMS threshold is increased from 0.5 to 0.7.
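The proposal budget in the list above can be sketched as a simple top-k selection on first-stage scores. This is an illustrative assumption about how such a cap might be applied, not the authors' implementation. The function name and the (box, score) tuple layout are hypothetical.

```python
def select_proposals(proposals, max_proposals=256):
    """Keep at most `max_proposals` proposals with the highest first-stage score.

    proposals: list of (box, score) tuples, where box is (x1, y1, x2, y2).
    """
    ranked = sorted(proposals, key=lambda item: item[1], reverse=True)
    return ranked[:max_proposals]


# 1000 dummy proposals with increasing scores, mimicking a 1K default budget.
dummy = [((0, 0, i + 1, i + 1), i / 1000) for i in range(1000)]
kept = select_proposals(dummy)
print(len(kept), kept[0][1])
```

With a cap this small, the ranking depends directly on how reliable the first-stage scores are.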
For the backbone, ResNet-50 is used by default. For the SOTA comparison, a large ResNeXt-32x8d-101-DCN is used. For the real-time model, a lightweight DLA is used.
3. Results
3.1. Performance Comparison
While most existing real-time detectors are one-stage, here the table shows that two-stage detectors can be as fast as one-stage designs, while delivering higher accuracy.
CenterNet2 outperforms the corresponding Cascade R-CNN model with the same backbone by 1.4 percentage points in mAP.
3.2. Region Proposals
Images at the Right: CenterNet2 uses fewer proposals.
Both Cascade R-CNN and CenterNet2 get faster with fewer proposals. However, the accuracy of the original Cascade R-CNN drops steeply as the number of proposals decreases while CenterNet2 performs well even with relatively few proposals.