Brief Review — CenterNet2: Probabilistic Two-Stage Detection

CenterNet2, Enhancing CenterNet Using Probabilistic Two-Stage Detector

5 min read · Aug 26, 2024

**A class-agnostic one-stage detector predicts object likelihood. A second stage then predicts a classification score conditioned on a detection.**

Probabilistic two-stage detection
CenterNet2, by UT Austin and Intel Labs
2021 CVPR, Over 280 Citations (Sik-Ho Tsang @ Medium)
Object Detection
2014 … 2023 [YOLOv7] [YOLOv8] [Lite DETR] 2024 [YOLOv9] [YOLOv10] [RT-DETR]

In two-stage detection pipelines, a standard region proposal network (RPN) cannot infer this object likelihood sufficiently well.
In this paper, a probabilistic two-stage detector is built on top of any state-of-the-art one-stage detector. The resulting detectors are faster and more accurate than both their one-stage and two-stage precursors.
Enhancing CenterNet with the proposed probabilistic two-stage framework yields CenterNet2.

Outline

1. Probabilistic Two-Stage Detector
2. CenterNet2
3. Results

1. Probabilistic Two-Stage Detector

**Left: One-Stage Detector, Middle: Two-Stage Detector, Right: Proposed Probabilistic Two-Stage Detector**

An object detector aims to predict the location bi and class-specific likelihood score si for any object i for a predefined set of classes C. The object location bi is most often described by two box corners or center+size.
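As a small illustration of the two box parameterizations mentioned above, the following sketch converts between corner format and center+size format. The function names and the (x1, y1, x2, y2) / (cx, cy, w, h) orderings are assumptions chosen for this example; the paper does not prescribe them.

```python
def corners_to_center_size(box):
    """Convert an (x1, y1, x2, y2) box to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def center_size_to_corners(box):
    """Convert a (cx, cy, w, h) box back to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Usage: a 4x2 box with its top-left corner at the origin.
center_box = corners_to_center_size((0, 0, 4, 2))
corner_box = center_size_to_corners(center_box)
```

Both forms carry the same information, so the choice mostly depends on what the detector head regresses.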

For each image, the goal is to produce a set of K detections as bounding boxes b1, …, bK, each with an associated class distribution sk(c) = P(Ck = c) over the classes c ∈ C ⋃ {bg}, where bg denotes background.
The detector has two parts: a class-agnostic object likelihood P(Ok) (first stage) and a conditional categorical classification P(Ck|Ok) (second stage).

Ok = 1 indicates a positive detection in the first stage, while Ok = 0 corresponds to background.
Any negative first-stage detection Ok = 0 leads to a background Ck = bg classification: P(Ck = bg | Ok = 0) = 1.
The joint class distribution of the two-stage model is then:

$$P(C_k) = \sum_{o \in \{0, 1\}} P(C_k \mid O_k = o)\, P(O_k = o) \tag{1}$$

The detector is trained by maximum likelihood estimation.
For annotated objects, it maximizes:

$$\log P(C_k) = \log P(C_k \mid O_k = 1) + \log P(O_k = 1) \tag{2}$$

which reduces to independent maximum-likelihood objectives for the first and second stage respectively.
For the background class, the maximum-likelihood objective does not factorize:

$$\log P(\text{bg}) = \log\big(P(\text{bg} \mid O_k = 1)\, P(O_k = 1) + P(O_k = 0)\big)$$

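This objective ties the first- and second-stage probability estimates together in their loss and gradient computation. Evaluating it exactly would require running the second stage densely on all first-stage outputs, which would slow down training prohibitively.
Instead, two lower bounds on the objective are derived and optimized jointly.
The first lower bound uses Jensen’s inequality:

$$\log P(\text{bg}) \ge P(O_k = 1)\, \log P(\text{bg} \mid O_k = 1) \tag{3}$$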

This lower bound maximizes the log-likelihood of background of the second stage for any high-scoring object in the first stage.
The second bound involves just the first-stage objective:

$$\log P(\text{bg}) \ge \log P(O_k = 0) \tag{4}$$

Optimizing both bounds jointly is found to work better.
Ideally, the tightest bound is obtained by taking the maximum of Eq. (3) and Eq. (4). This lower bound is within log 2 of the actual objective.
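The log 2 claim can be checked numerically. The sketch below is an illustration only, not the authors' code: it evaluates the exact background log-likelihood, the Jensen bound, and the first-stage bound on a grid of probabilities, then measures the largest gap between the exact value and the better of the two bounds. The function names and the grid resolution are assumptions made for this example.

```python
import math


def exact_bg_loglik(p_obj, p_bg_given_obj):
    """Exact objective: log(P(bg | O=1) * P(O=1) + P(O=0))."""
    return math.log(p_bg_given_obj * p_obj + (1.0 - p_obj))


def jensen_bound(p_obj, p_bg_given_obj):
    """Jensen lower bound: P(O=1) * log P(bg | O=1)."""
    return p_obj * math.log(p_bg_given_obj)


def first_stage_bound(p_obj):
    """First-stage lower bound: log P(O=0)."""
    return math.log(1.0 - p_obj)


def max_gap(steps=99):
    """Largest gap between the exact objective and the better bound."""
    worst = 0.0
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            a = i / (steps + 1)  # P(O_k = 1), strictly inside (0, 1)
            b = j / (steps + 1)  # P(bg | O_k = 1), strictly inside (0, 1)
            best = max(jensen_bound(a, b), first_stage_bound(a))
            worst = max(worst, exact_bg_loglik(a, b) - best)
    return worst


gap = max_gap()
print(gap, math.log(2))
```

On this grid the gap stays below log 2 and grows toward it as P(O_k = 1) approaches 1 and P(bg | O_k = 1) approaches 0, which is the regime where neither bound alone is tight.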

With lower bound Eq. (4) and the positive objective Eq. (2), first-stage training reduces to a maximum-likelihood estimate with positive labels at annotated objects and negative labels for all other locations.
It is equivalent to training a binary one-stage detector, or an RPN with a strict negative definition that encourages likelihood estimation and not recall.
In the proposed probabilistic formulation, the classification score is multiplied by the class-agnostic detection score. This requires a strong first-stage detector that not only maximizes the proposal recall, but also predicts a reliable object likelihood for each proposal. In the experiments, strong one-stage detectors are used to estimate this log-likelihood.
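To make the score multiplication concrete, here is a minimal sketch of how the final class distribution of a single detection could be assembled from the two stages. The function name `combine_scores` and the dictionary-based interface are hypothetical choices for this illustration; a real detector would operate on batched tensors.

```python
def combine_scores(p_obj, p_cls_given_obj):
    """Joint class distribution for one detection.

    p_obj: first-stage object likelihood P(O_k = 1).
    p_cls_given_obj: dict mapping class name -> P(C_k = c | O_k = 1),
        including the key "bg"; the values are expected to sum to 1.
    """
    # Foreground classes: conditional class score times object likelihood.
    joint = {c: p * p_obj for c, p in p_cls_given_obj.items() if c != "bg"}
    # Background collects the second-stage background mass plus the
    # first-stage rejection mass, because P(C_k = bg | O_k = 0) = 1.
    joint["bg"] = p_cls_given_obj["bg"] * p_obj + (1.0 - p_obj)
    return joint


# Usage: a proposal the first stage is only 50% sure about.
scores = combine_scores(0.5, {"cat": 0.8, "dog": 0.1, "bg": 0.1})
```

Because every foreground score is scaled by the first-stage likelihood, a poorly calibrated first stage directly distorts the final ranking, which is why the text stresses a strong first-stage detector.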

2. CenterNet2

**RetinaNet, CenterNet, GFL, and ATSS Performance Comparison**

RetinaNet, CenterNet, GFL, and ATSS are each tried as the first-stage detector. The probabilistic two-stage detector built on CenterNet performs best, and it is called CenterNet2.

A two-stage detector typically uses FPN levels P2-P6, while most one-stage detectors use FPN levels P3-P7. For compatibility, the authors use FPN levels P3-P7 for both one- and two-stage detectors. This modification slightly improves the baselines.
The positive IoU threshold in the second stage is increased from 0.5 to 0.6 for Faster R-CNN.
A maximum of 256 proposal boxes is used in the second stage for probabilistic two-stage detectors, and the default 1K boxes are used for RPN-based models.
The NMS threshold is increased from 0.5 to 0.7.
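The settings listed above can be collected in one place, as in the sketch below. The class name `TwoStageSettings` and its field names are hypothetical and do not correspond to configuration keys of any real library; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStageSettings:
    """Hypothetical container for the hyperparameters described in the text."""
    fpn_levels: tuple = ("P3", "P4", "P5", "P6", "P7")  # shared by both stages
    second_stage_pos_iou: float = 0.6  # raised from 0.5 (Faster R-CNN head)
    max_proposals: int = 256           # RPN-based models keep the default 1K
    nms_threshold: float = 0.7         # raised from 0.5


settings = TwoStageSettings()
print(settings)
```

Freezing the dataclass keeps the values from being changed by accident when the same settings object is shared across experiments.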

For the backbone, the default ResNet-50 is used. For the SOTA comparison, a large ResNeXt-32x8d-101-DCN is used. For the real-time model, a lightweight DLA is used.

3. Results

3.1. Performance Comparison

While most existing real-time detectors are one-stage, here the table shows that two-stage detectors can be as fast as one-stage designs, while delivering higher accuracy.

CenterNet2 outperforms the corresponding Cascade R-CNN model with the same backbone by 1.4 percentage points in mAP.

3.2. Region Proposals

Images at the Right: CenterNet2 uses fewer proposals.

Both Cascade R-CNN and CenterNet2 get faster with fewer proposals. However, the accuracy of the original Cascade R-CNN drops steeply as the number of proposals decreases, while CenterNet2 performs well even with relatively few proposals.


Written by Sik-Ho Tsang
