Brief Review — CenterNet2: Probabilistic Two-Stage Detection

CenterNet2, Enhancing CenterNet Using Probabilistic Two-Stage Detector

Sik-Ho Tsang
5 min read, Aug 26, 2024
A class-agnostic one-stage detector predicts object likelihood. A second stage then predicts a classification score conditioned on a detection.

Probabilistic two-stage detection
CenterNet2, by UT Austin and Intel Labs
2021 CVPR, Over 280 Citations (Sik-Ho Tsang @ Medium)


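  • In two-stage detection pipelines, the first stage should infer a class-agnostic object-vs-background likelihood, but a standard region proposal network (RPN) cannot infer this likelihood sufficiently well.
  • In this paper, a probabilistic two-stage detector is built on top of any state-of-the-art one-stage detector. The resulting detectors are faster and more accurate than both their one-stage and two-stage precursors.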
  • By enhancing CenterNet using the proposed probabilistic two-stage detector, CenterNet2 is formed.

Outline

  1. Probabilistic Two-Stage Detector
  2. CenterNet2
  3. Results

1. Probabilistic Two-Stage Detector

Left: One-Stage Detector, Middle: Two-Stage Detector, Right: Proposed Probabilistic Two-Stage Detector
  • An object detector aims to predict the location b_i and class-specific likelihood score s_i for any object i for a predefined set of classes C. The object location b_i is most often described by two box corners or center+size.

For each image, the goal in this paper is to produce a set of K detections as bounding boxes b_1, …, b_K, each with an associated class distribution s_k(c) = P(C_k = c) over classes c ∈ C ∪ {bg}, where bg denotes background.

There are two parts: A class-agnostic object likelihood P(O_k) (first stage) and a conditional categorical classification P(C_k | O_k) (second stage).

  • O_k = 1 indicates a positive detection in the first stage, while O_k = 0 corresponds to background.
  • Any negative first-stage detection O_k = 0 leads to a background C_k = bg classification: P(C_k = bg | O_k = 0) = 1.
  • The joint class distribution of the two-stage model then is:
  • The training objective is to use maximum likelihood estimation.
  • For annotated objects, it maximizes:
  • which reduces to independent maximum-likelihood objectives for the first and second stage respectively.
  • For the background class, the maximum-likelihood objective does not factorize:
  • This objective ties the first- and second-stage probability estimates in their loss and gradient computation, which would slow down training prohibitively.
  • Two lower bounds on this objective are derived, and these bounds can be optimized jointly.
  • The first lower bound uses Jensen’s inequality:
  • This lower bound maximizes the log-likelihood of background of the second stage for any high-scoring object in the first stage.
  • The second bound involves just the first-stage objective:
  • Optimizing both bounds jointly is found to work better.
  • Ideally, the tightest bound is obtained by using the maximum of Eq. (3) and Eq. (4). This lower bound is within log 2 of the actual objective.
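
The equations referenced in the list above can be reconstructed from the definitions given so far. The block below is a sketch of that reconstruction, writing O_k for the first-stage variable and C_k for the class variable of detection k. The numbering is inferred from the references to Eq. (2), (3) and (4) in this section and may not match the paper's typesetting exactly. Eq. (1) is the law of total probability over O_k. Eq. (2) holds for a foreground class because P(C_k = c | O_k = 0) = 0 for any c other than bg. Eq. (3) follows from Jensen's inequality applied to the concave logarithm, using log P(bg | O_k = 0) = log 1 = 0. Eq. (4) follows by dropping the non-negative term P(bg | O_k = 1) P(O_k = 1).

```latex
\begin{align}
P(C_k) &= \sum_{o \in \{0,1\}} P(C_k \mid O_k = o)\, P(O_k = o) \tag{1} \\
\log P(C_k) &= \log P(C_k \mid O_k = 1) + \log P(O_k = 1) \tag{2} \\
\log P(\mathrm{bg}) &= \log\big( P(\mathrm{bg} \mid O_k = 1)\, P(O_k = 1) + P(O_k = 0) \big) \notag \\
\log P(\mathrm{bg}) &\ge P(O_k = 1)\, \log P(\mathrm{bg} \mid O_k = 1) \tag{3} \\
\log P(\mathrm{bg}) &\ge \log P(O_k = 0) \tag{4}
\end{align}
```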

With lower bound Eq. (4) and the positive objective Eq. (2), first-stage training reduces to a maximum-likelihood estimate with positive labels at annotated objects and negative labels for all other locations.
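
As a sanity check on the two lower bounds of the background objective, the following minimal sketch evaluates them on a grid of probabilities. It assumes the exact background objective is log(P(bg | O=1) P(O=1) + P(O=0)), the Jensen bound is P(O=1) log P(bg | O=1), and the first-stage bound is log P(O=0). The function names are illustrative and are not taken from the paper or its code.

```python
import math

def background_log_likelihood(p_obj, p_bg_given_obj):
    """Exact objective: log(P(bg | O=1) * P(O=1) + P(O=0))."""
    return math.log(p_bg_given_obj * p_obj + (1.0 - p_obj))

def jensen_bound(p_obj, p_bg_given_obj):
    """Jensen lower bound: P(O=1) * log P(bg | O=1)."""
    return p_obj * math.log(p_bg_given_obj)

def first_stage_bound(p_obj):
    """First-stage-only lower bound: log P(O=0)."""
    return math.log(1.0 - p_obj)

def max_gap(steps=200):
    """Largest gap between the exact objective and the tighter bound on a grid."""
    worst = 0.0
    for i in range(1, steps):
        p_obj = i / steps                 # P(O=1) in (0, 1)
        for j in range(1, steps + 1):
            p_bg = j / steps              # P(bg | O=1) in (0, 1]
            exact = background_log_likelihood(p_obj, p_bg)
            tight = max(jensen_bound(p_obj, p_bg), first_stage_bound(p_obj))
            # Both expressions must stay below the exact objective
            # (small tolerance for floating-point rounding).
            assert tight <= exact + 1e-12
            worst = max(worst, exact - tight)
    return worst

worst_gap = max_gap()
```

On this grid, `worst_gap` stays below `math.log(2)`, which agrees with the claim that the maximum of the two bounds is within log 2 of the actual objective.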

It is equivalent to training a binary one-stage detector, or an RPN with a strict negative definition that encourages likelihood estimation and not recall.

In the proposed probabilistic formulation, the classification score is multiplied by the class-agnostic detection score. This requires a strong first-stage detector that not only maximizes the proposal recall, but also predicts a reliable object likelihood for each proposal. In the experiments, strong one-stage detectors are used to estimate this likelihood.
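
The score multiplication described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the function name, the dictionary layout and the `"bg"` key are assumptions made for the example. Foreground scores are P(O=1) · P(C=c | O=1), and the background score also absorbs the P(O=0) mass, since a negative first-stage detection is always background.

```python
def joint_class_scores(p_object, class_given_object):
    """Combine first-stage objectness with second-stage class probabilities.

    p_object: first-stage likelihood P(O=1) for one proposal.
    class_given_object: dict mapping class name -> P(C=c | O=1),
        expected to include a "bg" entry and to sum to 1.
    """
    if not 0.0 <= p_object <= 1.0:
        raise ValueError("p_object must be a probability")
    # Foreground: P(C=c) = P(C=c | O=1) * P(O=1), since P(C=c | O=0) = 0.
    scores = {c: p_object * p for c, p in class_given_object.items()}
    # Background: P(bg) = P(bg | O=1) * P(O=1) + 1 * P(O=0).
    scores["bg"] = scores.get("bg", 0.0) + (1.0 - p_object)
    return scores

example = joint_class_scores(0.9, {"cat": 0.7, "dog": 0.2, "bg": 0.1})
```

In this toy example the "cat" score becomes 0.9 × 0.7 = 0.63, so a confident class prediction on a weak proposal is pushed down by its low objectness.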

2. CenterNet2

RetinaNet, CenterNet, GFL, and ATSS Performance Comparison

Four one-stage detectors, RetinaNet, CenterNet, GFL, and ATSS, are tried as the first stage. The CenterNet-based probabilistic two-stage detector performs best and is named CenterNet2.

  • A two-stage detector typically uses FPN levels P2-P6, while most one-stage detectors use FPN levels P3-P7. To make them compatible, the authors use FPN levels P3-P7 for both one- and two-stage detectors. This modification slightly improves the baselines.
  • The positive IoU threshold in the second stage is increased from 0.5 to 0.6 for Faster R-CNN.
  • A maximum of 256 proposal boxes is used in the second stage for probabilistic two-stage detectors, and the default 1K boxes are used for RPN-based models.
  • NMS threshold is increased from 0.5 to 0.7.
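
To make the proposal budget and the positive IoU threshold from the list above concrete, here is a minimal pure-Python sketch. It assumes boxes in (x1, y1, x2, y2) format and proposals given as (score, box) pairs. The helper names `iou` and `select_and_label` are hypothetical and do not correspond to the released code or to any detection library's API.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_and_label(proposals, gt_boxes, pos_iou=0.6, max_proposals=256):
    """Keep the top-scoring proposals and mark each as positive or negative.

    A proposal is positive when its best IoU with any ground-truth box
    reaches pos_iou (0.6 here, versus the usual 0.5).
    """
    kept = sorted(proposals, key=lambda p: p[0], reverse=True)[:max_proposals]
    labeled = []
    for score, box in kept:
        best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
        labeled.append((score, box, best >= pos_iou))
    return labeled

gt = [(0.0, 0.0, 10.0, 10.0)]
props = [
    (0.9, (0.0, 0.0, 10.0, 8.0)),     # IoU 0.80 with the ground truth
    (0.8, (0.0, 0.0, 10.0, 5.5)),     # IoU 0.55: positive at 0.5, negative at 0.6
    (0.1, (20.0, 20.0, 30.0, 30.0)),  # no overlap
]
labels = [is_pos for _, _, is_pos in select_and_label(props, gt)]
```

The second proposal shows the effect of the stricter threshold: it would count as a positive at IoU 0.5 but is treated as a negative at 0.6.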

For the backbone, the default ResNet-50 is used. For the SOTA comparison, a large ResNeXt-32x8d-101-DCN is used. For the real-time model, a lightweight DLA is used.

3. Results

3.1. Performance Comparison

Real-Time Model Comparison

While most existing real-time detectors are one-stage, here the table shows that two-stage detectors can be as fast as one-stage designs, while delivering higher accuracy.

SOTA Comparison

CenterNet2 outperforms the corresponding Cascade R-CNN model with the same backbone by 1.4 percentage points in mAP.

3.2. Region Proposals

Visualization of Region Proposals

Images at the Right: CenterNet2 uses fewer proposals.

Both Cascade R-CNN and CenterNet2 get faster with fewer proposals. However, the accuracy of the original Cascade R-CNN drops steeply as the number of proposals decreases while CenterNet2 performs well even with relatively few proposals.
