Review — ORE: Towards Open World Object Detection

Probably The First Open World Object Detection Paper

Sik-Ho Tsang
7 min read · Aug 30, 2024
Open World Object Detection is a novel problem that has not been formally defined and addressed so far.

Towards Open World Object Detection
ORE
, by Indian Institute of Technology Hyderabad, Mohamed bin Zayed University of AI, Australian National University, and Linköping University
2021 CVPR, Over 480 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2023
[YOLOv7] [YOLOv8] [Lite DETR] 2024 [YOLOv9] [YOLOv10] [RT-DETR]
==== My Other Paper Readings Are Also Over Here ====

  • A novel problem ‘Open World Object Detection’ is studied, where a model is tasked to: 1) identify objects that have not been introduced to it as ‘unknown’, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received.
  • This problem is formulated, a strong evaluation protocol is introduced, and a novel solution is provided, called ORE: Open World Object Detector, based on contrastive clustering and energy-based unknown identification (EBUI).
  • (It is amazing that Faster R-CNN can still be used and modified to have a paper published in 2021 CVPR.)

Outline

  1. Open World Object Detection Problem
  2. ORE: Open World Object Detector
  3. Results

1. Open World Object Detection Problem

1.1. Symbols

At any time t, there is a set of known object classes Kt = {1, 2, …, C}. There is also a set of unknown classes U = {C+1, …}, which may be encountered during inference.

  • The known object classes Kt are assumed to be labeled in the dataset Dt = {Xt, Yt} where X and Y denote the input images and labels respectively.
  • The input image set comprises M training images, Xt = {I1, …, IM}, with label set Yt = {Y1, …, YM}.
  • Each Yi encodes a set of K object instances with their class labels and locations, i.e. yk = [lk, xk, yk, wk, hk], where lk ∈ Kt and xk, yk, wk, hk denote the bounding box center coordinates, width and height respectively (a small sketch of this structure is given below).
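To make the notation concrete, here is a minimal sketch of what one annotation yk and an image label Yi could look like in code; the class and field names are illustrative, not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical annotation structure mirroring y_k = [l_k, x_k, y_k, w_k, h_k];
# names are illustrative only.
@dataclass
class ObjectAnnotation:
    label: int   # l_k, a class id in K_t (0 is reserved for 'unknown')
    cx: float    # bounding box center x
    cy: float    # bounding box center y
    w: float     # bounding box width
    h: float     # bounding box height

# The label Y_i of one image is then simply a list of such instances:
image_label: List[ObjectAnnotation] = [
    ObjectAnnotation(label=3, cx=0.40, cy=0.55, w=0.20, h=0.30),
    ObjectAnnotation(label=7, cx=0.72, cy=0.48, w=0.15, h=0.25),
]
```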

1.2. Problem Formulation

An object detection model M_C is trained to detect all the previously encountered C object classes, and can also recognize a new or unseen class instance by classifying it as an unknown, denoted by a label zero (0).

The unknown set of instances Ut can then be forwarded to a human user who can identify n new classes of interest and provide their training examples.

The learner incrementally adds n new classes and updates itself to produce an updated model M_C+n without retraining from scratch on the whole dataset.

  • The known class set is also updated: Kt+1 = Kt ∪ {C+1, …, C+n}. This cycle continues over the life of the object detector, where it adaptively updates itself with new knowledge (a minimal sketch of this cycle is given below).
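As a rough illustration of this cycle, here is a minimal Python sketch; the detector/oracle interfaces (detect, label_unknowns, incremental_update) are hypothetical placeholders, not the authors' API.

```python
# A minimal sketch of the open-world learning cycle, under the assumption of
# hypothetical detector/oracle interfaces (not the authors' code).

def open_world_loop(detector, data_streams, oracle):
    known_classes = set(detector.known_classes)        # K_t
    for data_t in data_streams:                        # one incremental step per task
        # Known classes are detected with their own labels; confident detections
        # outside K_t are flagged as 'unknown' (label 0).
        detections = detector.detect(data_t)
        unknowns = [d for d in detections if d.label == 0]
        # A human oracle identifies n new classes of interest among the unknowns
        # and provides training examples for them.
        new_classes, labelled_examples = oracle.label_unknowns(unknowns)
        # The model is updated incrementally (no retraining from scratch),
        # typically together with a balanced exemplar set to limit forgetting.
        detector.incremental_update(labelled_examples, new_classes)
        known_classes |= set(new_classes)              # K_{t+1} = K_t ∪ {C+1, ..., C+n}
    return detector, known_classes
```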

2. ORE: Open World Object Detector

ORE: Open World Object Detector

It is hypothesised that learning clear discrimination between classes in the latent space of object detectors could have twofold effects:

  • First, it helps the model to identify how the feature representation of an unknown instance is different from the other known instances, which helps identify an unknown instance as a novelty.
  • Second, it facilitates learning feature representations for the new class instances without overlapping with the previous classes in the latent space, which helps towards incrementally learning without forgetting.

2.1. Contrastive Clustering (CC)

  • Contrastive clustering is a natural way to have class separation in the latent space.
  • For each known class i ∈ Kt, a prototype vector pi is maintained. The mean of the stored feature vectors is used to compute the prototype vector, and the class prototypes are gradually evolved. A fixed-length queue qi is maintained per class to store the corresponding features.
  • fc is a feature vector from an intermediate layer for an object of class c.
  • The contrastive loss is as follows:
    ℓcont(fc) = Σ_{i=0…C} ℓ(fc, pi), with ℓ(fc, pi) = D(fc, pi) if i = c, and max{0, Δ − D(fc, pi)} otherwise,
  • where D is any distance function and Δ defines how close a similar and dissimilar item can be.
  • The loss computation starts only after a certain number of burn-in iterations (Ib) have been completed.
  • After every Ip iterations, a set of new class prototypes Pnew is computed. The existing prototypes P are then updated by a weighted combination of P and Pnew.

The computed clustering loss is added to the standard detection loss.
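A minimal PyTorch-style sketch of such a contrastive clustering loss is given below, assuming Euclidean distance for D and a margin Δ; the function and tensor names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_clustering_loss(f_c: torch.Tensor, class_id: int,
                                prototypes: torch.Tensor, delta: float = 1.0):
    """f_c: d-dim RoI feature of class `class_id`; prototypes: (C, d) class prototypes."""
    dist = torch.norm(prototypes - f_c, dim=1)      # D(f_c, p_i) for every class i
    pull = dist[class_id]                           # attract f_c to its own prototype
    mask = torch.ones_like(dist, dtype=torch.bool)
    mask[class_id] = False
    push = F.relu(delta - dist[mask]).sum()         # hinge: keep other prototypes >= delta away
    return pull + push

# Prototypes themselves are refreshed every I_p iterations from the per-class
# feature queues, e.g. p_i <- momentum * p_i + (1 - momentum) * mean(queue_i)
# (the exact update weighting here is an assumption).
```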

2.2. Auto-labelling Unknowns (ALU) with RPN

  • It is infeasible to manually label unknown-class objects.
  • Therefore, given an input image, RPN is utilized to generate a set of bounding box predictions for foreground and background instances, along with the corresponding objectness scores.

Those proposals that have a high objectness score but do not overlap with a ground-truth object are labelled as potential unknown objects. The top-k such background region proposals, sorted by their objectness scores, are selected as unknown objects.
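A minimal sketch of this auto-labelling step is shown below, assuming proposals in (x1, y1, x2, y2) format; the function name and the zero-IoU overlap test are illustrative choices, not necessarily the authors' exact criteria.

```python
import torch
from torchvision.ops import box_iou

def autolabel_unknowns(proposals: torch.Tensor, objectness: torch.Tensor,
                       gt_boxes: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    """Return the top-k highest-objectness proposals that do not overlap any GT box."""
    if gt_boxes.numel() > 0:
        max_iou = box_iou(proposals, gt_boxes).max(dim=1).values
    else:
        max_iou = torch.zeros(proposals.shape[0])
    is_background = max_iou == 0                    # no overlap with any ground-truth box
    scores = objectness.clone()
    scores[~is_background] = float('-inf')          # rule out proposals on known objects
    k = min(top_k, int(is_background.sum()))
    idx = scores.topk(k).indices                    # keep the most object-like leftovers
    return proposals[idx]                           # pseudo-labelled as 'unknown' (class 0)
```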

2.3. Energy Based Unknown Identifier (EBUI)

  • Given the features f ∈ F in the latent space F and their corresponding labels l ∈ L, an energy function E(F, L) is learnt. The formulation is based on the Energy-Based Models (EBMs) of LeCun et al. [27].
  • The Helmholtz free energy formulation is used, where the energies for all values in L are combined:
    E(f) = −T log ∫_{l′} exp(−E(f, l′)/T)
  • where T is the temperature parameter. Relating this to the softmax output of the classification head, the class posterior can be written as a Gibbs distribution:
    p(l|f) = exp(gl(f)/T) / Σi exp(gi(f)/T)
  • where p(l|f) is the probability density for a label l and gl(f) is the l-th classification logit of the classification head g(.).
  • The free energy of the classification model, in terms of its logits, is then defined as:
    E(f; g) = −T log Σi exp(gi(f)/T)

The above equation provides us a natural way to transform the classification head of the standard Faster R-CNN to an energy function.

  • We can see a clear separation between the energy levels of known-class data points and unknown data points, as shown above.
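In code, the logit-to-energy transformation above reduces to a logsumexp over the classification logits. A minimal sketch, assuming a temperature T and leaving out the distribution fitting used to separate known from unknown energies:

```python
import torch

def free_energy(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """logits: (N, C) classification logits; returns an (N,) energy score per instance."""
    # E(f; g) = -T * log sum_i exp(g_i(f) / T)
    return -T * torch.logsumexp(logits / T, dim=1)

# Usage sketch: instances whose energy falls on the 'unknown' side of the learned
# separation between known and unknown energy distributions are labelled unknown.
# energies = free_energy(classification_logits)
```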

2.4. Alleviating Forgetting

  • A balanced set of exemplars is stored, and the model is fine-tuned on them after each incremental step (a small sketch is given below).
  • At each point, a minimum of Nex instances for each class are present in the exemplar set.
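A minimal sketch of assembling such a balanced exemplar set is shown below; the random per-class sampling is an assumption for illustration, not necessarily the authors' selection strategy.

```python
import random
from collections import defaultdict

def build_exemplar_set(annotations, n_ex=50, seed=0):
    """annotations: iterable of (image_id, class_id) pairs over all known classes."""
    random.seed(seed)
    per_class = defaultdict(list)
    for image_id, class_id in annotations:
        per_class[class_id].append(image_id)
    exemplars = set()
    for class_id, image_ids in per_class.items():
        random.shuffle(image_ids)
        # keep images until roughly N_ex instances of this class are covered
        exemplars.update(image_ids[:n_ex])
    return exemplars
```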

3. Results

3.1. Open World Evaluation Protocol

Task composition in the proposed Open World evaluation protocol
  • Classes are grouped into a set of tasks T = {T1, …, Tt, …}. Model is trained task by task.
  • All PASCAL VOC classes and data form the first task T1.
  • The remaining 60 classes of MS-COCO are grouped into 3 successive tasks (T2 to T4) with semantic drifts.
  • For evaluation, the PASCAL VOC test split and the MS-COCO val split are used.
  • 1k images from the training data of each task are kept aside for validation.

3.2. Metrics

  • Since an unknown object easily gets confused with a known object, the Wilderness Impact (WI) metric [8] is used to explicitly characterise this behaviour:
    WI = P_K / P_K∪U − 1
  • where P_K refers to the precision of the model when evaluated on known classes and P_K∪U is the precision when evaluated on known and unknown classes, measured at a recall level R = 0.8.

WI should be low, since the precision should not drop when unknown objects are added to the test set.

Besides WI, the Absolute Open-Set Error (A-OSE) [43] is used to report the count of unknown objects that get wrongly classified as any of the known classes.

  • Both WI and A-OSE implicitly measure how effective the model is in handling unknown objects.
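For concreteness, here is a small sketch of the WI computation from the two precision values defined above; matching both evaluations at the recall level R = 0.8 is assumed to be handled by the surrounding evaluation code.

```python
def wilderness_impact(precision_known: float, precision_known_unknown: float) -> float:
    """WI = P_K / P_K∪U - 1, measured at a fixed recall level (0.8 in the paper)."""
    return precision_known / precision_known_unknown - 1.0

# Example: if precision drops from 0.80 to 0.72 once unknown objects enter the
# test set, WI = 0.80 / 0.72 - 1 ≈ 0.111 (lower is better).
```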

3.3. Open World Object Detection Results

Comparisons with Faster R-CNN

ORE has significantly lower WI and A-OSE scores, owing to an explicit modeling of the unknown.

  • When unknown classes are progressively labelled in Task 2, it can be seen that the performance of the baseline detector on the known set of classes (quantified via mAP) significantly deteriorates from 56.16% to 4.076%. The proposed balanced finetuning is able to restore the previous class performance to a respectable level (51.09%) at the cost of increased WI and A-OSE.
  • A similar trend is seen when Task 3 classes are added.
  • WI and A-OSE scores cannot be measured for Task 4 because of the absence of any unknown ground-truths.
Qualitative Results
  • Qualitative Results at Task 1 are shown above.

3.4. Incremental Object Detection (iOD) Results

SOTA Comparisons

ORE performs favorably on the incremental object detection (iOD) task against the state-of-the-art, as shown above.

This is because ORE reduces the confusion of an unknown object being classified as a known object, which lets the detector incrementally learn the true foreground objects.

  • (There are ablation studies in the paper, please feel free to read the paper if interested.)


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.