Brief Review — YOLOX: Exceeding YOLO Series in 2021

Anchor-Free YOLO, Outperforms YOLOv4 and YOLOv5

3 min read · Mar 25, 2024

YOLOX (Figure from YOLOX GitHub)

YOLOX: Exceeding YOLO Series in 2021
YOLOX, by Megvii Technology
2021 arXiv v2, Over 3300 Citations (Sik-Ho Tsang @ Medium)
Object Detection
2014 … 2021 [Scaled-YOLOv4] [PVT, PVTv1] [Deformable DETR] [HRNetV2, HRNetV2p] [MDETR] [TPH-YOLOv5] 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] 2023 [YOLOv7]
YOLOX is proposed by switching the YOLO detector to an anchor-free manner and applying other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA.

Outline

YOLOX
Results

1. YOLOX

1.1. Decoupled Head

Decoupled Head

YOLOv3 is used as the baseline. Originally, a single (coupled) head in YOLOv3 predicts the classification, regression and objectness outputs together.

In YOLOX, a decoupled head is proposed. It contains a 1×1 conv layer to reduce the channel dimension, followed by two parallel branches, each with two 3×3 conv layers.

The lite decoupled head adds 1.1 ms of inference time (11.6 ms vs. 10.5 ms).
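To get a feel for where the extra cost comes from, the layer layout described above can be turned into a rough parameter count. This is only a sketch under stated assumptions: the hidden width of 256, the 80 classes, plain convolutions with biases (no normalization layers), and the reduction of the coupled head to a single 1×1 prediction conv are illustrative choices, not values taken from this review. The function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a k x k convolution with bias: weights + biases."""
    return c_in * c_out * k * k + c_out


def coupled_head_params(c_in, num_classes=80, num_anchors=3):
    # Assumption: one 1x1 conv predicts (classes + 4 box values + 1 objectness)
    # for each of the anchors at a location.
    return conv_params(c_in, num_anchors * (num_classes + 5), 1)


def decoupled_head_params(c_in, hidden=256, num_classes=80):
    # 1x1 conv that reduces the channel dimension.
    stem = conv_params(c_in, hidden, 1)
    # Two parallel branches, each with two 3x3 convs.
    branch = 2 * conv_params(hidden, hidden, 3)
    # Assumption: 1x1 output convs for classification, box regression
    # and objectness, with one prediction per location (anchor-free).
    outputs = (
        conv_params(hidden, num_classes, 1)
        + conv_params(hidden, 4, 1)
        + conv_params(hidden, 1, 1)
    )
    return stem + 2 * branch + outputs


if __name__ == "__main__":
    print("coupled  :", coupled_head_params(512))
    print("decoupled:", decoupled_head_params(512))
```

Under these assumptions the decoupled head carries noticeably more parameters per feature level, which is consistent with the small latency increase reported above.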

Decoupled head improves the convergence speed

The decoupled head also greatly improves the convergence speed during training.

1.2. Strong Data Augmentation

Mosaic (as used in YOLOv4) and MixUp are added for data augmentation.
For small models, MixUp is removed and Mosaic is weakened.

1.3. Anchor-Free

Originally, anchors obtained by clustering are used; they are domain-specific and generalize poorly. The anchor mechanism also increases the complexity of the detection heads, as well as the number of predictions for each image.

The anchor-free mechanism significantly reduces the number of design parameters. The predictions for each location are reduced from 3 to 1, and each prediction directly outputs 4 values, i.e., 2 offsets relative to the top-left corner of the grid cell, and the height and width of the predicted box.
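The decoding step and the drop in prediction count can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the strides (8, 16, 32) and the 640×640 input are assumed values, and the exponential applied to the width and height outputs is an assumption that follows a common decoding convention.

```python
import math


def decode_box(tx, ty, tw, th, grid_x, grid_y, stride):
    """Turn the 4 raw outputs of one grid cell into a box in image pixels.

    tx, ty: offsets relative to the top-left corner of the grid cell.
    tw, th: raw size outputs (assumed here to be in log space).
    """
    cx = (grid_x + tx) * stride
    cy = (grid_y + ty) * stride
    w = math.exp(tw) * stride
    h = math.exp(th) * stride
    return cx, cy, w, h


def num_predictions(img_size, strides=(8, 16, 32), per_location=1):
    """Total predictions over all feature levels for a square input."""
    return sum((img_size // s) ** 2 for s in strides) * per_location


if __name__ == "__main__":
    print(decode_box(0.5, 0.5, 0.0, 0.0, grid_x=3, grid_y=2, stride=8))
    print(num_predictions(640, per_location=3))  # anchor-based, 3 per location
    print(num_predictions(640, per_location=1))  # anchor-free, 1 per location
```

With these assumed settings, going from 3 predictions per location to 1 cuts the total from 25200 to 8400 for a 640×640 image.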

The center location of each object is assigned as the positive sample, and a scale range is pre-defined to designate the FPN level for each object.
With multi positives, the center 3×3 area is assigned as positives instead of the single center location.
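A literal reading of the center 3×3 rule can be sketched as follows. The function name, the `radius` parameter, and the clipping at the feature-map border are assumptions made for illustration; the review does not specify how cells outside the map are handled.

```python
def center_positive_cells(cx, cy, stride, fmap_w, fmap_h, radius=1):
    """Return the grid cells treated as positives for one object.

    cx, cy: object center in image pixels.
    stride: downsampling factor of the chosen FPN level.
    radius=1 gives the 3x3 neighbourhood around the center cell.
    """
    gx, gy = int(cx // stride), int(cy // stride)  # cell containing the center
    cells = []
    for y in range(gy - radius, gy + radius + 1):
        for x in range(gx - radius, gx + radius + 1):
            # Assumption: cells that fall outside the feature map are dropped.
            if 0 <= x < fmap_w and 0 <= y < fmap_h:
                cells.append((x, y))
    return cells


if __name__ == "__main__":
    print(center_positive_cells(100, 60, stride=8, fmap_w=80, fmap_h=80))
```

An object centered at (100, 60) on a stride-8 level falls in cell (12, 7), so the nine cells from (11, 6) to (13, 8) become positives; an object near the image corner gets fewer cells because of the clipping.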

1.4. SimOTA

Four key insights are identified for advanced label assignment: 1) loss/quality awareness, 2) center prior, 3) a dynamic number of positive anchors for each ground truth (abbreviated as dynamic top-k), and 4) a global view.
SimOTA first calculates the pair-wise matching degree, represented by a cost, for each prediction-ground-truth pair.
In SimOTA, the cost between ground truth $g_i$ and prediction $p_j$ is calculated as:

$$c_{ij} = L_{ij}^{cls} + \lambda L_{ij}^{reg}$$

where $\lambda$ is a balancing coefficient, and $L_{ij}^{cls}$ and $L_{ij}^{reg}$ are the classification loss and regression loss between $g_i$ and $p_j$.

For each ground truth g_i, YOLOX selects the top k predictions with the least cost within a fixed center region as its positive samples. Finally, the grids corresponding to those positive predictions are assigned as positives, while the remaining grids are negatives.
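The selection step described above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: the cost is taken as the classification loss plus a weighted regression loss, the weight `lam=3.0` is an assumed value, `k` is fixed rather than dynamic, and the tie-breaking rule for a prediction claimed by several ground truths (keep the lowest cost) is also an assumption. The function name `simota_assign` is hypothetical.

```python
def simota_assign(cls_loss, reg_loss, in_center, k=2, lam=3.0):
    """Assign predictions to ground truths by lowest cost.

    cls_loss[i][j], reg_loss[i][j]: losses between ground truth i and prediction j.
    in_center[i][j]: True if prediction j lies in the center region of ground truth i.
    Returns a dict {prediction index: ground-truth index} for positives;
    every prediction not in the dict is a negative.
    """
    num_gt = len(cls_loss)
    num_pred = len(cls_loss[0]) if num_gt else 0
    inf = float("inf")

    # Pair-wise cost; predictions outside the center region are excluded.
    cost = [
        [
            cls_loss[i][j] + lam * reg_loss[i][j] if in_center[i][j] else inf
            for j in range(num_pred)
        ]
        for i in range(num_gt)
    ]

    assigned = {}  # prediction index -> (ground-truth index, cost)
    for i in range(num_gt):
        candidates = [j for j in range(num_pred) if cost[i][j] < inf]
        top_k = sorted(candidates, key=lambda j: cost[i][j])[:k]
        for j in top_k:
            # Assumption: a prediction wanted by two ground truths
            # goes to the one with the lower cost.
            if j not in assigned or cost[i][j] < assigned[j][1]:
                assigned[j] = (i, cost[i][j])

    return {j: gt for j, (gt, _) in assigned.items()}


if __name__ == "__main__":
    cls_loss = [[0.1, 0.5, 0.9, 0.2], [0.8, 0.9, 0.1, 0.1]]
    reg_loss = [[0.1, 0.2, 0.5, 0.1], [0.5, 0.5, 0.1, 0.05]]
    in_center = [[True, True, False, True], [False, True, True, True]]
    print(simota_assign(cls_loss, reg_loss, in_center))
```

In this toy example, prediction 3 is among the two cheapest candidates of both ground truths and ends up with ground truth 1, where its cost is lower; prediction 1 is selected by neither and stays negative.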

SimOTA raises the detector from 45.0% AP to 47.3% AP.

Component Increment

The corresponding increment of each component is as shown above.

2. Results

YOLOX, developed from YOLOv3, outperforms YOLOv5.

YOLOX-Nano, an even smaller model, is also developed.

SOTA Comparisons

YOLOX outperforms YOLOv3, YOLOv4, YOLOv5 and EfficientDet.
