Review — YOLOv10: Real-Time End-to-End Object Detection
NMS-Free Training and Holistic Efficiency-Accuracy Driven Model Design Are Proposed
YOLOv10: Real-Time End-to-End Object Detection
YOLOv10, by Tsinghua University
2024 arXiv v1 (Sik-Ho Tsang @ Medium)
Object Detection
2014 … 2021 [Scaled-YOLOv4] [PVT, PVTv1] [Deformable DETR] [HRNetV2, HRNetV2p] [MDETR] [TPH-YOLOv5] 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] [Lite DETR] 2024 [YOLOv9]
==== My Other Paper Readings Are Also Over Here ====
- YOLOv10 aims to further advance the performance-efficiency boundary of YOLOs from both the post-processing and the model architecture.
- First, consistent dual assignments for NMS-free training are proposed, which bring competitive performance and low inference latency simultaneously.
- Second, a holistic efficiency-accuracy driven model design strategy is introduced. Various components of YOLOs are comprehensively optimized from both the efficiency and accuracy perspectives, which greatly reduces the computational overhead and enhances the capability.
Outline
- YOLOv10: Consistent Dual Assignments for NMS-free Training
- YOLOv10: Holistic Efficiency-Accuracy Driven Model Design
- Results
1. YOLOv10: Consistent Dual Assignments for NMS-free Training
- YOLOs rely on NMS post-processing, which causes suboptimal inference efficiency. An NMS-free training strategy is used instead.
1.1. Dual Label Assignments
- Unlike one-to-many assignment, one-to-one matching assigns only one prediction to each ground truth, avoiding the NMS post-processing. However, it leads to weak supervision.
Dual label assignments are proposed for YOLOs to combine the best of both strategies.
- As in Fig. 2(a), another one-to-one head is incorporated for YOLOs. It retains the identical structure and adopts the same optimization objectives as the original one-to-many branch but leverages the one-to-one matching to obtain label assignments.
- During training, two heads are jointly optimized with the model, allowing the backbone and neck to enjoy the rich supervision provided by the one-to-many assignment.
During inference, the one-to-many head is discarded and only the one-to-one head is used to make predictions.
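Below is a minimal PyTorch-style sketch of the dual-head idea (an illustration only, not the official implementation; the module names, channel sizes, and output layout are assumptions): both heads see the same neck features and are supervised jointly during training, while only the one-to-one head is kept at inference.

```python
import torch
import torch.nn as nn

class DualAssignHead(nn.Module):
    """Illustrative dual-head wrapper: a one-to-many head (trained with
    one-to-many label assignment) and a one-to-one head (trained with
    one-to-one matching). Both branches share the same input features."""

    def __init__(self, in_ch=256, num_outputs=4 + 80):  # hypothetical: box (4) + 80 classes
        super().__init__()
        # Identical structures, independent weights (hypothetical 1x1 heads).
        self.one2many = nn.Conv2d(in_ch, num_outputs, kernel_size=1)
        self.one2one = nn.Conv2d(in_ch, num_outputs, kernel_size=1)

    def forward(self, feat):
        if self.training:
            # Both heads are supervised; the rich one-to-many signal still
            # shapes the shared backbone/neck through the common features.
            return self.one2many(feat), self.one2one(feat)
        # Inference: the one-to-many head is discarded; the one-to-one head
        # yields one prediction per object, so no NMS is needed.
        return self.one2one(feat)

feat = torch.randn(1, 256, 20, 20)
head = DualAssignHead().eval()
with torch.no_grad():
    preds = head(feat)  # single-branch output at inference
```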
1.2. Consistent Matching Metric
- To achieve prediction-aware matching for both branches, a uniform matching metric is used: m(α, β) = s · p^α · IoU(b̂, b)^β (a small numeric sketch is given at the end of this subsection),
- where p is the classification score, b̂ and b denote the bounding boxes of the prediction and the instance, respectively, and s represents the spatial prior indicating whether the anchor point of the prediction is within the instance.
- The one-to-many and one-to-one metrics are denoted as m_o2m = m(α_o2m, β_o2m) and m_o2o = m(α_o2o, β_o2o), respectively.
In dual label assignments, the one-to-many branch provides much richer supervisory signals than the one-to-one branch. Intuitively, if the supervision of the one-to-one head can be harmonized with that of the one-to-many head, the one-to-one head can be optimized towards the direction of the one-to-many head's optimization.
- The paper analyzes the supervision gap between the two heads and shows that it is minimized when the one-to-one metric is aligned with the one-to-many one, i.e., α_o2o = r·α_o2m and β_o2o = r·β_o2m; simply taking r = 1, i.e., the same α and β for both heads, gives the consistent matching metric.
- (Please read the paper directly if interested.)
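A small numeric sketch of the matching metric m(α, β) = s · p^α · IoU^β (the α/β values below are illustrative defaults, not necessarily the official configuration; consistency here means the one-to-one head reuses the same α and β as the one-to-many head):

```python
import torch

def matching_metric(cls_score, iou, inside_prior, alpha=0.5, beta=6.0):
    """m(alpha, beta) = s * p**alpha * IoU**beta
    cls_score:    p, predicted score of the ground-truth class, shape (N,)
    iou:          IoU between predicted boxes and the ground-truth box, shape (N,)
    inside_prior: s, 1.0 if the prediction's anchor point lies inside the
                  instance, else 0.0, shape (N,)
    alpha, beta:  illustrative hyperparameters."""
    return inside_prior * cls_score.pow(alpha) * iou.pow(beta)

# Toy example: three candidate predictions for one ground-truth object.
p   = torch.tensor([0.9, 0.6, 0.8])
iou = torch.tensor([0.7, 0.9, 0.2])
s   = torch.tensor([1.0, 1.0, 0.0])

m = matching_metric(p, iou, s)
best = m.argmax()   # one-to-one assignment keeps only the top-1 candidate;
                    # one-to-many would keep the top-k by the same metric.
```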
2. Holistic Efficiency-Accuracy Driven Model Design
2.1. Lightweight Classification Head
- The regression head carries more significance for the performance of YOLOs than the classification head. Consequently, the overhead of the classification head can be reduced without greatly hurting performance.
Therefore, a lightweight architecture is adopted for the classification head, consisting of two depthwise-separable convolutions with a kernel size of 3×3, followed by a 1×1 convolution.
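A hedged sketch of such a lightweight classification head (channel sizes and the conv/BN/activation ordering are assumptions, not the official code): two 3×3 depthwise-separable convolutions followed by a 1×1 convolution producing the class logits.

```python
import torch
import torch.nn as nn

def dwsep_conv(c_in, c_out, k=3):
    """Depthwise-separable conv: k x k depthwise + 1x1 pointwise (assumed layout)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        nn.SiLU(),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class LightClsHead(nn.Module):
    def __init__(self, c_in=256, num_classes=80):
        super().__init__()
        self.block = nn.Sequential(
            dwsep_conv(c_in, c_in),             # depthwise-separable 3x3
            dwsep_conv(c_in, c_in),             # depthwise-separable 3x3
            nn.Conv2d(c_in, num_classes, 1),    # 1x1 conv -> class logits
        )

    def forward(self, x):
        return self.block(x)

logits = LightClsHead()(torch.randn(1, 256, 20, 20))  # -> (1, 80, 20, 20)
```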
2.2. Spatial-channel Decoupled Downsampling
- YOLOs typically leverage regular 3×3 standard convolutions with a stride of 2 to achieve spatial downsampling and channel transformation simultaneously, which introduces non-negligible computational cost.
Decoupling the spatial reduction and channel increase operations enables more efficient downsampling.
- YOLOv10 first leverages a pointwise convolution to modulate the channel dimension and then utilizes a depthwise convolution (with stride 2) to perform spatial downsampling.
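A minimal sketch of the decoupled downsampling under the assumed layout (1×1 pointwise conv for the channel change, then a stride-2 3×3 depthwise conv for the spatial reduction), with a rough parameter comparison against the usual stride-2 3×3 standard conv:

```python
import torch
import torch.nn as nn

class DecoupledDownsample(nn.Module):
    """Pointwise conv changes channels first, then a stride-2 depthwise conv
    halves the spatial resolution (illustrative ordering per the description)."""

    def __init__(self, c_in=128, c_out=256):
        super().__init__()
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)   # channel transform
        self.dw = nn.Conv2d(c_out, c_out, kernel_size=3, stride=2,
                            padding=1, groups=c_out, bias=False)      # spatial downsample

    def forward(self, x):
        return self.dw(self.pw(x))

x = torch.randn(1, 128, 40, 40)
y = DecoupledDownsample()(x)                                         # -> (1, 256, 20, 20)

# Rough parameter count vs. a standard stride-2 3x3 conv:
std = nn.Conv2d(128, 256, 3, stride=2, padding=1, bias=False)
n_std = sum(p.numel() for p in std.parameters())                     # 294,912
n_dec = sum(p.numel() for p in DecoupledDownsample().parameters())   # 35,072
```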
2.3. Rank-guided Block Design
- Intrinsic rank [31, 15] is used to analyze the redundancy of each stage. A lower rank implies greater redundancy.
- As in Fig. 3(a), in YOLOv8, deep stages and large models are prone to exhibit more redundancy.
As in Fig. 3(b), a compact inverted block (CIB) structure is proposed, which adopts the cheap depthwise convolutions for spatial mixing and cost-effective pointwise convolutions for channel mixing.
- Given a model, all stages are sorted based on their intrinsic ranks in ascending order. Then, the performance variation from replacing the basic block in the leading (most redundant) stage with the CIB is inspected.
- If there is no performance degradation compared with the given model, the replacement proceeds to the next stage; otherwise, the process halts.
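A hedged sketch of a CIB-style block and the greedy rank-guided replacement loop (the exact official CIB layout, channel expansion, and training/evaluation hooks differ; everything below is illustrative):

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, groups=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class CIBLike(nn.Module):
    """Simplified compact inverted block: cheap depthwise convs for spatial
    mixing, pointwise convs for channel mixing, with a residual connection."""

    def __init__(self, c, expansion=2):
        super().__init__()
        c_mid = c * expansion
        self.block = nn.Sequential(
            conv_bn_act(c, c, 3, groups=c),              # depthwise, spatial mixing
            conv_bn_act(c, c_mid, 1),                    # pointwise expand, channel mixing
            conv_bn_act(c_mid, c_mid, 3, groups=c_mid),  # depthwise, spatial mixing
            conv_bn_act(c_mid, c, 1),                    # pointwise project back
        )

    def forward(self, x):
        return x + self.block(x)

y = CIBLike(64)(torch.randn(1, 64, 40, 40))   # shape preserved: (1, 64, 40, 40)

def rank_guided_allocation(stage_ranks, replace_with_cib, evaluate, baseline_ap):
    """Greedy allocation sketch. stage_ranks maps stage name -> intrinsic rank;
    replace_with_cib and evaluate are hypothetical hooks into the training
    pipeline (the caller reverts the last replacement if it degrades AP)."""
    for stage in sorted(stage_ranks, key=stage_ranks.get):   # most redundant first
        replace_with_cib(stage)
        ap = evaluate()
        if ap < baseline_ap:   # degradation -> stop (and revert this stage)
            break
        baseline_ap = ap       # keep the replacement and move to the next stage
```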
2.4. Accuracy Driven Model Design
- Large-kernel convolution and self-attention are further explored for accuracy-driven design.
2.4.1. Large-Kernel Convolution
Large-kernel depthwise convolutions are used in the CIB within the deep stages. Specifically, the kernel size of the second 3×3 depthwise convolution in the CIB is increased to 7×7, following ConvNeXt.
- Additionally, the structural reparameterization technique (RepVGG, RepLKNet and RepViT) [10, 9, 53] is employed to bring another 3×3 depthwise convolution branch to alleviate the optimization issue without inference overhead.
- Furthermore, as the model size increases, its receptive field naturally expands, with the benefit of using large-kernel convolutions diminishing.
Therefore, large-kernel convolution is only adopted for small model scales.
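A hedged sketch of the large-kernel idea with RepVGG-style structural reparameterization (BatchNorm and the surrounding CIB structure are omitted for brevity; the 7×7 and 3×3 sizes follow the text): train with a 7×7 depthwise conv plus a parallel 3×3 depthwise branch, then fold the 3×3 kernel into the 7×7 one so inference sees a single conv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepLargeKernelDW(nn.Module):
    """Training: 7x7 depthwise conv + parallel 3x3 depthwise branch.
    Inference: the 3x3 kernel is zero-padded to 7x7 and merged into the 7x7
    branch, so the extra branch adds no inference overhead."""

    def __init__(self, c):
        super().__init__()
        self.dw7 = nn.Conv2d(c, c, 7, padding=3, groups=c)
        self.dw3 = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.fused = False

    def fuse(self):
        # Pad the 3x3 kernel to 7x7 and fold it (and its bias) into the 7x7 branch.
        w3 = F.pad(self.dw3.weight.data, [2, 2, 2, 2])   # (c,1,3,3) -> (c,1,7,7)
        self.dw7.weight.data += w3
        self.dw7.bias.data += self.dw3.bias.data
        self.fused = True
        return self

    def forward(self, x):
        if self.fused:
            return self.dw7(x)
        return self.dw7(x) + self.dw3(x)

x = torch.randn(1, 64, 20, 20)
m = RepLargeKernelDW(64)
y_train = m(x)
y_fused = m.fuse()(x)
assert torch.allclose(y_train, y_fused, atol=1e-5)   # both branches agree after fusion
```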
2.4.2. Partial Self-Attention (PSA)
- As in Fig. 3(c), an efficient PSA module design is proposed.
Specifically, the features are evenly partitioned across channels into two parts after a 1×1 convolution. Only one part is fed into the N_PSA blocks, each comprising a multi-head self-attention module (MHSA) and a feed-forward network (FFN).
- The two parts are then concatenated and fused by a 1×1 convolution.
- Besides, YOLOv10 follows LeViT to assign the dimensions of the query and key to half of that of the value in MHSA and replace the LayerNorm with BatchNorm for fast inference.
- Furthermore, PSA is only placed after the Stage 4 with the lowest resolution, avoiding the excessive overhead from the quadratic computational complexity of self-attention.
In this way, the global representation learning ability can be incorporated into YOLOs with low computational costs.
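A hedged PSA-style sketch (head count, channel sizes, and the MHSA/FFN internals are assumptions; the LeViT-style tweaks mentioned above, such as halved query/key dimensions and BatchNorm, are omitted for brevity): split channels after a 1×1 conv, run attention + FFN on one half only, then concatenate and fuse with a 1×1 conv.

```python
import torch
import torch.nn as nn

class PSALike(nn.Module):
    """Partial self-attention sketch: only half of the channels pass through
    MHSA + FFN; the other half is left untouched, then both halves are
    concatenated and fused by a 1x1 conv."""

    def __init__(self, c=256, num_heads=4):
        super().__init__()
        self.pre = nn.Conv2d(c, c, 1)
        self.attn = nn.MultiheadAttention(c // 2, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(c // 2, c), nn.SiLU(), nn.Linear(c, c // 2))
        self.post = nn.Conv2d(c, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x = self.pre(x)
        a, p = x.chunk(2, dim=1)                  # split channels into two halves
        t = a.flatten(2).transpose(1, 2)          # (B, H*W, C/2) tokens for attention
        t = t + self.attn(t, t, t)[0]             # multi-head self-attention (residual)
        t = t + self.ffn(t)                       # feed-forward network (residual)
        a = t.transpose(1, 2).reshape(b, c // 2, h, w)
        return self.post(torch.cat([a, p], dim=1))  # concat and fuse by 1x1 conv

# Applied only after the lowest-resolution stage (Stage 4) to keep the
# quadratic attention cost low.
y = PSALike(256)(torch.randn(1, 256, 20, 20))
```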
- (My personal comment: For model architecture enhancement, making the model smaller does not necessarily mean that inference time is reduced.)
3. Results
3.1. Model Variants
- YOLOv10 has the same variants as YOLOv8, i.e., N / S / M / L / X.
- Besides, a new variant YOLOv10-B is derived by simply increasing the width scale factor of YOLOv10-M.
3.2. SOTA Comparisons
YOLOv10 achieves state-of-the-art performance and end-to-end latency across various model scales.
3.3. Ablation Studies
- Table 2: With all components enabled, the best APs are obtained.
- (Numerous ablation experiments are performed as above. Please feel free to read the paper directly if interested.)