Brief Review — PP-YOLOE: An evolved version of YOLO
Improves PP-YOLOv2 as PP-YOLOE
PP-YOLOE: An evolved version of YOLO
PP-YOLOE, by Baidu Inc.
2022 arXiv v2, Over 210 Citations (Sik-Ho Tsang @ Medium)
Object Detection
2014 … 2021 [Scaled-YOLOv4] [PVT, PVTv1] [Deformable DETR] [HRNetV2, HRNetV2p] [MDETR] [TPH-YOLOv5] 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]
Outline
- PP-YOLOE
- Results
1. PP-YOLOE
1.1. Anchor-Free
- Following FCOS, which tiles one anchor point on each pixel, upper and lower bounds are set for the three detection heads so that each ground truth is assigned to the corresponding feature map.
- Then, the center of each bounding box is calculated, and the closest pixel is selected as the positive sample. Following the YOLO series, a 4D vector (x, y, w, h) is predicted for regression. This modification makes the model slightly faster at the cost of 0.3 AP.
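The assignment described above can be sketched in a few lines. This is only an illustration: the strides, the per-level size ranges, the use of the longest box side to pick a level, and the helper names `assign_box` and `encode_target` are all assumptions, not values or code from the paper.

```python
import math

# Hypothetical strides and per-level size bounds (assumptions for illustration).
STRIDES = [8, 16, 32]
SIZE_RANGES = [(0, 64), (64, 128), (128, math.inf)]


def assign_box(box):
    """Assign one ground-truth box (x1, y1, x2, y2) to a head and a grid cell.

    Returns (level_index, col, row) of the single positive anchor point.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    size = max(x2 - x1, y2 - y1)  # simplification: longest side picks the level

    # Pick the level whose [lower, upper) range contains the box size.
    level = len(SIZE_RANGES) - 1
    for idx, (lower, upper) in enumerate(SIZE_RANGES):
        if lower <= size < upper:
            level = idx
            break

    stride = STRIDES[level]
    # Anchor points sit at cell centers ((col + 0.5) * stride, (row + 0.5) * stride),
    # so the point closest to the box center is the cell that contains it.
    col, row = int(cx // stride), int(cy // stride)
    return level, col, row


def encode_target(box):
    """Encode the box as the 4D regression vector (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```

For example, under these assumed ranges a 40×30 box lands on the stride-8 head, while a 200×100 box lands on the stride-32 head. The sketch does not clamp the cell index to the feature map size.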
1.2. Backbone and Neck
- RepResBlock takes one form during the training phase (Fig. 3(b)) and another during the inference phase (Fig. 3(c)). First, the original TreeBlock (Fig. 3(a)) is simplified.
- Then, the concatenation operation is replaced with an element-wise add operation (Fig. 3(b)), because RMNet [19] shows that the two operations approximate each other to some extent.
- Thus, during the inference phase, RepResBlock can be re-parameterized to a basic residual block (Fig. 3(c)) used by ResNet-34 in a RepVGG style.
- The proposed RepResBlock is used to build the backbone and the neck.
- Similar to ResNet, the proposed backbone, named CSPRepResNet, contains 1 stem composed of 3 convolution layers, followed by 4 stages stacked from the proposed RepResBlock, as shown in Fig. 3(d).
- ESE (Effective Squeeze and Extraction) layer is also used.
- The neck is built with the proposed RepResBlock and CSPRepResStage, following PP-YOLOv2. Unlike in the backbone, the shortcut in RepResBlock and the ESE layer in CSPRepResStage are removed in the neck.
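To make the RepVGG-style re-parameterization concrete, here is a minimal single-channel sketch in pure Python. It assumes the training-time block adds the outputs of a 3×3 branch and a 1×1 branch, and it leaves out batch normalization, bias, and multiple channels. The helper names `conv2d_same` and `fuse_3x3_and_1x1` and all kernel values are made up for illustration.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def fuse_3x3_and_1x1(k3, k1):
    """Fold a 1x1 kernel into the center tap of a copy of the 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1[0][0]
    return fused


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4], [0.2, 0.1, -0.3]]
k1 = [[0.7]]

# Training-time form: two parallel branches whose outputs are added.
branch_3x3 = conv2d_same(image, k3)
branch_1x1 = conv2d_same(image, k1)
two_branch = [[a + b for a, b in zip(ra, rb)]
              for ra, rb in zip(branch_3x3, branch_1x1)]

# Inference-time form: one 3x3 convolution with the fused kernel.
single = conv2d_same(image, fuse_3x3_and_1x1(k3, k1))

max_diff = max(abs(a - b)
               for ra, rb in zip(two_branch, single)
               for a, b in zip(ra, rb))
```

Because convolution is linear, the two forms agree up to floating-point rounding, so `max_diff` stays near zero. That is why the extra branch costs nothing at inference time.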
1.3. Model Scaling
- A width multiplier and a depth multiplier are used to scale the basic backbone and neck jointly, as in YOLOv5.
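A rough sketch of how such joint scaling can work is given below. The base channel widths, base stage depths, multiplier values, and rounding rule are illustrative assumptions rather than the paper's exact settings, and `scale_model` is a hypothetical helper.

```python
def scale_model(base_widths, base_depths, width_mult, depth_mult):
    """Scale channel counts and per-stage block counts by two multipliers.

    Every stage keeps at least one channel and one block.
    """
    widths = [max(1, round(c * width_mult)) for c in base_widths]
    depths = [max(1, round(d * depth_mult)) for d in base_depths]
    return widths, depths


# Illustrative base configuration and a "small" variant (assumed values).
BASE_WIDTHS = [64, 128, 256, 512, 1024]
BASE_DEPTHS = [3, 6, 6, 3]

small_widths, small_depths = scale_model(BASE_WIDTHS, BASE_DEPTHS, 0.5, 0.33)
```

With these assumed numbers, the small variant halves every channel count and keeps roughly a third of the blocks in each stage.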
1.4. Task Alignment Learning (TAL)
- YOLOX uses SimOTA as the label assignment strategy to improve performance.
- To further overcome the misalignment of classification and localization, task alignment learning (TAL) is proposed in TOOD, which is composed of a dynamic label assignment and task aligned loss.
- For the task aligned loss, TOOD uses a normalized t, namely t̂, to replace the target in the loss. The Binary Cross Entropy (BCE) for classification is then computed over the positive samples with t̂ as the target.
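The sketch below illustrates the idea for the positive samples of a single ground-truth instance. It assumes the TOOD alignment metric t = s^α · u^β (s is the classification score, u is the IoU), with α = 1 and β = 6 as assumed defaults, and a normalization that makes the largest t̂ equal to the largest IoU in the instance. The top-k selection of positives is omitted, and all function names are hypothetical.

```python
import math


def alignment_metric(score, iou, alpha=1.0, beta=6.0):
    """Task alignment metric t = s^alpha * u^beta (assumed TOOD form)."""
    return (score ** alpha) * (iou ** beta)


def normalize_targets(metrics, ious):
    """Rescale one instance's metrics so that max(t_hat) equals max(IoU)."""
    max_t = max(metrics)
    if max_t == 0:
        return [0.0] * len(metrics)
    max_iou = max(ious)
    return [t / max_t * max_iou for t in metrics]


def bce(p, target, eps=1e-9):
    """Binary cross entropy with a soft target in [0, 1]."""
    p = min(max(p, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def cls_pos_loss(scores, ious):
    """Sum of BCE(score_i, t_hat_i) over the positives of one instance."""
    metrics = [alignment_metric(s, u) for s, u in zip(scores, ious)]
    t_hat = normalize_targets(metrics, ious)
    return sum(bce(s, t) for s, t in zip(scores, t_hat))
```

The soft target rewards predictions that are both confident and well localized. An anchor with a high score but a poor IoU receives a small t̂, so its classification score is pushed down.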
1.5. Efficient Task-aligned Head (ET-head)
- The decoupled head in some prior works may make the classification and localization tasks separate and independent, and lacks task-specific learning.
- As shown in Fig. 2, ESE replaces the layer attention in TOOD, the alignment of the classification branch is simplified to a shortcut, and the alignment of the regression branch is replaced with a distribution focal loss (DFL) layer [16].
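Two pieces of this head are easy to sketch in isolation. The first is an ESE-style channel gate, assumed here to be a sigmoid over a fully connected layer applied to the per-channel means. The second is a DFL-style decoding step, assumed to take the expected bin index under a softmax distribution. The function names and shapes are illustrative, not the paper's implementation.

```python
import math


def ese_gate(channel_means, weight, bias):
    """Per-channel attention weights: sigmoid(W @ mean + b).

    channel_means: globally average-pooled value of each channel.
    weight: square matrix (list of rows), bias: one value per channel.
    """
    gates = []
    for w_row, b in zip(weight, bias):
        z = sum(w * m for w, m in zip(w_row, channel_means)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    return gates


def dfl_decode(logits):
    """Expected bin index under softmax(logits), used as the decoded distance."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))
```

Each feature channel would then be multiplied by its gate value. For the regression branch, a sharply peaked distribution decodes to roughly the index of its peak, while a flat one decodes to the middle of the bin range.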
1.6. Loss Function
- For the learning of the classification and localization tasks, varifocal loss (VFL) and distribution focal loss (DFL) are chosen, respectively.
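As a reference point, the two losses can be sketched per sample as follows. The forms follow the original VFL and DFL definitions. The defaults α = 0.75 and γ = 2.0 and the argument names are assumptions, and the weighting between the loss terms used in PP-YOLOE is not reproduced here.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-9):
    """Varifocal loss for one prediction.

    p: predicted score in (0, 1); q: soft target (the IoU in the original
    VFL formulation, 0 for negatives).
    """
    p = min(max(p, eps), 1 - eps)
    if q > 0:
        # Positives: BCE against the soft target, weighted by the target.
        return -q * (q * math.log(p) + (1 - q) * math.log(1 - p))
    # Negatives: down-weighted by p^gamma, so easy negatives contribute little.
    return -alpha * (p ** gamma) * math.log(1 - p)


def distribution_focal_loss(probs, y, eps=1e-9):
    """DFL for one box side.

    probs: softmax probabilities over integer bins 0..n-1.
    y: continuous target distance in [0, n-1].
    """
    left = min(int(math.floor(y)), len(probs) - 2)
    right = left + 1
    w_left, w_right = right - y, y - left  # linear interpolation weights
    return -(w_left * math.log(max(probs[left], eps))
             + w_right * math.log(max(probs[right], eps)))
```

DFL pushes probability mass onto the two bins nearest the target, so the expected value of the distribution lands on the continuous target distance.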
2. Results
2.1. Ablation Study
Except for the anchor-free change, each added component improves mAP. Although going anchor-free reduces mAP, anchor-free detection is now the mainstream.
TAL is the best-performing label assignment strategy in the ablation.
2.2. SOTA Comparisons
PP-YOLOE achieves the best results in this comparison, outperforming YOLOv3, YOLOv4, YOLOv5, EfficientDet, PP-YOLO, and PP-YOLOv2.