Brief Review — PP-YOLOE: An evolved version of YOLO

Improving PP-YOLOv2 into PP-YOLOE

Sik-Ho Tsang
4 min read · Jun 9, 2024
Comparison of PP-YOLOE and other state-of-the-art models

PP-YOLOE: An evolved version of YOLO
PP-YOLOE, by Baidu Inc.
2022 arXiv v2, Over 210 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2021 [Scaled-YOLOv4] [PVT, PVTv1] [Deformable DETR] [HRNetV2, HRNetV2p] [MDETR] [TPH-YOLOv5] 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]
==== My Other Paper Readings Are Also Over Here ====

  • PP-YOLOv2 is improved as PP-YOLOE, using an anchor-free paradigm, a more powerful backbone and neck equipped with CSPRepResStage, an ET-head, and the dynamic label assignment algorithm TAL.

Outline

  1. PP-YOLOE
  2. Results

1. PP-YOLOE

PP-YOLOE

1.1. Anchor-Free

  • Following FCOS, which tiles one anchor point on each pixel, upper and lower size bounds are set for the three detection heads to assign each ground truth to the corresponding feature map.
  • Then, the center of the bounding box is calculated and the closest pixel is selected as the positive sample. Following the YOLO series, a 4D vector (x, y, w, h) is predicted for regression. This modification makes the model a little faster, at a cost of 0.3 AP. A minimal sketch of this assignment is given below.
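Below is a minimal NumPy sketch of this FCOS-style anchor-free assignment. The size ranges for the three heads (strides 8/16/32) and the simple closest-to-center rule are illustrative assumptions, not the exact PP-YOLOE implementation:

```python
# A minimal sketch of FCOS-style anchor-free assignment: pick a feature level
# by box size, then take the grid cell closest to the box center as positive.
import numpy as np

# Hypothetical upper/lower size bounds for the three detection heads.
SCALE_RANGES = {8: (0, 64), 16: (64, 128), 32: (128, np.inf)}

def assign_gt_to_level(gt_box):
    """Pick the feature level whose size range contains the box's longer side."""
    x1, y1, x2, y2 = gt_box
    size = max(x2 - x1, y2 - y1)
    for stride, (lo, hi) in SCALE_RANGES.items():
        if lo <= size < hi:
            return stride
    return 32  # fall back to the coarsest level

def positive_point(gt_box, stride, feat_h, feat_w):
    """Select the anchor point (grid cell) closest to the box center."""
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    gx = int(np.clip(cx / stride, 0, feat_w - 1))
    gy = int(np.clip(cy / stride, 0, feat_h - 1))
    return gy, gx

# Example: a 100x80 box on a 640x640 image lands on the stride-16 head.
box = np.array([100.0, 120.0, 200.0, 200.0])
stride = assign_gt_to_level(box)
print(stride, positive_point(box, stride, 640 // stride, 640 // stride))
```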

1.2. Backbone and Neck

RepResBlock and CSPRepResStage
  • RepResBlock takes different forms during the training phase and the inference phase (Fig. 3(b) and Fig. 3(c), respectively). Firstly, the original TreeBlock (Fig. 3(a)) is simplified.
  • Then, the concatenation operation is replaced with an element-wise add operation (Fig. 3(b)), because the two operations approximate each other to some extent, as shown in RMNet [19].
  • Thus, during the inference phase, RepResBlock can be re-parameterized into the basic residual block (Fig. 3(c)) used by ResNet-34, in a RepVGG style (see the fusion sketch after this list).
  • The proposed RepResBlock is used to build the backbone and neck.
  • Similar to ResNet, the proposed backbone, named CSPRepResNet, contains 1 stem composed of 3 convolution layers and 4 subsequent stages stacked with the proposed RepResBlock, as shown in Fig. 3(d).
  • ESE (Effective Squeeze and Extraction) layer is also used.
  • The neck is built with the proposed RepResBlock and CSPRepResStage, following PP-YOLOv2. Different from the backbone, the shortcut in RepResBlock and the ESE layer in CSPRepResStage are removed in the neck.
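Below is a minimal PyTorch sketch of the RepVGG-style re-parameterization used by this kind of block: a 3×3 conv+BN branch and a 1×1 conv+BN branch (training form) are fused into a single 3×3 convolution (inference form). Names and shapes are illustrative assumptions, not the PP-YOLOE source code:

```python
# RepVGG-style fusion: fold BN into each conv, pad the 1x1 kernel to 3x3,
# and sum the two kernels into one equivalent 3x3 convolution.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold BatchNorm statistics into the preceding convolution's weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                              # per-output-channel scale
    fused_w = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused_b = (bias - bn.running_mean) * scale + bn.bias
    return fused_w, fused_b

def reparameterize(conv3x3, bn3, conv1x1, bn1):
    """Merge the two training-time branches into a single 3x3 convolution."""
    w3, b3 = fuse_conv_bn(conv3x3, bn3)
    w1, b1 = fuse_conv_bn(conv1x1, bn1)
    w1_padded = torch.nn.functional.pad(w1, [1, 1, 1, 1])  # 1x1 kernel -> 3x3
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, 3, padding=1)
    fused.weight.data = w3 + w1_padded
    fused.bias.data = b3 + b1
    return fused

# Sanity check: the fused conv matches the sum of the two branches in eval mode.
c = 8
conv3, bn3 = nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c)
conv1, bn1 = nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c)
bn3.eval(); bn1.eval()
x = torch.randn(1, c, 16, 16)
y_train = bn3(conv3(x)) + bn1(conv1(x))
y_infer = reparameterize(conv3, bn3, conv1, bn1)(x)
print(torch.allclose(y_train, y_infer, atol=1e-5))  # True
```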

1.3. Model Scaling

  • A width multiplier and a depth multiplier are used to jointly scale the basic backbone and neck, like YOLOv5 (a minimal scaling sketch is given below):
Model Scaling
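Below is a minimal Python sketch of this width/depth scaling, where the multiplier values and the base channel/block counts are illustrative placeholders rather than the official PP-YOLOE-s/m/l/x settings:

```python
# Width multiplier rescales channel counts; depth multiplier rescales the
# number of stacked blocks per stage (YOLOv5-style compound scaling).
import math

def scale_channels(channels: int, width_mult: float, divisor: int = 8) -> int:
    """Round the scaled channel count up to a multiple of `divisor`."""
    return max(divisor, int(math.ceil(channels * width_mult / divisor) * divisor))

def scale_depth(num_blocks: int, depth_mult: float) -> int:
    """Scale the number of blocks per stage, keeping at least one."""
    return max(1, round(num_blocks * depth_mult))

base_channels = [64, 128, 256, 512, 1024]   # base stage widths (illustrative)
base_blocks = [3, 6, 6, 3]                  # base blocks per stage (illustrative)

width_mult, depth_mult = 0.50, 0.33         # a hypothetical "small" variant
print([scale_channels(c, width_mult) for c in base_channels])
print([scale_depth(n, depth_mult) for n in base_blocks])
```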

1.4. Task Alignment Learning (TAL)

  • YOLOX uses SimOTA as the label assignment strategy to improve performance.
  • To further overcome the misalignment of classification and localization, task alignment learning (TAL) is proposed in TOOD, which is composed of a dynamic label assignment and a task-aligned loss.
  • For the task-aligned loss, TOOD uses a normalized t, namely t̂, to replace the target in the loss. The binary cross entropy (BCE) for the classification can then be rewritten as shown below:
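A reconstruction of the rewritten BCE following the description above: the normalized alignment metric t̂ replaces the binary classification target on the positive samples (the notation here is an assumption, since the original equation image is not reproduced):

```latex
% BCE over the N_pos positive samples, with the normalized alignment
% metric \hat{t}_i replacing the binary target (reconstruction; notation assumed).
\mathcal{L}_{\mathrm{cls\text{-}pos}}
  = \sum_{i=1}^{N_{\mathrm{pos}}} \mathrm{BCE}\!\left(p_i,\, \hat{t}_i\right)
```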

1.5. Efficient Task-aligned Head (ET-head)

  • The decoupled head in some prior works may make the classification and localization tasks separate and independent, and lack task-specific learning.
  • As shown in Fig. 2, ESE is used to replace the layer attention in TOOD, the alignment of the classification branch is simplified to a shortcut, and the alignment of the regression branch is replaced with a distribution focal loss (DFL) layer [16]. A sketch of an ESE-style attention block is given below.
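For reference, below is a minimal PyTorch sketch of an ESE-style channel attention block (global pooling, a single 1×1 projection, sigmoid gating). It follows the common ESE formulation and is an assumption rather than the exact PP-YOLOE ET-head code:

```python
# ESE-style channel attention: squeeze each channel with global average
# pooling, excite with one 1x1 conv, and gate the input with a sigmoid.
import torch
import torch.nn as nn

class ESEAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)  # no channel reduction
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=(2, 3), keepdim=True)   # squeeze: one descriptor per channel
        return x * self.gate(self.fc(s))       # excite: channel-wise re-weighting

feat = torch.randn(1, 256, 20, 20)
print(ESEAttention(256)(feat).shape)  # torch.Size([1, 256, 20, 20])
```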

1.6. Loss Function

  • For the learning of the classification and localization tasks, varifocal loss (VFL) and distribution focal loss (DFL) are chosen, respectively:
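For reference, the two losses have the following standard forms from their original papers (VarifocalNet for VFL, Generalized Focal Loss for DFL); the notation is theirs rather than PP-YOLOE's:

```latex
% Varifocal loss: q is the IoU-aware classification target (q > 0 for positives),
% p is the predicted score, and \alpha, \gamma weight the negatives.
\mathrm{VFL}(p, q) =
\begin{cases}
  -q \left( q \log p + (1 - q) \log (1 - p) \right), & q > 0 \\
  -\alpha \, p^{\gamma} \log (1 - p), & q = 0
\end{cases}

% Distribution focal loss: the continuous target y falls between the
% discretized bins y_i and y_{i+1}, with predicted probabilities S_i, S_{i+1}.
\mathrm{DFL}(S_i, S_{i+1}) =
  -\left( (y_{i+1} - y) \log S_i + (y - y_i) \log S_{i+1} \right)
```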

2. Results

2.1. Ablation Study

Ablation Study

Except for the anchor-free design, mAP improves with each component added. Although going anchor-free reduces mAP slightly, the anchor-free paradigm is now the mainstream.

Different Label Assignment

TAL is the best label assignment strategy among those compared.

2.2. SOTA Comparisons

SOTA Comparisons

PP-YOLOE achieves the best results among the compared object detectors, outperforming YOLOv3, YOLOv4, YOLOv5, EfficientDet, PP-YOLO, and PP-YOLOv2.

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
