Review — YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
YOLOv9, by Academia Sinica, National Taipei University of Technology, and Chung Yuan Christian University
2024 arXiv v2 (Sik-Ho Tsang @ Medium)
Object Detection
- Programmable gradient information (PGI) is proposed to provide complete input information for the target task when calculating the objective function, so that reliable gradient information can be obtained to update the network weights.
- In addition, a new lightweight network architecture, Generalized Efficient Layer Aggregation Network (GELAN), is also proposed.
Outline
- Information Bottleneck Principle and Motivations
- YOLOv9: Programmable Gradient Information (PGI)
- YOLOv9: Generalized Efficient Layer Aggregation Network (GELAN)
- Experimental Results
1. Information Bottleneck Principle and Motivations
- (You may skip this section for a quick read.)
According to the information bottleneck principle, data X may suffer information loss when passing through a transformation:
I(X, X) ≥ I(X, fθ(X)) ≥ I(X, gϕ(fθ(X)))
- where I denotes mutual information and, in deep neural networks, fθ(·) and gϕ(·) respectively represent the operations of two consecutive layers. One way to mitigate this problem is to directly increase the size (depth or width) of the model, which slows down the rate of information loss.
- Another way is to use reversible functions, i.e. a function r that has an inverse transformation v:
X = vζ(rψ(X))
(The “reversible” here is not the one in Reversible Residual Network.)
- When the network’s transformation function is composed of reversible functions, no information is lost, so more reliable gradients can be obtained to update the model:
I(X, X) = I(X, rψ(X)) = I(X, vζ(rψ(X)))
- PreAct ResNet is one of the networks that conforms to the reversible property:
X^(l+1) = X^l + fθ^(l+1)(X^l)
- Masked modeling, where M is a dynamic binary mask, can be used so that the model tries its best to recover the original data:
X = vζ(rψ(X) · M)
- When the above principle is applied to object detection, the label Y carries very little information compared with the input, i.e. I(Y,X) occupies only a very small part of I(X,X):
I(X, X) ≥ I(Y, X) ≥ I(Y, fθ(X)) ≥ … ≥ I(Y, Ŷ)
The goal for a lightweight model is therefore to accurately filter I(Y,X) out of I(X,X).
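The information-loss argument in this section can be checked numerically on a toy discrete example. The sketch below is not from the paper: the variable X (uniform over 8 values), the lossy maps `f` and `g`, and the reversible map `r` are all hypothetical choices made only to illustrate that mutual information with the input shrinks under lossy transformations and is preserved under a reversible one.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Return I(A;B) in bits, treating the (a, b) pairs as equally likely samples."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in joint.items()
    )


def f(x):
    """Hypothetical lossy layer: merges neighbouring inputs."""
    return x // 2


def g(y):
    """A second hypothetical lossy layer applied on top of f."""
    return y // 2


def r(x):
    """Hypothetical reversible layer: a permutation of the 8 values."""
    return (x + 3) % 8


xs = list(range(8))  # X is uniform over 8 values, so H(X) = 3 bits

i_xx = mutual_information([(x, x) for x in xs])
i_xf = mutual_information([(x, f(x)) for x in xs])
i_xgf = mutual_information([(x, g(f(x))) for x in xs])
i_xr = mutual_information([(x, r(x)) for x in xs])

print(i_xx, i_xf, i_xgf)  # decreasing: each lossy layer discards information
print(i_xr)               # equal to i_xx: the reversible map loses nothing
```

Each lossy layer halves the number of distinguishable values, so one bit is lost per layer, while the permutation keeps all three bits.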
2. YOLOv9: Programmable Gradient Information (PGI)
- In order to solve the aforementioned problems, a new auxiliary supervision framework called Programmable Gradient Information (PGI) is proposed.
- PGI mainly includes three components, namely (1) main branch, (2) auxiliary reversible branch, and (3) multi-level auxiliary information.
2.1. Main Branch
- The inference process of PGI uses only the main branch, so PGI does not add any inference cost.
2.2. Auxiliary Reversible Branch
- The auxiliary reversible branch is used to generate reliable gradients and update network parameters.
By providing information that maps from data to targets, the loss function can provide guidance and avoid the possibility of finding false correlations from incomplete feedforward features that are less relevant to the target.
2.3. Multi-level Auxiliary Information
- The concept of multi-level auxiliary information is to insert an integration network between the feature pyramid hierarchy layers of auxiliary supervision and the main branch, and then use it to combine the gradients returned from different prediction heads.
The characteristics of the main branch’s feature pyramid hierarchy will not be dominated by some specific object’s information. As a result, this method can alleviate the broken information problem in deep supervision. In addition, any integrated network can be used in multi-level auxiliary information.
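The split between training and inference described in this section can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyPGIModel`, its scalar weights, and the `aux_weight` value are hypothetical, and the sketch does not model the reversible architecture or the integration network. It shows only that the auxiliary branch contributes a loss term during training and is never run at inference.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPGIModel:
    """Toy stand-in for a detector trained with an auxiliary branch."""

    w_main: float = 0.5  # hypothetical scalar weight of the main branch
    w_aux: float = 1.5   # hypothetical scalar weight of the auxiliary branch
    calls: dict = field(default_factory=lambda: {"main": 0, "aux": 0})

    def main_branch(self, x):
        self.calls["main"] += 1
        return self.w_main * x

    def aux_branch(self, x):
        self.calls["aux"] += 1
        return self.w_aux * x

    def forward(self, x, training):
        outputs = {"main": self.main_branch(x)}
        if training:
            # The auxiliary branch is evaluated only to supply a training signal.
            outputs["aux"] = self.aux_branch(x)
        return outputs


def total_loss(outputs, target, aux_weight=0.25):
    """Squared error on the main head plus a weighted auxiliary term, if present."""
    loss = (outputs["main"] - target) ** 2
    if "aux" in outputs:
        loss += aux_weight * (outputs["aux"] - target) ** 2
    return loss


model = ToyPGIModel()
train_out = model.forward(2.0, training=True)
infer_out = model.forward(2.0, training=False)

print(sorted(train_out))  # both heads are present during training
print(sorted(infer_out))  # only the main head is present at inference
print(model.calls)        # the auxiliary branch ran once, the main branch twice
print(total_loss(train_out, target=1.0))
```

In a real detector the two branches share parts of the network, which is how the auxiliary loss ends up shaping the main branch's weights; the toy keeps them separate for brevity.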
3. YOLOv9: Generalized Efficient Layer Aggregation Network (GELAN)
- By combining two neural network architectures, CSPNet and ELAN [65], which are designed with gradient path planning, generalized efficient layer aggregation network (GELAN) is designed.
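The aggregation pattern can be sketched in a few lines. This is a loose, hypothetical illustration rather than the authors' implementation: features are plain lists of channel values, and `double`, `add_one`, and `pair_mean` are made-up stand-ins for the computational blocks and the transition layer. The sketch assumes a CSP-style channel split followed by ELAN-style stacking, where every intermediate output is kept and concatenated before the final transition.

```python
def gelan_block(x, blocks, transition):
    """Toy layer-aggregation block.

    x          -- list of channel values (stand-in for a feature map)
    blocks     -- list of callables, each mapping a channel list to a channel list
    transition -- callable applied to the concatenation of all kept outputs
    """
    half = len(x) // 2
    outputs = [x[:half], x[half:]]  # CSP-style split into two parts
    for block in blocks:
        # ELAN-style stacking: each block consumes the most recent output,
        # and every intermediate result is kept for aggregation.
        outputs.append(block(outputs[-1]))
    merged = [c for part in outputs for c in part]  # channel-wise concatenation
    return transition(merged)


def double(channels):
    """Hypothetical computational block."""
    return [2 * c for c in channels]


def add_one(channels):
    """Another hypothetical computational block."""
    return [c + 1 for c in channels]


def pair_mean(channels):
    """Hypothetical transition that halves the channel count."""
    return [(a + b) / 2 for a, b in zip(channels[0::2], channels[1::2])]


features = [1.0, 2.0, 3.0, 4.0]
out = gelan_block(features, [double, add_one], pair_mean)
print(out)
```

Because the block argument is just a list of callables, any computational block can be swapped in, which is the sense in which the design is "generalized".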
4. Experimental Results
4.1. Ablation Study
- CSP blocks perform particularly well as the computational block (CB).
CSP-ELAN is therefore used as the component unit of GELAN in YOLOv9.
PGI can improve accuracy under different combinations. Especially when using ICN, stable and better results are obtained.
- The lead-head guided assignment proposed in YOLOv7 is also applied to PGI’s auxiliary supervision, and achieves much better performance.
- In the above table, the results of gradually increasing components from baseline YOLOv7 to YOLOv9-E are shown.
GELAN and PGI bring all-round improvements to the model.
4.2. SOTA Comparisons
Compared with the lightweight and medium models of YOLO MS [7], YOLOv9 has about 10% fewer parameters and 5∼15% fewer computations, but still has a 0.4∼0.6% improvement in AP.
Compared with YOLOv8-X, YOLOv9-E has 16% fewer parameters, 27% fewer computations, and a significant improvement of 1.7% AP.
YOLOv9 shows the huge advantages of using PGI. By accurately retaining and extracting the information needed to map the data to the target, YOLOv9 requires only 66% of the parameters while maintaining the same accuracy as RT DETR-X.
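The comparisons above mix two kinds of numbers: relative reductions (parameters, computations) and absolute differences in AP points. A minimal sketch of how each is computed follows. The input values are hypothetical round numbers chosen only to reproduce the "16% fewer, 27% fewer, +1.7 AP" pattern; they are not the measured figures of any model.

```python
import math


def relative_reduction(baseline, new):
    """Percentage by which `new` is smaller than `baseline`."""
    return 100.0 * (baseline - new) / baseline


def ap_gain_points(baseline_ap, new_ap):
    """Absolute AP difference in percentage points."""
    return new_ap - baseline_ap


# Hypothetical values, for illustration only.
baseline = {"params_m": 100.0, "gflops": 200.0, "ap": 50.0}
candidate = {"params_m": 84.0, "gflops": 146.0, "ap": 51.7}

param_cut = relative_reduction(baseline["params_m"], candidate["params_m"])
flop_cut = relative_reduction(baseline["gflops"], candidate["gflops"])
ap_gain = ap_gain_points(baseline["ap"], candidate["ap"])

print(f"{param_cut:.0f}% fewer parameters")
print(f"{flop_cut:.0f}% fewer computations")
print(f"+{ap_gain:.1f} AP points")
```

Reading "1.7% AP" as an absolute gain in AP points, rather than a relative change, is the interpretation assumed here.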
4.3. Visualizations
- For ResNet, although the position of the object can still be seen at the 50th layer, the boundary information has been lost. When the depth reaches the 100th layer, the whole image becomes blurry.
- Both CSPNet and the proposed GELAN perform very well, and they both can maintain features that support clear identification of objects until the 200th layer.
Among the comparisons, GELAN has more stable results and clearer boundary information.