Brief Review — YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs

YOLO-ReT, Real-Time on Jetson Nano, Jetson Xavier NX, and Jetson Xavier AGX

Sik-Ho Tsang
3 min read · Jun 23, 2024

YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs
by Advanced Digital Sciences Center, Hamad Bin Khalifa University, and University of Illinois at Urbana-Champaign
2022 WACV, Over 20 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]
==== My Other Paper Readings Are Also Over Here ====

  • A novel multi-scale feature interaction is proposed that exploits combinatorial connections between feature scales which are missing in existing state-of-the-art methods.
  • Additionally, a novel transfer learning backbone truncation is also proposed.


  1. YOLO-ReT
  2. Results

1. YOLO-ReT


1.1. Raw Feature Collection and Redistribution (RFCR) Module

  • Existing methods of multi-scale feature interaction focus on only two adjacent feature scales at a time.
  • Furthermore, when repeatedly using the top-down and bottom-up paths, the detection accuracy of the model starts to saturate.

Inspired by NAS-FPN, a lightweight raw feature collection and redistribution (RFCR) module is proposed, which fuses raw multi-scale features from the backbone together and then redistributes them back to each feature scale.

  • Such a layer does not involve any heavy computations or parameters, yet it allows a direct link between every pair of feature scales.
  • Although the YOLOv3 detection head has only three output scales, the RFCR module can use four different backbone features, allowing more fine-grained low-level features to be utilized to improve model performance.

The raw features during collection pass through a single 1×1 convolution, and a simple weighted sum is used to fuse features together.

The fused feature map is then passed through a MobileNet 5×5 convolution block (MBConv), and the result is redistributed back to the various scales with upsampling and downsampling layers as required.
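The collect-fuse-redistribute flow described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the fused channel width, the choice of the middle scale as the fusion resolution, and the MBConv expansion ratio are all assumptions for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFCR(nn.Module):
    """Sketch of Raw Feature Collection and Redistribution (RFCR):
    compress each backbone scale with a 1x1 conv, fuse all scales with a
    learnable weighted sum at a common resolution, refine with a depthwise
    5x5 MBConv-style block, then redistribute to each detection scale.
    Channel sizes and the fusion resolution here are illustrative."""

    def __init__(self, in_channels, fused_ch=64):
        super().__init__()
        # one cheap 1x1 conv per incoming backbone scale (collection)
        self.compress = nn.ModuleList(
            nn.Conv2d(c, fused_ch, kernel_size=1) for c in in_channels)
        # learnable scalar weight per scale for the weighted sum
        self.weights = nn.Parameter(torch.ones(len(in_channels)))
        # MBConv-style refinement: expand -> depthwise 5x5 -> project
        self.mbconv = nn.Sequential(
            nn.Conv2d(fused_ch, fused_ch * 4, 1), nn.ReLU6(inplace=True),
            nn.Conv2d(fused_ch * 4, fused_ch * 4, 5, padding=2,
                      groups=fused_ch * 4), nn.ReLU6(inplace=True),
            nn.Conv2d(fused_ch * 4, fused_ch, 1))

    def forward(self, feats, out_sizes):
        # collect: resize every compressed feature to the middle scale
        target = feats[len(feats) // 2].shape[-2:]
        w = torch.softmax(self.weights, dim=0)
        fused = sum(w[i] * F.interpolate(self.compress[i](f), size=target,
                                         mode='nearest')
                    for i, f in enumerate(feats))
        fused = self.mbconv(fused)
        # redistribute: up/downsample the refined map to each output scale
        return [F.interpolate(fused, size=s, mode='nearest')
                for s in out_sizes]
```

Note how four backbone features can feed three detection scales: collection and redistribution are decoupled, which is what lets RFCR pull in an extra low-level feature map without changing the detection head.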

1.2. Backbone Truncation

Backbone Truncation
  • Three commonly used backbones, MobileNetV2 (×0.75 and ×1.4) and EfficientNet-B3, are used for the experiments, and each backbone is divided into various blocks.

It can be noted from the figure that, as the portion of the feature extraction backbone initialised with pre-trained weights increases, the model performance improves, emphasizing the importance of transfer learning. However, around the 60% mark, the performance starts to deteriorate and fluctuate.

  • Based on the results from Figure 2 above, the last two blocks of both MobileNetV2 versions, and the last three blocks of EfficientNet-B3, are truncated when adopting them as backbones.
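The truncation step above amounts to keeping only the early blocks of a block-structured backbone. The sketch below is a hypothetical illustration, not the paper's code: the toy block definition and the per-block channel widths are assumptions, and only the drop-the-last-two-blocks choice follows the paper's decision for MobileNetV2.

```python
import torch
import torch.nn as nn

# toy stand-in for one backbone block (strided conv + BN + ReLU6);
# real MobileNetV2 blocks are inverted residuals, assumed away here
def make_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU6(inplace=True))

# assumed per-block channel widths for illustration
channels = [3, 16, 32, 64, 96, 160, 320]
blocks = nn.ModuleList(make_block(channels[i], channels[i + 1])
                       for i in range(len(channels) - 1))

def truncate(blocks, drop_last=2):
    """Drop the last `drop_last` blocks of a block-structured backbone,
    mirroring the truncation applied to MobileNetV2 in the paper."""
    return nn.Sequential(*list(blocks)[:-drop_last])

backbone = truncate(blocks, drop_last=2)          # keeps 4 of 6 blocks
feat = backbone(torch.randn(1, 3, 320, 320))      # last kept feature map
```

In practice the kept blocks would be initialised from ImageNet pre-trained weights (the roughly first 60% that Figure 2 shows to benefit from transfer learning), while the truncated tail is simply discarded.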

2. Results

2.1. Ablation Study

  • Additional ‘shortcut’ connections are also introduced in the proposed RFCR module. This additional ‘shortcut’ from shallower layers of the backbone further improves accuracy, emphasizing the importance of low-level features for accurate detection.

Overall, both execution time and accuracy are improved by combining backbone truncation and the RFCR module.

2.2. SOTA Comparisons

SOTA Comparisons
  • Models are deployed to Jetson Nano, Jetson Xavier NX, and Jetson Xavier AGX.

YOLO-ReT-M0.75 at 320×320 resolution outperforms Tinier-YOLO by 3.05 mAP on Pascal VOC and 0.91 mAP on COCO, while running 3.05 FPS faster.

On Jetson Xavier NX, YOLO-ReT-M1.4 at 320×320 resolution outperforms YOLO-Fastest-XL by 0.92 mAP on Pascal VOC and 3.34 mAP on COCO.

Meanwhile, the YOLO-ReT-EB3 model at 416×416 resolution pushes for the best performance while still executing in real time on Jetson Xavier AGX.


