Review — Scaled-YOLOv4: Scaling Cross Stage Partial Network

Comparison of the proposed Scaled-YOLOv4 and other state-of-the-art object detectors
  • A network scaling approach is proposed that modifies not only the depth, width, and resolution but also the structure of the network, which finally forms Scaled-YOLOv4.


  1. Principles of Model Scaling
  2. Scaled-YOLOv4: CSP-ized YOLOv4 / YOLOv4-CSP
  3. Scaled-YOLOv4: YOLOv4-tiny
  4. Scaled-YOLOv4: YOLOv4-large
  5. Ablation Study
  6. Scaled-YOLOv4 for Object Detection Results

1. Principles of Model Scaling

  • Several factors need to be considered when scaling a model for the object detection task. (If not interested, please skip to Section 2 directly.)

1.1. General Principle of Model Scaling

FLOPs of different computational layers with different model scaling factors.
  • Let the scaling factors that can be used to adjust the image size, the number of layers, and the number of channels be α, β, and γ, respectively.
  • For a k-layer CNN with b base-layer channels, the corresponding changes in FLOPs as these scaling factors vary are shown in the table above.
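The trend behind these scaling factors can be sketched from how convolution cost is usually counted. The snippet below is a minimal sketch, assuming a plain stack of k convolutional layers with b input and b output channels each; `conv_stack_flops` and `scaled_flops` are illustrative names, not functions from the paper.

```python
def conv_stack_flops(w, h, k, b, kernel=3):
    """Multiply-accumulate count for k stacked kernel x kernel convs, b -> b channels."""
    return w * h * k * b * b * kernel * kernel


def scaled_flops(w, h, k, b, alpha=1.0, beta=1.0, gamma=1.0):
    """Apply size (alpha), depth (beta), and width (gamma) scaling, then count."""
    return conv_stack_flops(w * alpha, h * alpha, k * beta, b * gamma)


base = conv_stack_flops(32, 32, 4, 64)
size_ratio = scaled_flops(32, 32, 4, 64, alpha=2) / base   # grows with alpha squared
depth_ratio = scaled_flops(32, 32, 4, 64, beta=2) / base   # grows linearly with beta
width_ratio = scaled_flops(32, 32, 4, 64, gamma=2) / base  # grows with gamma squared
print(size_ratio, depth_ratio, width_ratio)
```

In this simplified setting, doubling the input size or the width quadruples the cost, while doubling the depth only doubles it.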
FLOPs of different computational layers with/without CSP-ization.
  • In brief, CSPNet splits the input into two paths: one performs convolutions and the other does not. The two paths are fused at the output.
  • CSPNet can effectively reduce the amount of computations (FLOPs) on ResNet, ResNeXt, and Darknet by 23.5%, 46.7%, and 50.0%, respectively.
  • (If interested, please feel free to visit CSPNet.)
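The two-path idea can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not CSPNet's actual block: it looks at a single pixel, uses a 1x1 convolution for the convolved path, splits the channels in half, and fuses by concatenation. The names `conv1x1` and `csp_block` and the weights are hypothetical.

```python
def conv1x1(x, weights):
    """1x1 convolution at one spatial position: a matrix-vector product over channels."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]


def csp_block(x, weights):
    """Split channels in half, convolve one half, pass the other through, then concatenate."""
    half = len(x) // 2
    conv_path, skip_path = x[:half], x[half:]
    return conv1x1(conv_path, weights) + skip_path


x = [1.0, 2.0, 3.0, 4.0]            # four channels at a single pixel
w = [[0.5, 0.0], [0.0, 2.0]]        # hypothetical 2x2 weights for the convolved half
y = csp_block(x, w)
print(y)
```

In this sketch the convolution sees only half of the channels on both its input and output, so it performs a quarter of the multiplications of a convolution over all channels.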

1.2. Scaling Tiny Models for Low-End Devices

Dense layer in DenseNet & OSA layer in VoVNet/OSANet
FLOPs of Dense layer and OSA layer.
  • Lightweight models are different from large models in that their parameter utilization efficiency must be higher.
  • Networks with efficient parameter utilization, such as DenseNet and OSANet, are analyzed in terms of computation load, where g denotes the growth rate.
  • The order of computational complexity of DenseNet is O(whgbk), and that of OSANet is O(max(whbg, whkg²)).
  • (If interested, please feel free to visit DenseNet & VoVNet/OSANet.)
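To get a feel for the two order terms, they can be evaluated for concrete sizes. The snippet is a minimal sketch: constants are dropped, so the results are order-of-magnitude terms rather than exact FLOPs, and the sizes below are hypothetical values chosen only for illustration.

```python
def dense_order(w, h, b, g, k):
    """Order term for a Dense block: O(w*h*g*b*k)."""
    return w * h * g * b * k


def osa_order(w, h, b, g, k):
    """Order term for an OSA block: O(max(w*h*b*g, w*h*k*g^2))."""
    return max(w * h * b * g, w * h * k * g * g)


# Hypothetical sizes chosen only for illustration.
w, h, b, g, k = 10, 10, 64, 32, 3
dense = dense_order(w, h, b, g, k)
osa = osa_order(w, h, b, g, k)
print(dense, osa)
```

For these particular sizes the OSA term is half of the Dense term; other choices of b, g, and k give other ratios.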
Number of channels of OSANet, CSPOSANet, and CSPOSANet with partial in computational block (PCB).
  • The idea of PCB is to re-plan the b channels of the base layer and the kg channels generated by the computational block, and to split them into two paths with equal channel numbers, as shown above.
  • When the number of channels is b+kg, these channels are split into 2 paths of (b+kg)/2 channels each.
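The equal split can be written out as a small helper. This is only a sketch of the arithmetic; `pcb_split` is a hypothetical name and the values of b, k, and g are made up for illustration.

```python
def pcb_split(b, k, g):
    """Return the channel count of each of the two paths after an equal split of b + k*g."""
    total = b + k * g
    if total % 2:
        raise ValueError("b + k*g must be even for an equal split")
    return total // 2, total // 2


# Hypothetical values: b = 64 base channels, k = 3 layers, growth rate g = 32.
print(pcb_split(64, 3, 32))
```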
The CIO of OSA, CSP, and the designed CSPOSANet

1.3. Scaling Large Models for High-End GPUs

Model scaling factors of different parts of object detectors
  • The biggest difference between image classification and object detection is that the former only needs to identify the category of the largest component in an image, while the latter needs to predict the position and size of each object in an image.
Effect of receptive field caused by different model scaling factors
  • Scaling {size^input, #stage} together turns out to have the best impact.

2. Scaled-YOLOv4: CSP-ized YOLOv4 / YOLOv4-CSP

2.1. Backbone

  • The amount of computation of each CSPDarknet stage is whb²(9/4+3/4+5k/2).
  • According to the previous section, CSPDarknet stage will have a better computational advantage over Darknet stage only when k>1 is satisfied.
  • The number of residual layers in each stage of CSPDarknet53 is 1–2–8–8–4, respectively.
  • To get a better speed/accuracy trade-off, the first CSP stage is converted into an original Darknet residual layer.
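The stage formula can be evaluated for the residual-layer counts listed above. The snippet is a minimal sketch that computes only the coefficient of whb², i.e. 9/4 + 3/4 + 5k/2; `csp_stage_coeff` is an illustrative name. Since w, h, and b differ from stage to stage, the coefficients are not absolute costs.

```python
from fractions import Fraction


def csp_stage_coeff(k):
    """Coefficient of w*h*b^2 in the CSPDarknet stage cost: 9/4 + 3/4 + 5k/2."""
    return Fraction(9, 4) + Fraction(3, 4) + Fraction(5 * k, 2)


residual_layers = [1, 2, 8, 8, 4]   # per-stage counts listed for CSPDarknet53
coeffs = [csp_stage_coeff(k) for k in residual_layers]
print([float(c) for c in coeffs])
```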

2.2. Neck

Computational blocks of reversed Dark layer (SPP) and reversed CSP dark layers (SPP)
  • CSP-izing the neck in this way effectively cuts computation by 40%.

2.3. SPP

  • The SPP module (SPPNet) is now inserted in the middle of the first computation list group of the CSPPAN.
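For readers unfamiliar with SPP, a rough single-channel sketch follows. It assumes the YOLO-style variant: stride-1 max pooling with "same" padding at several kernel sizes, with the results placed next to the input. The kernel sizes 5, 9, and 13 are an assumption (they are not stated in this review), and `max_pool_same` and `spp` are hypothetical helper names.

```python
def max_pool_same(fmap, kernel):
    """Stride-1 max pooling with 'same' padding on a 2D list (borders use only valid cells)."""
    rows, cols, r = len(fmap), len(fmap[0]), kernel // 2
    return [
        [
            max(
                fmap[i][j]
                for i in range(max(0, y - r), min(rows, y + r + 1))
                for j in range(max(0, x - r), min(cols, x + r + 1))
            )
            for x in range(cols)
        ]
        for y in range(rows)
    ]


def spp(fmap, kernels=(5, 9, 13)):
    """Return the input plus one pooled map per kernel; a real module concatenates them as channels."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernels]


fmap = [[0] * 7 for _ in range(7)]
fmap[3][3] = 1                      # single activation in the centre
branches = spp(fmap)
print([sum(map(sum, b)) for b in branches])
```

Each branch keeps the spatial size of the input, and larger kernels spread the single activation over a wider area, which is how the module mixes several receptive-field sizes.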

3. Scaled-YOLOv4: YOLOv4-tiny

Computational block of YOLOv4-tiny
  • The CSPOSANet with partial in computational block (PCB) architecture is used to form the backbone of YOLOv4-tiny.
  • The growth rate is set to g=b/2, and the channels grow to b/2+kg=2b at the end.
  • Solving b/2+kg=2b with g=b/2 gives k=3; the resulting architecture is shown above.
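The value of k can be checked with exact arithmetic. The snippet is a small sketch that takes the growth rate to be half the base channels (g = b/2) and solves b/2 + kg = 2b for k; `layers_needed` and the value of b are hypothetical.

```python
from fractions import Fraction


def layers_needed(b, g):
    """Solve b/2 + k*g = 2*b for k, using exact rational arithmetic."""
    return (2 * b - Fraction(b, 2)) / g


b = 64                               # hypothetical base channel count
g = Fraction(b, 2)                   # growth rate g = b/2
k = layers_needed(b, g)
print(k)
```

The result does not depend on the chosen b, because b cancels out: (2b − b/2) / (b/2) = 3.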

4. Scaled-YOLOv4: YOLOv4-large

Architecture of YOLOv4-large, including YOLOv4-P5, YOLOv4-P6, and YOLOv4-P7. The dashed arrow means that the corresponding CSPUp block is replaced by a CSPSPP block.
  • A fully CSP-ized model YOLOv4-P5 is designed and can be scaled up to YOLOv4-P6 and YOLOv4-P7.
  • Compound scaling on {size^input, #stage} is performed.
  • The depth scale of each stage is set to 2^(d_si), and d_s is set to [1, 3, 15, 15, 7, 7, 7].
  • The inference time is further used as constraint to perform additional width scaling.
  • YOLOv4-P6 can reach real-time performance on 30 FPS video when the width scaling factor is equal to 1.
  • YOLOv4-P7 can reach real-time performance on 16 FPS video when the width scaling factor is equal to 1.25.

5. Ablation Study

  • The MS COCO 2017 object detection dataset is used.
  • All models are trained from scratch.

5.1. Ablation Study on CSP-ized Model

Ablation study of CSP-ized models @ 608 x 608
  • LeakyReLU (Leaky) and Mish activation functions are tried.

5.2. Ablation Study on YOLOv4-tiny

Ablation study of partial at different positions in the computational block.
  • The proposed COSA achieves a higher AP.

5.3. Ablation Study on YOLOv4-large

Ablation study of training schedule with/without fine-tuning

6. Scaled-YOLOv4 for Object Detection Results

6.1. Large-Model

Comparison of state-of-the-art object detectors
  • When YOLOv4-P5 is compared with EfficientDet-D5 at about the same accuracy (51.8% vs 51.5%), YOLOv4-P5 is 2.9 times faster in terms of inference speed.
  • The situation is similar for YOLOv4-P6 vs EfficientDet-D7 (54.5% vs 53.7%) and YOLOv4-P7 vs EfficientDet-D7x (55.5% vs 55.1%): YOLOv4-P6 and YOLOv4-P7 are, respectively, 3.7 times and 2.5 times faster in terms of inference speed.
Results of YOLOv4-large models with test-time augmentation (TTA)
YOLOv4-P7 as “once-for-all” model using AP difference.
  • YOLOv4-P7\P7 and YOLOv4-P7\P7\P6 denote the models obtained by removing the {P7} and {P7, P6} stages, respectively, from the trained YOLOv4-P7.

6.2. Tiny-Model

Comparison of state-of-the-art tiny models.
FPS of YOLOv4-tiny on embedded devices
  • When FP16 and batch size 4 are used to test Xavier AGX and Xavier NX, the frame rate reaches 380 FPS and 199 FPS, respectively.
  • In addition, when TensorRT FP16 is used to run YOLOv4-tiny on a general GPU, the RTX 2080ti, the frame rate reaches 773 FPS at batch size 1 and 1774 FPS at batch size 4, which is extremely fast.
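The frame rates above can be converted into average processing times. This is simple arithmetic under the assumption that FPS counts individual images, including when they are processed in batches; `per_image_ms` and `per_batch_ms` are illustrative helper names.

```python
def per_image_ms(fps):
    """Average time per image in milliseconds at a given throughput."""
    return 1000.0 / fps


def per_batch_ms(fps, batch_size):
    """Average time per batch, assuming FPS counts individual images."""
    return per_image_ms(fps) * batch_size


print(round(per_image_ms(773), 2))       # batch size 1
print(round(per_batch_ms(1774, 4), 2))   # batch size 4
```

Under that assumption, a batch of 4 takes less than twice as long as a single image, which is where the higher throughput at batch size 4 comes from.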


