Review — CSPNet: A New Backbone That Can Enhance Learning Capability of CNN

CSPNet (CSPDenseNet, CSPResNet & CSPResNeXt), Later on Used in YOLOv4 and Scaled-YOLOv4

Sik-Ho Tsang
6 min read · Aug 21, 2021
CSPNet not only reduces the computation cost and memory usage of networks, but also benefits inference speed and accuracy.

In this story, CSPNet: A New Backbone That Can Enhance Learning Capability of CNN, (CSPNet), by Institute of Information Science Academia Sinica, Elan Microelectronics Corporation, and National Chiao Tung University, is reviewed. In this paper:

  • Cross Stage Partial Network (CSPNet) is designed. The authors attribute the heavy inference computation of existing networks to duplicate gradient information within network optimization; by removing this duplication, complexity can be largely reduced while maintaining accuracy.
  • It can be applied to various networks such as DenseNet, ResNeXt and ResNet. Later on, this CSPNet is used in YOLOv4 and Scaled-YOLOv4.

This is a paper in 2020 CVPR Workshop with over 200 citations.


  1. Duplicate Gradient Information in DenseNet
  2. CSPNet (CSPDenseNet, CSPResNet & CSPResNeXt)
  3. Exact Fusion Model (EFM)
  4. Ablation Study
  5. SOTA Comparison

1. Duplicate Gradient Information in DenseNet

Dense Block in DenseNet
  • In DenseNet, the output of the ith dense layer will be concatenated with the input of the ith dense layer. This concatenated outcome becomes the input of the (i+1)th dense layer:
  • If one makes use of a backpropagation algorithm to update weights, the equations of weight updating can be written as:
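The two equations referred to in the list above are not shown in the text. As a reconstruction following the CSPNet paper, a dense block with k layers can be written as below. The notation is an assumption relative to this post: * is the convolution operator, [x_0, x_1, ...] denotes concatenation, w_i and x_i are the weights and output of the ith dense layer, f is the weight-update function, and g_i is the gradient propagated to the ith dense layer.

```latex
% Feed-forward pass of a k-layer dense block
\begin{aligned}
x_1 &= w_1 * x_0 \\
x_2 &= w_2 * [x_0, x_1] \\
    &\;\;\vdots \\
x_k &= w_k * [x_0, x_1, \dots, x_{k-1}]
\end{aligned}

% Weight updates by backpropagation
\begin{aligned}
w_1' &= f(w_1, g_0) \\
w_2' &= f(w_2, g_0, g_1) \\
     &\;\;\vdots \\
w_k' &= f(w_k, g_0, g_1, \dots, g_{k-1})
\end{aligned}
```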

It is found that a large amount of gradient information is reused for updating the weights of different dense layers. As a result, different dense layers repeatedly learn copied gradient information.
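To make the reuse concrete, here is a small bookkeeping sketch. It assumes, as in the CSPNet paper, that the weight update of the ith dense layer depends on the gradient terms g_0 to g_{i-1}. The function names and string labels are hypothetical and only illustrate the counting; no real gradients are computed.

```python
def dense_update_terms(k):
    """Gradient terms entering each weight update of a k-layer dense block.

    Assumes the form w_i' = f(w_i, g_0, ..., g_{i-1}); labels are illustrative.
    """
    return {f"w{i}": [f"g{j}" for j in range(i)] for i in range(1, k + 1)}


def reuse_counts(update_terms):
    """Count in how many weight updates each gradient term appears."""
    counts = {}
    for terms in update_terms.values():
        for g in terms:
            counts[g] = counts.get(g, 0) + 1
    return counts


terms = dense_update_terms(4)
counts = reuse_counts(terms)
for layer, grads in terms.items():
    print(layer, "<-", grads)
print(counts)
```

In this 4-layer example, g0 enters all four updates and g1 enters three of them, which is the duplication the paper points at.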

2. CSPNet (CSPDenseNet, CSPResNet & CSPResNeXt)

2.1. Cross Stage Partial DenseNet (CSPDenseNet)

Cross Stage Partial DenseNet (CSPDenseNet)
  • CSPNet separates the feature map of the base layer into two parts. One part goes through a dense block and a transition layer; the other part is then combined with the transmitted feature map and passed to the next stage.
  • The equations of feed-forward pass and weight updating of CSPDenseNet become:
  • The gradients coming from the dense layers are separately integrated.
  • On the other hand, the feature map that did not go through the dense layers is also separately integrated.
  • As for the gradient information used for updating weights, neither side contains duplicate gradient information that belongs to the other side.
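The feed-forward and weight-update equations mentioned in the list above are not shown in the text. As a reconstruction following the CSPNet paper, they can be written as below. The notation is an assumption relative to this post: the base feature map is split as x_0 = [x_0', x_0''], x_T is the output of the transition layer after the dense block, x_U is the output after the final transition, * is convolution, [ ... ] is concatenation, f is the weight-update function, and g denotes the gradient propagated to the corresponding layer.

```latex
% Feed-forward pass of CSPDenseNet, with x_0 = [x_0', x_0'']
\begin{aligned}
x_k &= w_k * [x_0'', x_1, \dots, x_{k-1}] \\
x_T &= w_T * [x_0'', x_1, \dots, x_k] \\
x_U &= w_U * [x_0', x_T]
\end{aligned}

% Weight updates by backpropagation
\begin{aligned}
w_k' &= f(w_k, g_0'', g_1, \dots, g_{k-1}) \\
w_T' &= f(w_T, g_0'', g_1, \dots, g_k) \\
w_U' &= f(w_U, g_0', g_T)
\end{aligned}
```

Note that g_0' only appears in the update of w_U, while the dense layers and the first transition only see g_0'' and the dense-layer gradients.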

The proposed CSPDenseNet preserves the advantages of DenseNet’s feature reuse characteristics, but at the same time prevents an excessive amount of duplicate gradient information by truncating the gradient flow.
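A minimal channel-bookkeeping sketch of the split may help here. It is only an illustration: the base width of 64 channels, the growth rate of 32, the 4 dense layers, the 50/50 split, and the choice of which half enters the dense block are all hypothetical values, not taken from the paper.

```python
def dense_layer_inputs(in_channels, num_layers, growth):
    """Input channel count seen by each dense layer.

    Every layer adds `growth` channels to the running concatenation.
    """
    return [in_channels + i * growth for i in range(num_layers)]


def csp_split(channels, ratio=0.5):
    """Split base-layer channels into (bypass part, dense part).

    `ratio` is the fraction routed around the dense block (an assumption).
    """
    bypass = int(channels * ratio)
    return bypass, channels - bypass


c, d, m = 64, 32, 4  # hypothetical base channels, growth rate, layer count

plain = dense_layer_inputs(c, m, d)           # ordinary dense block
bypass, dense_in = csp_split(c)               # CSP split of the base layer
partial = dense_layer_inputs(dense_in, m, d)  # partial dense block

print("dense block inputs:        ", plain)
print("partial dense block inputs:", partial, "with", bypass, "bypassed channels")
```

Every dense layer in the partial block sees fewer input channels than its counterpart in the plain block, which is where the savings in computation and memory traffic come from.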

2.2. Partial Dense Block Variants

Different kinds of feature fusion strategies
  • The purpose of designing partial transition layers is to maximize the difference of gradient combination.
  • Two variants are designed.
  • CSP (Fusion First): the feature maps generated by the two parts are concatenated first, and then the transition operation is applied.
  • If this strategy is adopted, a large amount of gradient information will be reused.
  • CSP (Fusion Last): the output from the dense block goes through the transition layer first, and is then concatenated with the other part.
  • In this case, the gradient information will not be reused since the gradient flow is truncated.
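The difference between the two variants in the list above is only the order of concatenation and transition. The sketch below tracks this symbolically with string labels instead of tensors; the function names, the labels, and the returned dictionary layout are hypothetical and chosen for illustration.

```python
def fusion_first(bypass_part, dense_out):
    """Concatenate both parts, then apply a single transition (labelled "T")."""
    return {
        "transition_input": bypass_part + dense_out,  # transition sees both parts
        "stage_output": ["T"],
    }


def fusion_last(bypass_part, dense_out):
    """Apply the transition to the dense output only, then concatenate."""
    return {
        "transition_input": list(dense_out),   # transition sees the dense path only
        "stage_output": bypass_part + ["T"],   # bypass part skips the transition
    }


bypass_part = ["x0'"]
dense_out = ["x0''", "x1", "x2"]

first = fusion_first(bypass_part, dense_out)
last = fusion_last(bypass_part, dense_out)
print("fusion first:", first)
print("fusion last: ", last)
```

In the fusion-first case the bypassed part shares the transition layer with the dense path, whereas in the fusion-last case it never enters that layer.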

2.3. Applying CSPNet to Other Architectures

Applying CSPNet to ResNe(X)t
  • CSPNet can also be easily applied to ResNet and ResNeXt.
  • Since only half of the feature channels are going through Res(X)Blocks, there is no need to introduce the bottleneck layer anymore.

3. Exact Fusion Model (EFM)

Different feature pyramid fusion strategies
  • EFM is proposed to capture an appropriate Field of View (FoV) for each anchor, which enhances the accuracy of the one-stage object detector.
  • EFM is proposed to better aggregate the initial feature pyramid.
  • Since the concatenated feature maps from the feature pyramid are enormous, they introduce a great amount of memory and computation cost. To alleviate the problem, the Maxout technique is incorporated to compress the feature maps.
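As a rough illustration of Maxout-style compression, the sketch below takes the channel vector at one spatial position and keeps the maximum of each group of consecutive channels. The grouping into consecutive channels and the group size are assumptions made for this example; how EFM groups channels is not stated here.

```python
def maxout_channels(feature, group_size):
    """Compress a channel vector by taking the max over consecutive groups.

    `feature` holds the channel values at one spatial position.
    """
    if len(feature) % group_size != 0:
        raise ValueError("channel count must be divisible by group_size")
    return [
        max(feature[i:i + group_size])
        for i in range(0, len(feature), group_size)
    ]


channels = [0.2, -1.0, 0.7, 0.1, 0.5, 0.3]  # hypothetical 6-channel vector
compressed = maxout_channels(channels, 2)
print(compressed)  # [0.2, 0.7, 0.5]
```

With a group size of 2, the number of channels is halved, and so is the memory needed for the compressed map.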

4. Ablation Study

4.1. CSPNet on ImageNet

Ablation study of CSPNet on ImageNet
  • PeleeNet is used as baseline.
  • Different partial ratios γ and the different feature fusion strategies are used for ablation study.

Compared to the baseline PeleeNet, the proposed CSPPeleeNet achieves the best performance: it cuts computation by 13% while improving accuracy by 0.2%.

If the partial ratio is adjusted to γ = 0.25, accuracy improves by 0.8% while computation is cut by 3%.

4.2. EFM on MS COCO

Ablation study of EFM on MS COCO
  • CSPPeleeNet is used as backbone.
  • GIoU, SPP (in SPPNet), and SAM (in CBAM) are also applied to EFM for study.
  • PRN and ThunderNet are included for comparison.
  • Although GIoU improves AP by 0.7%, it significantly degrades AP50 by 2.7%, so GIoU training is not used in the end.
  • Since SAM is better than SPP, EFM (SAM) is used as the final architecture.
  • In addition, CSPPeleeNet with Swish activation is not adopted, in consideration of hardware design for acceleration.
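For readers unfamiliar with GIoU, mentioned in the list above, here is a minimal sketch of IoU and Generalized IoU for two axis-aligned boxes. The box format (x1, y1, x2, y2) and the function name are assumptions for illustration; this is not the training code used in the paper.

```python
def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both inputs
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    enclose = cw * ch

    # GIoU subtracts the fraction of the enclosing box not covered by the union
    giou = iou - (enclose - union) / enclose
    return iou, giou


print(iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping boxes
print(iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

Unlike IoU, GIoU stays informative for disjoint boxes (it becomes negative), which is why 1 - GIoU is commonly used as a bounding-box regression loss.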

5. SOTA Comparison

5.1. ImageNet Image Classification

Compare with state-of-the-art methods on ImageNet
  • There are a lot of findings here for each CSPNet model.
  • But basically, when the concept of CSPNet is introduced, the computational load is reduced by at least 10% and the accuracy either remains unchanged or is improved.
  • CSPResNeXt-50 achieves the best result. As to the 10-crop test, CSPResNeXt-50 also outperforms Res2Net-50 [5] and Res2NeXt-50 [5].

5.2. MS COCO Object Detection

Compare with state-of-the-art methods on MS COCO Object Detection
  • If compared to object detectors running at 30~100 fps, CSPResNeXt50 with PANet (SPP) achieves the best performance in AP, AP50 and AP75, reaching 38.4%, 60.6%, and 41.6%, respectively.
  • If compared to state-of-the-art LRF [38] under the input image size 512×512, CSPResNeXt50 with PANet (SPP) outperforms ResNet101 with LRF by 0.7% AP, 1.5% AP50 and 1.1% AP75.
  • If compared to object detectors running at 100~200 fps, CSPPeleeNet with EFM (SAM) boosts AP50 by 12.1% at the same speed as Pelee [37] and by 4.1% at the same speed as CenterNet [45].
  • If compared to very fast object detectors such as ThunderNet [25], YOLOv3-tiny [29], and YOLOv3-tiny-PRN [35], the proposed CSPDenseNetb Reference with PRN is the fastest. It can reach 400 fps frame rate, i.e., 133 fps faster than ThunderNet with SNet49.
  • If compared to ThunderNet146, CSPPeleeNet Reference with PRN (3l) increases the frame rate by 19 fps while maintaining the same level of AP50.
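The frame rates quoted in the list above can be turned into per-frame latencies with simple arithmetic. The sketch below only uses the two numbers stated in the text (400 fps and the 133 fps gap); the ThunderNet frame rate is derived from them, not quoted from the paper, and the variable names are hypothetical.

```python
def frame_time_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


csp_fps = 400                   # CSPDenseNetb Reference with PRN, as stated above
thundernet_fps = csp_fps - 133  # implied by "133 fps faster than ThunderNet with SNet49"

print(thundernet_fps)
print(frame_time_ms(csp_fps), "ms per frame")
print(round(frame_time_ms(thundernet_fps), 2), "ms per frame")
```

At these speeds, a 133 fps gap corresponds to a little over one millisecond per frame.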

5.3. Inference Rate

Inference rate on mobile GPU (mGPU) and CPU real-time object detectors (in fps)
  • The above experiments are based on NVIDIA Jetson TX2 and Intel Core i9-9900K with the OpenCV DNN module. No model compression or quantization is applied.
  • Similarly, models with CSPNet applied achieve both high fps and high AP50.


