Review — YOLOv4: Optimal Speed and Accuracy of Object Detection

YOLOv4, Using Better backbone CSPNet, Bag of Freebies (BoF) and Bag of Specials (BoS), Outperforms EfficientDet, ASFF, NAS-FPN, CenterNet, CornerNet, etc.

Aug 15, 2021

YOLOv4 (YouTube link provided from Author’s Medium, link at the bottom)

In this story, YOLOv4: Optimal Speed and Accuracy of Object Detection (YOLOv4), by the Institute of Information Science, Academia Sinica, is reviewed. In this paper:

YOLOv4 uses CSPDarknet53 as the backbone, SPP and PANet as the neck, and YOLOv3 as the head.
Plenty of tools are tested, in order to select the best set of tools for YOLOv4 to boost the accuracy while keeping fast inference. These tools are called Bag of Freebies (BoF) and Bag of Specials (BoS).
It was later extended as Scaled-YOLOv4 at CVPR 2021.

This is a paper in 2020 arXiv with over 1300 citations. (YOLOv4 authors are not those in YOLOv3.) (Sik-Ho Tsang @ Medium)

**YOLOv4 runs twice as fast as EfficientDet with comparable performance, and improves YOLOv3’s AP and FPS by 10% and 12%, respectively.**

Outline

YOLOv4: Network Architecture
Additional Improvements
Bag of Freebies (BoF) and Bag of Specials (BoS) for Backbone and Detector
Ablation Study
SOTA Comparison

1. YOLOv4: Network Architecture

**YOLOv4: Network Architecture (Figure from** https://aiacademy.tw/yolo-v4-intro/)

1.1. Selection of Backbone

CSPResNeXt50 is found to be considerably better than CSPDarknet53 at object classification on the ILSVRC2012 (ImageNet) dataset.
Conversely, CSPDarknet53 is better than CSPResNeXt50 at detecting objects on the MS COCO dataset.
(This CSPNet can be applied to different backbones, such as ResNet, ResNeXt, and DenseNet, to become CSPResNet, CSPResNeXt, and CSPDenseNet.)

1.2. Selection of Additional Blocks

In contrast to an image classifier, the detector requires the following:

Higher input network size (resolution) — for detecting multiple small-sized objects.
More layers — for a higher receptive field to cover the increased size of input network.
More parameters — for greater capacity of a model to detect multiple objects of different sizes in a single image.

The influence of the receptive field with different sizes is summarized as follows:

Up to the object size — allows viewing the entire object.
Up to network size — allows viewing the context around the object
Exceeding the network size — increases the number of connections between the image point and the final activation.

Based on the above considerations:

The SPP block, proposed in SPPNet, is added over CSPDarknet53, since it significantly increases the receptive field.
PANet is used as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.
Cross-GPU Batch Normalization (CGBN or SyncBN) and expensive specialized devices are NOT used, so that a conventional graphics processor, e.g. GTX 1080Ti or RTX 2080Ti, can also train the model.

Finally, the CSPDarknet53 backbone, the SPP additional module, the PANet path-aggregation neck, and the YOLOv3 (anchor-based) head are chosen as the architecture of YOLOv4.
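To make the SPP block more concrete, here is a minimal NumPy sketch. It assumes the variant in which the feature map is max-pooled with stride 1 at several kernel sizes and the results are concatenated with the input along the channel axis. The kernel sizes (5, 9, 13) and the helper names are assumptions for illustration, not something stated in this review.

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding on a (C, H, W) array (k must be odd)."""
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    # windows has shape (C, H, W, k, k): one k x k window per output position.
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(1, 2))
    return windows.max(axis=(-2, -1))

def spp_block(x, kernel_sizes=(5, 9, 13)):
    """Concatenate the input with its pooled versions along the channel axis."""
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernel_sizes], axis=0)

features = np.random.default_rng(0).normal(size=(8, 19, 19))
out = spp_block(features)
print(out.shape)  # (32, 19, 19): spatial size is kept, channels are multiplied by 4
```

Because the stride is 1, the spatial resolution is unchanged while each output position sees a much larger neighbourhood, which is where the receptive-field gain comes from.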

2. Additional Improvements

2.1. Mosaic

**Mosaic represents a new method of data augmentation**

Mosaic represents a new data augmentation method that mixes 4 training images.
Thus 4 different contexts are mixed, while CutMix mixes only 2 input images. This allows detection of objects outside their normal context.
In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
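The idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors’ implementation: it assumes four images that are each at least as large as the output canvas, takes the crop from the top-left corner of each image, and leaves out the shifting and clipping of bounding boxes. The function name and parameters are hypothetical.

```python
import numpy as np

def mosaic(images, out_size=416, rng=None):
    """Tile crops of 4 images (H, W, 3 uint8 arrays) around a random center point."""
    if rng is None:
        rng = np.random.default_rng()
    assert len(images) == 4
    # Random split point, kept away from the borders so every image stays visible.
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # (y0, y1, x0, x1) of the top-left, top-right, bottom-left, bottom-right regions.
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        canvas[y0:y1, x0:x1] = img[:y1 - y0, :x1 - x0]
    return canvas, (cx, cy)

imgs = [np.full((416, 416, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
canvas, (cx, cy) = mosaic(imgs, rng=np.random.default_rng(0))
```

A real pipeline would also move each image’s box labels into the canvas coordinates and drop boxes that fall outside their region.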

2.2. Self-Adversarial Training (SAT)

Self-Adversarial Training (SAT) is also a new data augmentation technique, which operates in 2 forward-backward stages.
In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object on the image.
In the 2nd stage, the neural network is trained to detect an object on this modified image in the normal way.
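The two stages can be illustrated on a toy model. The sketch below assumes a single logistic “object present” scorer in place of a detector, and an FGSM-style sign step as the attack; neither choice comes from the paper, which does not spell out the attack in this review. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sat_step(w, x, eps=0.1, lr=0.5):
    """One self-adversarial step for a toy scorer p = sigmoid(w . x)."""
    # Stage 1: keep the weights fixed and alter the image so the object score drops.
    p = sigmoid(w @ x)
    grad_x = p * (1.0 - p) * w            # derivative of p with respect to x
    x_adv = x - eps * np.sign(grad_x)     # assumed FGSM-style perturbation
    # Stage 2: train normally on the altered image with the original label (object = 1).
    p_adv = sigmoid(w @ x_adv)
    grad_w = (p_adv - 1.0) * x_adv        # gradient of binary cross-entropy w.r.t. w
    return w - lr * grad_w, x_adv

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.2, 0.3])
w_new, x_adv = sat_step(w, x)
```

After stage 1 the score on the altered image is lower than on the original, and after stage 2 the updated weights score the altered image higher again, which is the behaviour the two stages describe.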

2.3. Cross mini-Batch Normalization (CmBN)

Cross mini-Batch Normalization (CmBN) is a modified version of Cross-Iteration Batch Normalization (CBN), as shown above.
It collects statistics only between the mini-batches within a single batch.
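A rough sketch of the statistics accumulation, under simplifying assumptions: equal-sized mini-batches, no learnable scale and shift, and no weight update logic. It only shows how the mean and variance are pooled over the mini-batches seen so far in the current batch and start from scratch for the next batch. The function name is hypothetical.

```python
import numpy as np

def cmbn_forward(mini_batches, eps=1e-5):
    """Normalize each (N, C) mini-batch with statistics accumulated over the
    mini-batches seen so far in the same batch. Returns outputs and final stats."""
    sum_mean, sum_sq, outputs = 0.0, 0.0, []
    mean = var = None
    for k, x in enumerate(mini_batches, start=1):
        sum_mean = sum_mean + x.mean(axis=0)        # running sum of E[x]
        sum_sq = sum_sq + (x ** 2).mean(axis=0)     # running sum of E[x^2]
        mean = sum_mean / k
        var = sum_sq / k - mean ** 2                # Var[x] = E[x^2] - E[x]^2
        outputs.append((x - mean) / np.sqrt(var + eps))
    return outputs, (mean, var)

rng = np.random.default_rng(0)
batch = [rng.normal(size=(8, 3)) for _ in range(4)]   # one batch = 4 mini-batches
outs, (mean, var) = cmbn_forward(batch)
```

By the last mini-batch the statistics match those of the whole batch, even though only one mini-batch is held in memory at a time.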

2.4. Modified Spatial Attention Module (SAM)

Spatial Attention Module (SAM) in CBAM is modified as above.
Max pooling and average pooling are replaced by convolution.
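As a hedged illustration of the modified module: instead of pooling across channels to get one spatial mask, a convolution produces a mask with the same shape as the feature map, which is passed through a sigmoid and multiplied with the input. The 1×1 kernel used here is an assumption made to keep the sketch short.

```python
import numpy as np

def modified_sam(x, weight, bias):
    """x: (C, H, W) features; weight: (C, C) for a 1x1 convolution; bias: (C,)."""
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    logits = np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]
    mask = 1.0 / (1.0 + np.exp(-logits))   # per-element attention in (0, 1)
    return x * mask

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 5))
out = modified_sam(x, rng.normal(size=(4, 4)), np.zeros(4))
```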

2.5. Modified Path Aggregation Network (PAN)

Addition is replaced by concatenation for feature map fusion.

3. Bag of Freebies (BoF) and Bag of Specials (BoS) for Backbone and Detector

An object detection network consists of a backbone and a detector.
Both the backbone and the detector can use BoF and BoS.
BoF are techniques that change only the training strategy or training cost, without affecting the architecture or the inference cost.
BoS are techniques that do affect the architecture: plugin modules and post-processing steps that add a small inference cost for a significant accuracy gain.
There are many candidate BoF and BoS techniques. In the end, the authors chose the following BoF and BoS based on performance:

3.1. BoF for Backbone

CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing.
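Of these, class label smoothing is the simplest to show. A minimal sketch, assuming the common uniform formulation with a smoothing factor of 0.1 (the value is an assumption, not taken from this review):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Move eps of the probability mass from the true class to a uniform distribution."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

labels = np.eye(4)[[2, 0]]          # two samples, four classes
smoothed = smooth_labels(labels)
```

Each row still sums to 1, but the target for the true class is slightly below 1, which discourages over-confident predictions.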

3.2. BoS for Backbone

Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC).
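For reference, Mish is defined as x · tanh(softplus(x)). A small NumPy sketch, using `np.logaddexp` for a numerically stable softplus:

```python
import numpy as np

def mish(x):
    """Mish(x) = x * tanh(softplus(x)), with softplus(x) = log(1 + exp(x))."""
    softplus = np.logaddexp(0.0, x)   # stable for large positive and negative x
    return x * np.tanh(softplus)

values = mish(np.array([-10.0, 0.0, 1.0, 10.0]))
```

Unlike ReLU, the function is smooth and lets small negative values through, while behaving almost like the identity for large positive inputs.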

3.3. BoF for Detector

CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler, Optimal hyperparameters, Random training shapes.
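As an example of one item in this list, a cosine annealing schedule can be sketched as below. The single-cycle form without warm-up or restarts, and the default learning rates, are assumptions for illustration.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Decay the learning rate from lr_max to lr_min along half a cosine wave."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_annealing_lr(s, 100) for s in range(101)]
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end of training.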

3.4. BoS for Detector

Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS.
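DIoU-NMS is ordinary greedy NMS in which the suppression test uses DIoU (IoU minus a normalized center-distance penalty) instead of plain IoU. The sketch below is a minimal single-class version; the threshold of 0.5 and the function names are assumptions.

```python
import numpy as np

def diou(box, boxes):
    """DIoU between one box (4,) and N boxes (N, 4), all as x1, y1, x2, y2."""
    iw = np.clip(np.minimum(box[2], boxes[:, 2]) - np.maximum(box[0], boxes[:, 0]), 0, None)
    ih = np.clip(np.minimum(box[3], boxes[:, 3]) - np.maximum(box[1], boxes[:, 1]), 0, None)
    inter = iw * ih
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area + areas - inter)
    # Squared distance between the two box centers.
    d2 = ((box[0] + box[2] - boxes[:, 0] - boxes[:, 2]) ** 2
          + (box[1] + box[3] - boxes[:, 1] - boxes[:, 3]) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = np.maximum(box[2], boxes[:, 2]) - np.minimum(box[0], boxes[:, 0])
    ch = np.maximum(box[3], boxes[:, 3]) - np.minimum(box[1], boxes[:, 1])
    return iou - d2 / (cw ** 2 + ch ** 2)

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the best box, drop remaining boxes whose DIoU with it exceeds threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[diou(boxes[best], boxes[rest]) <= threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = diou_nms(boxes, scores)
```

Because the center distance lowers the score, two overlapping boxes whose centers are far apart are less likely to suppress each other than under plain IoU.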

4. Ablation Study

4.1. Influence of Different Features on Classifier Training (ImageNet)

**Various methods of data augmentation are tested to choose the best one**

**Influence of BoF and** **Mish** **on the** **CSPResNeXt-50 classifier accuracy**

**Influence of BoF and** **Mish** **on the CSPDarknet-53 classifier accuracy**

Two classifiers, CSPResNeXt and CSPDarknet, are tried.
The proposed BoF-backbone (Bag of Freebies) for classifier training includes the following: CutMix and Mosaic data augmentation and Class label smoothing.
Mish activation acts as a complementary option.

Though CSPResNeXt seems to have higher accuracy than CSPDarknet in the image classification task, it is later found that CSPDarknet is better in the object detection task, which will be shown below.

4.2. Influence of BoF on Detector Training (MS COCO)

**Ablation Studies of Bag-of-Freebies. (CSPResNeXt50**-**PANet**-**SPP, 512×512).**

The BoF list is significantly expanded through studying different features that increase the detector accuracy without affecting FPS:

S, Eliminate Grid Sensitivity: The bounding box equations bx = σ(tx)+cx and by = σ(ty)+cy are used in YOLOv3. However, extremely high absolute values of tx are required for bx to approach cx or cx+1. The sigmoid is therefore multiplied by a factor exceeding 1.0, which eliminates the effect of the grid on which the object is undetectable.
M, Mosaic data augmentation: Using the 4-image mosaic during training instead of single image.
IT, IoU threshold: Using multiple anchors for a single ground truth, namely every anchor with IoU(truth, anchor) > IoU threshold.
GA, Genetic algorithms: Selecting the optimal hyperparameters during network training on the first 10% of time periods.
LS, Class label smoothing: Using class label smoothing for sigmoid activation.
CBN, CmBN: Using Cross mini-Batch Normalization for collecting statistics inside the entire batch, instead of collecting statistics inside a single mini-batch.
CA, Cosine annealing scheduler: Altering the learning rate along a sinusoidal (cosine) curve during training.
DM, Dynamic mini-batch size: Automatic increase of mini-batch size during small resolution training by using Random training shapes.
OA, Optimized Anchors: Using the optimized anchors for training with the 512×512 network resolution.
GIoU, CIoU, DIoU, MSE: Different loss algorithms for bounding box regression.
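The “S” item in the list above is easy to check numerically. The sketch below decodes a box center with and without the scaling factor. The recentring term, which keeps the output symmetric around the middle of the cell, follows common implementations and is an assumption here, as is the example factor of 1.2.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_center(t, cell, scale=1.0):
    """scale=1.0 gives the YOLOv3 formula b = sigmoid(t) + c.
    scale > 1.0 stretches the sigmoid around 0.5 so the cell borders become reachable."""
    return scale * sigmoid(t) - 0.5 * (scale - 1.0) + cell

plain = decode_center(4.0, cell=3)               # still short of the border at 4
scaled = decode_center(4.0, cell=3, scale=1.2)   # reaches the border with a moderate t
```

With the plain sigmoid, the center only approaches the cell border as t goes to infinity; with the scaled version, a moderate value of t is enough.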

Finally, S+M+IT+GA+OA+GIoU/CIoU obtains the highest accuracy.
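For readers unfamiliar with CIoU, here is a small sketch of the measure as it is usually defined: IoU minus a normalized center-distance term and an aspect-ratio term. The regression loss is then 1 − CIoU. Boxes are assumed to be (x1, y1, x2, y2) with positive width and height; the function name is hypothetical.

```python
import math

def ciou(a, b):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    wa, ha, wb, hb = a[2] - a[0], a[3] - a[1], b[2] - b[0], b[3] - b[1]
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    iou = inter / (wa * ha + wb * hb - inter)
    # Squared center distance over the squared diagonal of the enclosing box.
    d2 = ((a[0] + a[2] - b[0] - b[2]) ** 2 + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4.0
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2
          + (max(a[3], b[3]) - min(a[1], b[1])) ** 2)
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return iou - d2 / c2 - alpha * v

loss_same = 1.0 - ciou((0, 0, 10, 10), (0, 0, 10, 10))
loss_far = 1.0 - ciou((0, 0, 10, 10), (20, 20, 30, 30))
```

Unlike a plain IoU loss, the value keeps changing for boxes that do not overlap at all, so the regression still has a direction to move in.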

4.3. Influence of BoS on Detector Training (MS COCO)

**Ablation Studies of Bag-of-Specials. (Size 512×512)**

For BoS, the detector gets best performance when using SPP, PAN, and SAM.

4.4. Influence of Different Backbones and Pretrained Weightings on Detector Training

**Using different classifier pre-trained weightings for detector training (all other training parameters are similar in all models).**

The model characterized with the best classification accuracy is not always the best in terms of the detector accuracy.
Although classification accuracy of CSPResNeXt-50 models trained with different features is higher compared to CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in terms of object detection.
Using BoF and Mish for the CSPResNeXt50 classifier training increases its classification accuracy, but further application of these pre-trained weightings for detector training reduces the detector accuracy.
Using BoF and Mish for the CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector that uses the classifier’s pre-trained weightings.

The net result is that the CSPDarknet53 backbone is more suitable for the detector than CSPResNeXt50.

4.5. Influence of Different Minibatch Size on Detector Training

After adding BoF and BoS training strategies, the mini-batch size has almost no effect on the detector’s performance.

This result shows that after the introduction of BoF and BoS, it is no longer necessary to use expensive GPUs for training. In other words, anyone can use only a conventional GPU to train an excellent detector.

5. SOTA Comparison

**Comparison of the speed and accuracy of different object detectors Using one GPU of either Maxwell/Pascal/Volta Type**

As seen above, across the different GPU types, the YOLOv4 models lie on the Pareto optimality curve and are superior to the fastest and most accurate detectors in terms of both speed and accuracy.

YOLOv4 outperforms EfficientDet, ASFF, NAS-FPN, CenterNet, CornerNet, etc.

**Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017)**

The above table lists the frame rate comparison results of using Maxwell GPU. (There are also tables for Pascal and Volta GPU. Please feel free to read the paper.)
Real-time detectors with FPS 30 or higher are highlighted in blue.

YOLOv4 obtains the highest AP while achieving real-time FPS.

The authors verified a large number of features and selected those that improve the accuracy of both the classifier and the detector. These features can be used as best practice for future studies and developments.

Later on, the authors published Scaled-YOLOv4 at CVPR 2021. I hope to review it in the future. (Though there are already many articles about it on the Internet.)