Review — YOLOv4: Optimal Speed and Accuracy of Object Detection
In this story, YOLOv4: Optimal Speed and Accuracy of Object Detection (YOLOv4), by the Institute of Information Science, Academia Sinica, is reviewed. In this paper:
- YOLOv4 uses CSPDarknet53 as the backbone, SPP and PANet as the neck, and YOLOv3 as the head.
- Plenty of tools are tested, in order to select the best set of tools for YOLOv4 to boost the accuracy while keeping fast inference. These tools are called Bag of Freebies (BoF) and Bag of Specials (BoS).
- It was later extended as Scaled-YOLOv4 in CVPR 2021.
Outline
- YOLOv4: Network Architecture
- Additional Improvements
- Bag of Freebies (BoF) and Bag of Specials (BoS) for Backbone and Detector
- Ablation Study
- SOTA Comparison
1. YOLOv4: Network Architecture
1.1. Selection of Backbone
- It is found that CSPResNeXt50 is considerably better than CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset.
- Conversely, CSPDarknet53 is better than CSPResNeXt50 in terms of detecting objects on the MS COCO dataset.
- (CSPNet can be applied to different backbones, such as ResNet, ResNeXt, and DenseNet, to become CSPResNet, CSPResNeXt, and CSPDenseNet.)
1.2. Selection of Additional Blocks
- In contrast to an image classifier, the detector requires the following:
- Higher input network size (resolution) — for detecting multiple small-sized objects.
- More layers — for a higher receptive field to cover the increased size of input network.
- More parameters — for greater capacity of a model to detect multiple objects of different sizes in a single image.
- The influence of the receptive field with different sizes is summarized as follows:
- Up to the object size — allows viewing the entire object.
- Up to network size — allows viewing the context around the object
- Exceeding the network size — increases the number of connections between the image point and the final activation.
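To make the link between depth and receptive field concrete, here is a minimal sketch that computes the receptive field of a plain stack of convolution layers. The layer stacks are hypothetical and chosen only for illustration; they are not the CSPDarknet53 layout.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a plain stack of layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # a single unit initially sees one input pixel
    jump = 1  # distance, in input pixels, between neighbouring units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks, for illustration only.
shallow = [(3, 1)] * 3          # three 3x3 convolutions, stride 1
deep = [(3, 2), (3, 1)] * 3     # six 3x3 convolutions, every other one strided

print(receptive_field(shallow))  # 7
print(receptive_field(deep))     # 43
```

Adding layers (and strides) grows the receptive field quickly, which is why a detector with a larger input resolution also needs a deeper network to cover it.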
- Based on the above considerations:
- The SPP block, proposed in SPPNet, is added over CSPDarknet53, since it significantly increases the receptive field.
- PANet is used as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.
- Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices are NOT used, so that a conventional graphics processor, e.g. GTX 1080Ti or RTX 2080Ti, can also train the model.
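The way an SPP block enlarges the receptive field can be sketched in plain Python. This is a simplified single-channel illustration, not the authors' implementation. The kernel sizes (1, 5, 9, 13) with stride 1 are an assumption taken from the YOLOv4 paper rather than from this review, and `max_pool_same` and `spp_block` are hypothetical helper names.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with a k x k window.

    Windows are clipped at the borders, so the output keeps the input size.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp_block(fmap, kernel_sizes=(1, 5, 9, 13)):
    """Return one pooled map per kernel size.

    Stacking the returned maps stands in for channel concatenation.
    """
    return [max_pool_same(fmap, k) for k in kernel_sizes]


# A 13x13 map with a single activation in the centre.
fmap = [[0] * 13 for _ in range(13)]
fmap[6][6] = 1
branches = spp_block(fmap)

# Number of positions that "see" the centre activation in each branch.
print([sum(map(sum, b)) for b in branches])  # [1, 25, 81, 169]
```

Each branch keeps the spatial size, so the outputs can be concatenated directly, while the larger windows let every position see a much wider neighbourhood.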
2. Additional Improvements
2.1. Mosaic
- Mosaic is a new data augmentation method that mixes 4 training images.
- Thus 4 different contexts are mixed, while CutMix mixes only 2 input images. This allows detection of objects outside their normal context.
- In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
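A heavily simplified sketch of the Mosaic idea follows: four equally sized images are tiled into one canvas and their bounding boxes are shifted accordingly. Real implementations typically also scale and crop the images around a random centre point; that part is omitted here, and the function name `mosaic` and the box format are assumptions for illustration.

```python
def mosaic(images, boxes, size):
    """Tile four size x size images into one (2*size) x (2*size) canvas.

    images: four 2D lists (rows of pixel values).
    boxes:  four lists of (x_min, y_min, x_max, y_max), one list per image.
    Returns the canvas and all boxes shifted into canvas coordinates.
    """
    assert len(images) == 4 and len(boxes) == 4
    canvas = [[0] * (2 * size) for _ in range(2 * size)]
    # (dx, dy) offset of each quadrant: top-left, top-right, bottom-left, bottom-right
    offsets = [(0, 0), (size, 0), (0, size), (size, size)]
    merged = []
    for img, img_boxes, (dx, dy) in zip(images, boxes, offsets):
        for y in range(size):
            for x in range(size):
                canvas[dy + y][dx + x] = img[y][x]
        for x0, y0, x1, y1 in img_boxes:
            merged.append((x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return canvas, merged


# Four toy 2x2 "images" filled with the values 1..4, each with one box.
images = [[[v] * 2 for _ in range(2)] for v in (1, 2, 3, 4)]
boxes = [[(0, 0, 1, 1)] for _ in range(4)]
canvas, merged = mosaic(images, boxes, size=2)
for row in canvas:
    print(row)
print(merged)
```

Because a single training sample now contains content from 4 images, the statistics seen by batch normalization already come from 4 different contexts.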
2.2. Self-Adversarial Training (SAT)
- Self-Adversarial Training (SAT) is also a new data augmentation technique; it operates in 2 forward-backward stages.
- In the 1st stage, the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object in the image.
- In the 2nd stage, the neural network is trained to detect an object on this modified image in the normal way.
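The two stages can be illustrated with a toy example. The sketch below replaces the detector with a tiny logistic model that outputs the probability that an object is present, and uses a sign-gradient step for the attack; both are assumptions made for illustration, since the review does not describe how the image is altered. `sat_step` and its parameters are hypothetical names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability that an object is present, from a toy logistic 'detector'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(g):
    return (g > 0) - (g < 0)


def sat_step(w, b, x, eps=0.1, lr=0.5):
    """One self-adversarial step on an input x that contains an object."""
    # Stage 1: keep the weights fixed and alter the input so the object
    # looks absent. With loss = -log(p), d(loss)/dx_i = -(1 - p) * w_i,
    # and stepping along the sign of that gradient lowers p.
    p = predict(w, b, x)
    grad_x = [-(1.0 - p) * wi for wi in w]
    x_adv = [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

    # Stage 2: ordinary gradient step on the weights, using the altered
    # input with the label still set to "object present".
    p_adv = predict(w, b, x_adv)
    w_new = [wi + lr * (1.0 - p_adv) * xi for wi, xi in zip(w, x_adv)]
    b_new = b + lr * (1.0 - p_adv)
    return x_adv, w_new, b_new


w, b, x = [1.0, -2.0], 0.0, [0.5, -0.5]
x_adv, w_new, b_new = sat_step(w, b, x)
print(predict(w, b, x), predict(w, b, x_adv), predict(w_new, b_new, x_adv))
```

The printed probabilities first drop (the attack hides the object) and then rise again after the weight update on the altered input.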
2.3. Cross mini-Batch Normalization (CmBN)
- Cross mini-Batch Normalization (CmBN) is a modified version of Cross-Iteration Batch Normalization (CBN), as shown above.
- This collects statistics only between mini-batches within a single batch.
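As a rough sketch of "collecting statistics between mini-batches within a single batch", the class below pools the mean and variance over every mini-batch seen since the start of the current batch. It only covers the statistics bookkeeping for one scalar feature; how gradients and learnable parameters are handled is not modelled, and the class name is hypothetical.

```python
class CrossMiniBatchStats:
    """Accumulate mean and variance over the mini-batches of one batch."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Call at the start of every batch so statistics never cross batches.
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, values):
        """Add one mini-batch of activations.

        Returns (mean, variance) over all mini-batches since the last reset.
        """
        self.count += len(values)
        self.total += sum(values)
        self.total_sq += sum(v * v for v in values)
        mean = self.total / self.count
        var = self.total_sq / self.count - mean * mean
        return mean, var


stats = CrossMiniBatchStats()
print(stats.update([1.0, 2.0]))  # statistics of the first mini-batch only
print(stats.update([3.0, 4.0]))  # statistics pooled over both mini-batches
stats.reset()                    # a new batch starts from scratch
```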
2.4. Modified Spatial Attention Module (SAM)
- Spatial Attention Module (SAM) in CBAM is modified as above.
- Max pooling and average pooling are replaced by convolution.
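One possible reading of this modification is a gate computed by a convolution and a sigmoid, applied element-wise to the input features. The sketch below assumes a 1x1 convolution, which the review does not state, and uses hypothetical names and weights.

```python
import math


def modified_sam(features, weights):
    """Scale each feature value by a sigmoid gate computed with a 1x1 convolution.

    features: list of C channels, each an H x W list of floats.
    weights:  C x C matrix of the 1x1 convolution (hypothetical values).
    """
    c, h, w = len(features), len(features[0]), len(features[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for y in range(h):
        for x in range(w):
            for o in range(c):
                z = sum(weights[o][k] * features[k][y][x] for k in range(c))
                gate = 1.0 / (1.0 + math.exp(-z))  # attention value in (0, 1)
                out[o][y][x] = features[o][y][x] * gate
    return out


features = [[[2.0, 4.0]], [[-1.0, 0.0]]]   # 2 channels, 1 x 2 spatial size
zero_weights = [[0.0, 0.0], [0.0, 0.0]]    # makes every gate equal to 0.5
print(modified_sam(features, zero_weights))
```

No pooling is involved, so every spatial position and channel gets its own attention value.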
2.5. Modified Path Aggregation Network (PAN)
- Addition is replaced by concatenation for feature map fusion.
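The difference between the two fusion styles is easiest to see on the channel count. In this small sketch a feature map is a list of channels and each channel is a flat list of values; the helper names are hypothetical.

```python
def fuse_add(a, b):
    """Element-wise addition: both inputs need the same number of channels."""
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def fuse_concat(a, b):
    """Channel concatenation: the channels of b are appended after those of a."""
    return a + b


a = [[1, 2], [3, 4]]          # 2 channels
b = [[10, 20], [30, 40]]      # 2 channels

print(len(fuse_add(a, b)))     # 2 channels, values are mixed
print(len(fuse_concat(a, b)))  # 4 channels, values are kept separate
```

With concatenation, the following layer decides how to combine the two sources instead of having them summed beforehand.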
3. Bag of Freebies (BoF) and Bag of Specials (BoS) for Backbone and Detector
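- An object detection network consists of a backbone and a detector.
- Both the backbone and the detector can use BoF and BoS.
- BoF are techniques that only change the training strategy or increase the training cost, without affecting the architecture or the inference cost.
- BoS are techniques that do affect the architecture, such as plugin modules and post-processing methods, which raise the inference cost slightly for a significant gain in accuracy.
- There are many choices for BoF and BoS. In the end, the authors selected the following BoF and BoS based on performance: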
3.1. BoF for Backbone
3.2. BoS for Backbone
- Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC).
3.3. BoF for Detector
- CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler, Optimal hyperparameters, Random training shapes.
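Of the items listed above, the cosine annealing scheduler is simple to write down. The sketch below uses the common single-cycle cosine formula without warm restarts; the exact schedule and hyperparameters used for YOLOv4 are not given in this review, so the values here are placeholders.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder values, for illustration only.
for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_annealing_lr(step, 1000, lr_max=0.01), 6))
```

The rate falls slowly at first, fastest in the middle of training, and slowly again near the end.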
3.4. BoS for Detector
4. Ablation Study
4.1. Influence of Different Features on Classifier Training (ImageNet)
- Two classifiers, CSPResNeXt and CSPDarknet, are tried.
- The proposed BoF-backbone (Bag of Freebies) for classifier training includes CutMix and Mosaic data augmentation and class label smoothing.
- Mish activation acts as a complementary option.
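Class label smoothing, mentioned above, can be sketched in a few lines. This is the usual uniform-smoothing formula; the smoothing factor of 0.1 is a placeholder, not a value taken from the paper.

```python
def smooth_labels(one_hot, eps=0.1):
    """Move a fraction eps of the probability mass to a uniform distribution."""
    k = len(one_hot)
    return [y * (1.0 - eps) + eps / k for y in one_hot]


print(smooth_labels([0, 0, 1, 0]))  # roughly [0.025, 0.025, 0.925, 0.025]
```

The target for the true class stays the largest, but the classifier is no longer pushed towards fully saturated outputs.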
4.2. Influence of BoF on Detector Training (MS COCO)
- The BoF list is significantly expanded through studying different features that increase the detector accuracy without affecting FPS:
- S, Eliminate grid sensitivity: YOLOv3 computes the box centre as bx = σ(tx) + cx, by = σ(ty) + cy, where cx and cy are always whole numbers. Extremely high absolute tx values are therefore required for bx to approach cx or cx + 1. To solve this, the sigmoid is multiplied by a factor exceeding 1.0, which eliminates the effect of the grid on which the object is undetectable.
- M, Mosaic data augmentation: Using the 4-image mosaic during training instead of single image.
- IT, IoU threshold: Using multiple anchors for a single ground truth, namely every anchor with IoU(truth, anchor) > IoU threshold.
- GA, Genetic algorithms: Selecting the optimal hyperparameters during network training on the first 10% of time periods.
- LS, Class label smoothing: Using class label smoothing for sigmoid activation.
- CBN, CmBN: using Cross mini-Batch Normalization for collecting statistics inside the entire batch, instead of collecting statistics inside a single mini-batch.
- CA, Cosine annealing scheduler: Altering the learning rate along a sinusoid during training.
- DM, Dynamic mini-batch size: Automatic increase of mini-batch size during small resolution training by using Random training shapes.
- OA, Optimized Anchors: Using the optimized anchors for training with the 512×512 network resolution.
- GIoU, CIoU, DIoU, MSE: Different loss algorithms for bounding box regression.
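The grid sensitivity item (S) in the list above can be checked numerically. The sketch compares the YOLOv3 decoding with a scaled sigmoid. The re-centring offset of 0.5 * (scale - 1) and the scale value of 1.1 are assumptions for illustration; the review only states that the sigmoid is multiplied by a factor exceeding 1.0.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_yolov3(t, c):
    """YOLOv3 style: b = sigmoid(t) + c, which can never reach c or c + 1."""
    return sigmoid(t) + c


def decode_scaled(t, c, scale=1.1):
    """Scaled sigmoid, shifted so that t = 0 still maps to the cell centre."""
    return scale * sigmoid(t) - 0.5 * (scale - 1.0) + c


c = 3  # index of the grid cell
for t in (0.0, 4.0, 6.0):
    print(t, decode_yolov3(t, c), decode_scaled(t, c))
```

With the plain sigmoid the decoded centre only approaches the cell border (c + 1 = 4) as t grows without bound, whereas the scaled version reaches it at a moderate value of t.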
Finally, S+M+IT+GA+OA+GIoU/CIoU obtains the highest accuracy.
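For reference, the CIoU loss mentioned here can be sketched as follows, using the published CIoU definition (IoU, a normalised centre-distance term, and an aspect-ratio term). Boxes are (x_min, y_min, x_max, y_max) with positive width and height; the small `eps` guard is an implementation convenience added here, not part of the definition.

```python
import math


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def ciou_loss(pred, truth, eps=1e-9):
    """CIoU loss = 1 - IoU + rho^2 / c^2 + alpha * v."""
    i = iou(pred, truth)
    # Squared distance between the two box centres.
    rho2 = (((pred[0] + pred[2]) - (truth[0] + truth[2])) / 2.0) ** 2 \
         + (((pred[1] + pred[3]) - (truth[1] + truth[3])) / 2.0) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(pred[2], truth[2]) - min(pred[0], truth[0])
    ch = max(pred[3], truth[3]) - min(pred[1], truth[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan((truth[2] - truth[0]) / (truth[3] - truth[1]))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    ) ** 2
    alpha = v / ((1.0 - i) + v + eps)
    return 1.0 - i + rho2 / c2 + alpha * v


truth = (0.0, 0.0, 1.0, 1.0)
print(ciou_loss(truth, truth))                  # identical boxes
print(ciou_loss((2.0, 0.0, 3.0, 1.0), truth))   # no overlap, close
print(ciou_loss((5.0, 0.0, 6.0, 1.0), truth))   # no overlap, far away
```

Unlike a plain IoU loss, the value keeps growing as a non-overlapping prediction moves further from the ground truth, so it still provides a useful training signal.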
4.3. Influence of BoS on Detector Training (MS COCO)
4.4. Influence of Different Backbones and Pretrained Weightings on Detector Training
- The model characterized with the best classification accuracy is not always the best in terms of the detector accuracy.
- Although the classification accuracy of CSPResNeXt50 models trained with different features is higher than that of CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in terms of object detection.
- Using BoF and Mish for CSPResNeXt50 classifier training increases its classification accuracy, but further application of these pre-trained weightings for detector training reduces the detector accuracy.
- Using BoF and Mish for CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector that uses the classifier's pre-trained weightings.
4.5. Influence of Different Minibatch Size on Detector Training
- After adding BoF and BoS training strategies, the mini-batch size has almost no effect on the detector’s performance.
This result shows that after the introduction of BoF and BoS, it is no longer necessary to use expensive GPUs for training. In other words, anyone can use only a conventional GPU to train an excellent detector.
5. SOTA Comparison
- As seen above, across different types of GPU, the YOLOv4 results lie on the Pareto optimality curve and are superior to the fastest and most accurate detectors in terms of both speed and accuracy.
- The above table lists the frame rate comparison results on Maxwell GPUs. (There are also tables for Pascal and Volta GPUs. Please feel free to read the paper.)
- Real-time detectors with FPS 30 or higher are highlighted in blue.
YOLOv4 obtains the highest AP while achieving real-time FPS.
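For readers unfamiliar with the term, a detector lies on the Pareto optimality curve if no other detector is at least as fast and at least as accurate while being strictly better in one of the two. The sketch below computes that set for a list of (name, FPS, AP) entries. The numbers are made up purely for illustration and are NOT results from the paper.

```python
def pareto_front(models):
    """Return the names of models that no other model dominates.

    models: list of (name, fps, ap) tuples; higher is better for both metrics.
    """
    front = []
    for name, fps, ap in models:
        dominated = any(
            f >= fps and a >= ap and (f > fps or a > ap)
            for _, f, a in models
        )
        if not dominated:
            front.append(name)
    return front


# Made-up entries, for illustration only (not results from the paper).
models = [("A", 30, 40.0), ("B", 60, 38.0), ("C", 25, 39.0), ("D", 60, 35.0)]
print(pareto_front(models))  # ['A', 'B']
```

Here C is dominated by A (slower and less accurate) and D by B (same speed, less accurate), so only A and B remain on the front.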
The authors verified a large number of features and selected those that improve the accuracy of both the classifier and the detector. These features can serve as best practice for future studies and developments.
Later on, the authors published Scaled-YOLOv4 in CVPR 2021. I hope to review it in the future. (Though there are already many articles about it on the Internet.)
[2020 arXiv] [YOLOv4]
YOLOv4: Optimal Speed and Accuracy of Object Detection
[Author’s Medium Story about YOLOv4]
YOLOv4 — the most accurate real-time neural network on MS COCO dataset
2018: [YOLOv3] [Cascade R-CNN] [MegDet] [StairNet] [RefineDet] [CornerNet]
2019: [DCNv2] [Rethinking ImageNet Pre-training] [GRF-DSOD & GRF-SSD] [CenterNet] [Grid R-CNN] [NAS-FPN] [ASFF] [Bag of Freebies]
2020: [EfficientDet] [YOLOv4]