Review — VoVNet/OSANet: An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection
OSA Module, Better Module Design Than Dense Block in DenseNet, Outperforms Pelee, DenseNet, ResNet Backbones
In this story, An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection (VoVNet/OSANet), by ETRI and SS C&C, is reviewed. In this paper:
- The one-shot aggregation (OSA) module is designed, which is more efficient than the dense block in DenseNet.
- By cascading OSA modules, an efficient object detection backbone, VoVNet, is formed.
- It is also named OSANet and is further discussed in Scaled-YOLOv4.
This is a paper in 2019 CVPR Workshop. (Sik-Ho Tsang @ Medium)
Outline
- One-Shot Aggregation (OSA) Module in VoVNet
- VoVNet: Network Architecture
- Experimental Results
1. One-Shot Aggregation (OSA) Module in VoVNet
1.1. (a) Dense Block in DenseNet
- Reducing FLOPs and model size does not always guarantee a reduction in GPU inference time and real energy consumption. Memory access cost (MAC) is therefore considered. The MAC of each convolutional layer is:

MAC = hw(ci + co) + k²cico

- where k, h, w, ci, and co denote the kernel size, the height/width of the input and output response, the channel size of the input, and that of the output response, respectively.
Dense connections induce a high memory access cost (MAC), which is paid in energy and time.
The dense connection also imposes the use of a bottleneck structure, which harms the efficiency of GPU parallel computation.
Moreover, dense connections make later intermediate layers produce features that are better but also similar to those from earlier layers. In this case, the final layer is not required to learn to aggregate both, because they represent redundant information.
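Given the MAC formula, the cost of a single convolutional layer can be computed directly. A minimal sketch (the 3×3 / 56×56 / 64-channel numbers here are illustrative, not taken from the paper):

```python
def conv_mac(k, h, w, c_in, c_out):
    """Memory access cost of one k x k convolution, per the paper's
    formula: MAC = h*w*(c_in + c_out) + k*k*c_in*c_out
    (input feature-map reads + output writes + kernel weight reads)."""
    return h * w * (c_in + c_out) + k * k * c_in * c_out

# Example: a 3x3 conv on a 56x56 map, 64 -> 64 channels.
print(conv_mac(3, 56, 56, 64, 64))  # 438272
```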
1.2. (b) Proposed OSA Module in VoVNet
The one-shot aggregation (OSA) module is designed to aggregate its features in the last layer at once, as shown above.
- It has much lower MAC than the dense block. Substituting the dense block of DenseNet-40 with an OSA module of 5 layers with 43 channels reduces MAC from 3.7M to 2.5M.
- Also, OSA improves GPU computation efficiency. The input sizes of the intermediate layers of an OSA module are constant; hence, it is unnecessary to adopt an additional 1×1 conv bottleneck to reduce dimensions. This means it consists of fewer layers.
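The contrast between the two connectivity patterns can be sketched with the MAC formula: in a dense block, the input channel count of each layer grows with depth, while in an OSA module it stays constant. A rough sketch (the layer counts and channel sizes are illustrative, and the 1×1 bottleneck/concat convolutions are omitted for simplicity):

```python
def conv_mac(k, h, w, c_in, c_out):
    # MAC = h*w*(c_in + c_out) + k*k*c_in*c_out (the paper's formula)
    return h * w * (c_in + c_out) + k * k * c_in * c_out

def dense_block_mac(n_layers, growth, c0, h, w, k=3):
    """Dense connectivity: layer i concatenates all previous outputs,
    so it sees c0 + i*growth input channels -- MAC keeps growing."""
    return sum(conv_mac(k, h, w, c0 + i * growth, growth)
               for i in range(n_layers))

def osa_module_mac(n_layers, ch, c0, h, w, k=3):
    """One-shot aggregation: each intermediate layer only takes the
    previous layer's output (constant channel size); all features are
    concatenated just once at the end."""
    mac = conv_mac(k, h, w, c0, ch)  # first layer
    mac += sum(conv_mac(k, h, w, ch, ch) for _ in range(n_layers - 1))
    return mac

# 5 layers, 43 channels each, on a 28x28 map (illustrative sizes).
print(dense_block_mac(5, 43, 43, 28, 28))
print(osa_module_mac(5, 43, 43, 28, 28))
```

With identical layer and channel counts, every dense layer past the first reads strictly more input channels than its OSA counterpart, so the dense block's total MAC is always higher.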
2. VoVNet: Network Architecture
There are two types of VoVNet: a lightweight network, e.g., VoVNet-27-slim, and large-scale networks, e.g., VoVNet-39/57.
- VoVNet consists of a stem block, including 3 convolution layers, and 4 stages of OSA modules, with an output stride of 32.
- Each OSA module comprises 5 convolution layers with the same input/output channel size to minimize MAC, as mentioned above.
- Whenever the stage goes up, the feature map is downsampled by 3×3 max pooling with stride 2.
- VoVNet-39/57 have more OSA modules at the 4th and 5th stages, where downsampling is done in the last module.
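How the output stride of 32 arises can be traced through the stem and the stage transitions. A sketch, assuming the stem's first and third 3×3 convolutions use stride 2 (the exact stride placement within the 3-conv stem is an assumption here):

```python
def vovnet_output_stride(input_size=224):
    """Trace VoVNet's spatial downsampling (illustrative sketch;
    channel counts omitted). Stem: 3 convs, two of them strided.
    Then 3x3 max pooling with stride 2 at each stage transition."""
    size = input_size
    size //= 2            # stem conv 1, stride 2 (assumed)
    size //= 1            # stem conv 2, stride 1 (assumed)
    size //= 2            # stem conv 3, stride 2 (assumed) -> /4
    for _ in range(3):    # transitions into stages 3, 4, 5
        size //= 2        # 3x3 max pool, stride 2
    return input_size // size  # overall output stride

print(vovnet_output_stride())  # 32
```

Two strided stem convolutions (÷4) followed by three stride-2 poolings (÷8) give the overall ÷32 stated in the paper.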
3. Experimental Results
3.1. Lightweight Models
- In the accuracy-versus-speed comparisons, VoVNet always appears in the corner of better performance and efficiency.
- DSOD is used as the detector network with VoVNet as backbone.
The proposed VoVNet-27-slim based DSOD300 achieves 74.87% mAP, which is better than the DenseNet-67 based one, even with comparable parameters.
- In addition to accuracy, VoVNet-27-slim also runs about two times faster than its counterpart with comparable FLOPs.
- Pelee has a similar inference speed to the DenseNet-67 based DSOD. It is conjectured that decomposing a dense block into smaller fragmented layers deteriorates GPU computing parallelism.
The VoVNet-27-slim based DSOD also outperforms Pelee by a large margin of 3.97%, at a much faster speed.
3.2. Large-Scale Models
- The generalization to large-scale VoVNets, e.g., VoVNet-39/57, in RefineDet is validated.
- It is found that VoVNet and DenseNet obtain higher AP than ResNet on small and medium objects.
Furthermore, VoVNet obtains 1.9%/1.2% small-object AP gains over DenseNet-121/161, which suggests that generating more features by OSA is better for small-object detection than generating deep features by dense connections.
3.3. Mask R-CNN from Scratch
- VoVNet backbone is applied to Mask R-CNN with Group Norm (GN).
- DenseNet cannot be trained even on 32GB V100 GPUs, due to its extreme memory footprint and the larger input size of Mask R-CNN.
For the object detection task, with faster speed, VoVNet-39 obtains 2.2%/0.9% absolute AP gains over ResNet-50/101, respectively.
For the instance segmentation task, VoVNet-39 also improves AP by 1.6%/0.4% over ResNet-50/101.
Reference
[2019 CVPRW] [VoVNet/OSANet]
An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection
Object Detection
2014-2017: …
2018: [YOLOv3] [Cascade R-CNN] [MegDet] [StairNet] [RefineDet] [CornerNet] [Pelee & PeleeNet]
2019: [DCNv2] [Rethinking ImageNet Pre-training] [GRF-DSOD & GRF-SSD] [CenterNet] [Grid R-CNN] [NAS-FPN] [ASFF] [Bag of Freebies] [VoVNet/OSANet]
2020: [EfficientDet] [CSPNet] [YOLOv4] [SpineNet]