Review — VoVNet/OSANet: An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection

OSA Module, Better Module Design Than Dense Block in DenseNet, Outperforms Pelee, DenseNet, ResNet Backbones

Sik-Ho Tsang
5 min read · Aug 31, 2021

In this story, An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection, (VoVNet/OSANet), by ETRI, SS C&C, is reviewed. In this paper:

  • A one-shot aggregation (OSA) module is designed, which is more efficient than the Dense Block in DenseNet.
  • By cascading OSA modules, an efficient backbone network for object detection, VoVNet, is formed.
  • It is also called OSANet and is further discussed in Scaled-YOLOv4.

This is a paper in 2019 CVPR Workshop. (Sik-Ho Tsang @ Medium)

Outline

  1. One-Shot Aggregation (OSA) Module in VoVNet
  2. VoVNet: Network Architecture
  3. Experimental Results

1. One-Shot Aggregation (OSA) Module in VoVNet

Aggregation Methods

1.1. (a) Dense Block in DenseNet

  • Reducing FLOPs and model size does not always guarantee a reduction in GPU inference time or real energy consumption, so memory access cost (MAC) is also considered. The MAC of each convolutional layer is:

    MAC = hw(c_i + c_o) + k^2 c_i c_o

  • where k, h, w, c_i, and c_o denote the kernel size, the height and width of the input and output response, the input channel size, and the output channel size, respectively.
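To make the per-layer cost concrete, the formula can be evaluated directly. The sketch below assumes the form MAC = hw(c_i + c_o) + k^2 c_i c_o given in the paper; the function name conv_mac and the layer sizes are hypothetical, chosen only for illustration.

```python
def conv_mac(k, h, w, c_in, c_out):
    """Memory access cost of one conv layer: hw(c_i + c_o) + k^2 * c_i * c_o.

    The first term counts the input and output feature maps,
    the second term counts the kernel weights.
    """
    return h * w * (c_in + c_out) + k * k * c_in * c_out


# Two hypothetical 3x3 layers on a 32x32 map with the same c_in * c_out
# product, so the weight term (and the FLOPs) are identical.
balanced = conv_mac(k=3, h=32, w=32, c_in=64, c_out=64)
unbalanced = conv_mac(k=3, h=32, w=32, c_in=16, c_out=256)
print(balanced, unbalanced)
```

With the product c_i × c_o held fixed, the layer with equal input and output channels has the smaller MAC, since only the feature-map term hw(c_i + c_o) changes.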

Dense connections induce high memory access cost (MAC) which is paid by energy and time.

The dense connection imposes the use of bottleneck structure which harms the efficiency of GPU parallel computation.

Also, dense connections make later intermediate layers produce features that are better than, but also similar to, the features from earlier layers. In this case, the final layer does not need to learn to aggregate both sets of features, because they represent redundant information.

1.2. (b) Proposed OSA Module in VoVNet

The one-shot aggregation (OSA) module is designed to aggregate its features only once, in the last layer, as shown above.

  • It has much lower MAC than the dense block. Replacing the dense block of DenseNet-40 with an OSA module of 5 layers and 43 channels reduces MAC from 3.7M to 2.5M.
  • OSA also improves GPU computation efficiency. The input sizes of the intermediate layers of an OSA module are constant, so no additional 1×1 conv bottleneck is needed to reduce dimensions. This means the module consists of fewer layers.
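The difference in input widths can be sketched with simple channel bookkeeping. This is a minimal illustration, not the paper's accounting: it ignores the 1×1 bottlenecks and transition layers of DenseNet, assumes the OSA aggregation is a single 1×1 conv over the concatenation of the block input and all layer outputs, and uses hypothetical function names. It is not meant to reproduce the 3.7M and 2.5M figures above.

```python
def conv_mac(k, h, w, c_in, c_out):
    # Per-layer memory access cost: hw(c_i + c_o) + k^2 * c_i * c_o.
    return h * w * (c_in + c_out) + k * k * c_in * c_out


def dense_input_channels(c0, growth, n_layers):
    # Dense block: layer i sees the block input plus all i earlier outputs.
    return [c0 + i * growth for i in range(n_layers)]


def osa_input_channels(c0, width, n_layers):
    # OSA: only the first layer sees the block input; the rest see `width`.
    return [c0] + [width] * (n_layers - 1)


def dense_block_mac(c0, growth, n_layers, h, w, k=3):
    # Simplification: 1x1 bottlenecks and transition layers are ignored.
    return sum(conv_mac(k, h, w, c, growth)
               for c in dense_input_channels(c0, growth, n_layers))


def osa_module_mac(c0, width, n_layers, h, w, c_out, k=3):
    convs = sum(conv_mac(k, h, w, c, width)
                for c in osa_input_channels(c0, width, n_layers))
    # Assumption: one 1x1 conv aggregates the block input and every
    # layer output in a single step.
    aggregate = conv_mac(1, h, w, c0 + n_layers * width, c_out)
    return convs + aggregate


print(dense_input_channels(c0=24, growth=12, n_layers=4))  # [24, 36, 48, 60]
print(osa_input_channels(c0=24, width=43, n_layers=5))     # [24, 43, 43, 43, 43]
```

The dense block's input width grows with every layer, while the OSA module's stays fixed after the first layer, which is why no 1×1 bottleneck is needed inside it.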

2. VoVNet: Network Architecture

VoVNet: Network Architecture

There are two types of VoVNet: lightweight network, e.g., VoVNet-27-slim, and large-scale network, e.g., VoVNet-39/57.

  • VoVNet consists of a stem block with 3 convolution layers, followed by 4 stages of OSA modules, with an output stride of 32.
  • An OSA module consists of 5 convolution layers with the same number of input and output channels to minimize MAC, as mentioned above.
  • Whenever the stage goes up, the feature map is downsampled by 3×3 max pooling with stride 2.
  • VoVNet-39/57 have more OSA modules at the 4th and 5th stages, where downsampling is done in the last module.
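The output stride of 32 follows from the downsampling steps. The sketch below assumes the stem block downsamples by a factor of 2 (not stated above) and that each of the 4 OSA stages halves the resolution once through its stride-2 max pooling; the function name is hypothetical.

```python
def output_stride(stem_stride, num_osa_stages, pool_stride=2):
    # Each OSA stage is assumed to downsample once by `pool_stride`.
    return stem_stride * pool_stride ** num_osa_stages


# Assumed stem stride of 2, followed by 4 OSA stages.
print(output_stride(stem_stride=2, num_osa_stages=4))  # 32
```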

3. Experimental Results

3.1. Lightweight Models

Comparisons of lightweight models in terms of computation and energy efficiency

  • VoVNet always appears at the corner with better performance and efficiency.

Comparison with lightweight object detectors on the VOC 2007 test set

  • DSOD is used as the detector network with VoVNet as backbone.

The proposed VoVNet-27-slim based DSOD300 achieves 74.87% mAP, which is better than the DenseNet-67 based one even with a comparable number of parameters.

  • In addition to accuracy, VoVNet-27-slim is also two times faster in inference than its counterpart with comparable FLOPs.
  • Pelee has an inference speed similar to that of DSOD with DenseNet-67. It is conjectured that decomposing a dense block into smaller fragmented layers deteriorates GPU computing parallelism.

The VoVNet-27-slim based DSOD also outperforms Pelee by a large margin of 3.97% at a much faster speed.

3.2. Large-Scale Models

Comparisons of large-scale models on RefineDet320
Comparison backbone networks on RefineDet320 on COCO test-dev set
  • The generalization to large-scale VoVNet, e.g., VoVNet-39/57, in RefineDet is validated.
  • It is found that VoVNet and DenseNet obtain higher AP than ResNet on small and medium objects.

Furthermore, VoVNet improves small object AP by 1.9%/1.2% over DenseNet121/161, which suggests that generating more features with OSA is better than generating deep features with dense connections for small object detection.

3.3. Mask R-CNN from Scratch

Detection and segmentation results using Mask R-CNN with Group Normalization (Group Norm, GN), trained from scratch for a 3× schedule and evaluated on the COCO val set.

For object detection task, with faster speed, VoVNet-39 obtains 2.2%/0.9% absolute AP gains compared to ResNet-50/101, respectively.

For the instance segmentation task, VoVNet-39 also improves AP by 1.6%/0.4% over ResNet-50/101.
