Review — Res2Net: A New Multi-scale Backbone Architecture

Res2Net Enhances the ResNet Bottleneck Block

Multi-scale representations are essential for various vision tasks.
  • Res2Net is proposed, which constructs hierarchical residual-like connections within a single residual block. It represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
  • The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA.


  1. Res2Net
  2. Results

1. Res2Net

1.1. Res2Net Block

Comparison between the bottleneck block and the proposed Res2Net module (the scale dimension s=4).
  • (a) Bottleneck: a basic building block in ResNet.
  • (b) Res2Net: After the 1×1 convolution, the feature maps are evenly split into s feature map subsets, denoted by xi, where i is from 1 to s.
  • Except for x1, each xi has a corresponding 3×3 convolution, denoted by Ki(). The output of Ki() is denoted as yi.
  • For i > 2, the feature subset xi is first added to the previous output yi-1, and the sum is then fed into Ki().
  • The procedures above can be written as:
      y_i = x_i,                  i = 1;
      y_i = K_i(x_i),             i = 2;
      y_i = K_i(x_i + y_{i-1}),   2 < i ≤ s.
  • To better fuse information at different scales, all splits are concatenated and passed through a 1×1 convolution. The split and concatenation strategy can enforce convolutions to process features more effectively.
  • To reduce the number of parameters, the convolution is omitted for the first split, which can also be regarded as a form of feature reuse.
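The hierarchical wiring described above can be sketched in plain Python. This is a minimal sketch: the callables stand in for the paper's 3×3 convolutions, and scalars stand in for the channel-wise feature-map subsets that the real block splits and later concatenates.

```python
def res2net_split_transform(x, transforms):
    """Res2Net hierarchical connections.

    x: list of s feature subsets x_1..x_s (scalars here, feature maps in practice).
    transforms: list of s-1 callables K_2..K_s (stand-ins for 3x3 convolutions).
    Returns the list of outputs y_1..y_s.
    """
    s = len(x)
    ys = [x[0]]                           # y_1 = x_1 (first split skips convolution)
    for i in range(1, s):
        k = transforms[i - 1]
        if i == 1:
            ys.append(k(x[i]))            # y_2 = K_2(x_2)
        else:
            ys.append(k(x[i] + ys[-1]))   # y_i = K_i(x_i + y_{i-1})
    return ys                             # in the real block: concatenate, then 1x1 conv
```

Because each yi passes through all Kj() with j ≤ i, the splits see progressively larger receptive fields, which is the source of the granular multi-scale representation.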

1.2. Integration with Modern Modules

The Res2Net module can be integrated with the cardinality dimension as in ResNeXt (replacing convolution with group convolution) and with SE blocks as in SENet.
  • The dimension cardinality indicates the number of groups within a filter. This dimension changes filters from single-branch to multi-branch.
  • The 3×3 convolution is replaced with the 3×3 group convolution (ResNeXt), where c indicates the number of groups.
  • An SE block (as in SENet) is added right before the residual connection of the Res2Net module.
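The SE placement can be sketched as follows: a channel-wise squeeze-and-excitation recalibration applied to the fused block output, with the residual shortcut added afterward. A minimal NumPy sketch, assuming a (C, H, W) feature map and externally supplied reduction/expansion weights:

```python
import numpy as np

def se_block(feat, w1, w2):
    """Squeeze-and-Excitation recalibration for a (C, H, W) feature map.

    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights.
    """
    z = feat.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
    h = np.maximum(w1 @ z, 0.0)                 # reduce + ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ h)))     # expand + sigmoid -> per-channel gates
    return feat * scale[:, None, None]          # recalibrate channels

# In a Res2Net module this runs on the output of the final 1x1 convolution,
# and the shortcut is added afterward:
#   out = se_block(fused, w1, w2) + shortcut
```

The sigmoid gates lie in (0, 1), so the SE block rescales channels rather than replacing them; the residual addition after the gate keeps the identity path intact.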

2. Results

2.1. ImageNet

Top-1 and Top-5 test error on the ImageNet dataset.
Top-1 and Top-5 test error (%) of deeper networks on the ImageNet dataset.
Top-1 and Top-5 test error (%) of Res2Net-50 with different scales on the ImageNet dataset. Parameter w is the width of filters, and s is the number of scales.

2.2. CIFAR-100

Top-1 test error (%) and model size on the CIFAR-100 dataset. Parameter c indicates the value of cardinality, and w is the width of filters.
Test precision on the CIFAR-100 dataset as a function of model size, obtained by changing cardinality (ResNeXt-29), depth (ResNeXt), and scale (Res2Net-29).
  • For the case of scale s = 2, model capacity is increased only by adding more parameters of 1×1 filters. Thus, the performance at s = 2 is slightly worse than that obtained by increasing cardinality.
  • However, the models with s = 5 or 6 show limited performance gains, possibly because the images in the CIFAR dataset are too small (32×32) to contain many scales.
Visualization of class activation mapping by Grad-CAM, using ResNet-50 and Res2Net-50 as backbone networks.

2.3. Other Visual Recognition Tasks

Other visual recognition tasks
  • Different frameworks are used for different tasks with Res2Net replacing the original blocks in the backbone.
Visualization of semantic segmentation results, using ResNet-101 and Res2Net-101 as backbone networks.
Examples of salient object detection results, using ResNet-50 and Res2Net-50 as backbone networks, respectively.


