Review — Res2Net: A New Multi-scale Backbone Architecture

Res2Net Enhances the ResNet Bottleneck Block

Sik-Ho Tsang
4 min read · Jan 23


Multi-scale representations are essential for various vision tasks.

Res2Net: A New Multi-scale Backbone Architecture,
Res2Net, by Nankai University, UC Merced, and Oxford University,
2021 TPAMI, Over 1300 Citations (Sik-Ho Tsang @ Medium)
Image Classification, ResNet

  • Res2Net is proposed, which constructs hierarchical residual-like connections within one single residual block. It represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
  • The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA.


  1. Res2Net
  2. Results

1. Res2Net

1.1. Res2Net Block

Comparison between the bottleneck block and the proposed Res2Net module (the scale dimension s=4).
  • (a) Bottleneck: a basic building block in ResNet.
  • (b) Res2Net: After the 1×1 convolution, the feature maps are evenly split into s feature map subsets, denoted by xi, where i is from 1 to s.
  • Except for x1, each xi has a corresponding 3×3 convolution, denoted by Ki(). The output of Ki() is denoted as yi.
  • For i > 2, the feature subset xi is added to the output of Ki-1(), i.e., yi-1, before being fed into Ki().
  • Formally, the above procedure is:

y_i = x_i, for i = 1;
y_i = K_i(x_i), for i = 2;
y_i = K_i(x_i + y_{i-1}), for 2 < i ≤ s.

Due to this combinatorial effect, where each split passes through a different number of stacked 3×3 convolutions, the output of the Res2Net module contains a different number and different combinations of receptive field sizes/scales.
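The receptive-field growth can be made concrete with a small calculation. This is a minimal sketch (the function name is my own, not from the paper): each extra stacked 3×3 convolution with stride 1 widens the receptive field by 2 pixels per side, so with s = 4, split y1 passes through zero 3×3 convolutions, y2 through one, y3 through two, and y4 through three.

```python
def receptive_field(num_stacked_3x3: int) -> int:
    """Receptive field side length after stacking 3x3 convs (stride 1)."""
    rf = 1
    for _ in range(num_stacked_3x3):
        rf += 2  # each 3x3 conv extends the field by 1 pixel on each side
    return rf

# For scale s = 4: y_1 sees no 3x3 conv, y_2 one, y_3 two, y_4 three,
# so the concatenated output mixes 1x1-, 3x3-, 5x5-, and 7x7-sized fields.
print([receptive_field(i) for i in range(4)])  # → [1, 3, 5, 7]
```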

  • To better fuse information at different scales, all splits are concatenated and passed through a 1×1 convolution. The split and concatenation strategy can enforce convolutions to process features more effectively.
  • To reduce the number of parameters, the convolution is omitted for the first split, which can also be regarded as a form of feature reuse.
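The split, hierarchical-addition, and concatenation steps above can be sketched in PyTorch. This is a simplified illustration, not the authors' reference implementation: stride, downsampling, and the exact batch-norm placement are omitted, and the class name and `width` parameter are my own.

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    """Sketch of a Res2Net bottleneck block (identity shortcut, stride 1)."""

    def __init__(self, in_channels: int, width: int, scale: int = 4):
        super().__init__()
        self.scale = scale
        mid = width * scale  # channels after the first 1x1 conv
        self.conv1 = nn.Conv2d(in_channels, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        # One 3x3 conv K_i per split, except the first split (feature reuse).
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False)
            for _ in range(scale - 1))
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(width) for _ in range(scale - 1))
        self.conv3 = nn.Conv2d(mid, in_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        splits = torch.chunk(out, self.scale, dim=1)  # x_1 .. x_s
        ys = [splits[0]]                              # y_1 = x_1, no conv
        for i in range(1, self.scale):
            # y_2 = K_2(x_2); y_i = K_i(x_i + y_{i-1}) for i > 2
            inp = splits[i] if i == 1 else splits[i] + ys[-1]
            ys.append(self.relu(self.bns[i - 1](self.convs[i - 1](inp))))
        out = torch.cat(ys, dim=1)                    # fuse all splits
        out = self.bn3(self.conv3(out))               # final 1x1 conv
        return self.relu(out + x)                     # residual connection
```

A quick shape check: `Res2NetBottleneck(64, 16, scale=4)` maps a `(N, 64, H, W)` tensor to the same shape, so the block is a drop-in replacement for a ResNet bottleneck of matching width.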

1.2. Integration with Modern Modules

The Res2Net module can be integrated with the dimension cardinality as in ResNeXt (replace conv with group conv) and SE blocks as in SENet.
  • The dimension cardinality indicates the number of groups within a filter. This dimension changes filters from single-branch to multi-branch.
  • The 3×3 convolution is replaced with the 3×3 group convolution (ResNeXt), where c indicates the number of groups.
  • SE block (SENet) is added right before the residual connections of the Res2Net module.
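Both integrations are small, local changes. The sketch below shows a generic SE block (the reduction ratio of 16 is the SENet default, and the class name is my own) and the grouped-convolution swap that adds cardinality; it is an illustration of the idea, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block, as added before the residual
    connection of the Res2Net module."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())                                  # excitation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # rescale each channel by its learned gate

# Cardinality: replace each per-split 3x3 conv with a grouped conv,
# turning the single-branch filter into a multi-branch one.
grouped_conv = nn.Conv2d(32, 32, 3, padding=1, groups=4, bias=False)
```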

2. Results

2.1. ImageNet

Top-1 and Top-5 test error (%) on the ImageNet dataset.

Compared with these strong baselines, models integrated with the Res2Net module still have consistent performance gains.

Top-1 and Top-5 test error (%) of deeper networks on the ImageNet dataset.

The proposed module with additional dimension scale can be integrated with deeper models to achieve better performance.

Top-1 and Top-5 test error (%) of Res2Net-50 with different scales on the ImageNet dataset. Parameter w is the width of filters, and s is the number of scales.

Performance increases as the scale s increases.

2.2. CIFAR-100

Top-1 test error (%) and model size on the CIFAR-100 dataset. Parameter c indicates the value of cardinality, and w is the width of filters.

The proposed method surpasses the baseline and other methods with fewer parameters.

When the recently proposed SE block is integrated into the structure, the proposed method still outperforms the ResNeXt-29, 8c64w-SE baseline with fewer parameters.

Test precision on the CIFAR-100 dataset with regard to the model size, by changing cardinality (ResNeXt-29), depth (ResNeXt), and scale (Res2Net-29).
  • For the case of scale s = 2, model capacity is increased only by adding more parameters to the 1×1 filters. Thus, the performance at s = 2 is slightly worse than that of increasing cardinality.

For s = 3, 4, the combination effects of the hierarchical residual-like structure produce a rich set of equivalent scales, resulting in significant performance gains.

  • However, the models with s = 5, 6 show limited additional gains, possibly because the images in the CIFAR dataset are too small (32×32) to contain many scales.

Visualization of class activation mapping by Grad-CAM, using ResNet-50 and Res2Net-50 as backbone networks.

Due to its stronger multi-scale ability, Res2Net produces activation maps that tend to cover whole large objects such as ‘bulbul’, ‘mountain dog’, ‘ballpoint’, and ‘mosque’, while the activation maps of ResNet cover only parts of the objects.

2.3. Other Visual Recognition Tasks

Other visual recognition tasks
  • Different frameworks are used for different tasks with Res2Net replacing the original blocks in the backbone.

The Res2Net-based model shows consistent improvements over its counterparts on all datasets.

Visualization of semantic segmentation results, using ResNet-101 and Res2Net-101 as backbone networks.

The Res2Net-based method tends to segment all parts of objects regardless of object size.

Examples of salient object detection results, using ResNet-50 and Res2Net-50 as backbone networks, respectively.

Some visual comparisons of salient object detection results on challenging examples are illustrated in the above figure.

