Review — Res2Net: A New Multi-scale Backbone Architecture

Res2Net Enhances the ResNet Bottleneck Block

Sik-Ho Tsang
4 min read · Jan 23, 2023
Multi-scale representations are essential for various vision tasks.

Res2Net: A New Multi-scale Backbone Architecture,
Res2Net, by Nankai University, UC Merced, and Oxford University,
2021 TPAMI, Over 1300 Citations (Sik-Ho Tsang @ Medium)
Image Classification, ResNet

  • Res2Net is proposed, which constructs hierarchical residual-like connections within one single residual block. It represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
  • The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA.

Outline

  1. Res2Net
  2. Results

1. Res2Net

1.1. Res2Net Block

Comparison between the bottleneck block and the proposed Res2Net module (the scale dimension s=4).
  • (a) Bottleneck: a basic building block in ResNet.
  • (b) Res2Net: After the 1×1 convolution, the feature maps are evenly split into s feature map subsets, denoted by xi, where i is from 1 to s.
  • Except for x1, each xi has a corresponding 3×3 convolution, denoted by Ki(). The output of Ki() is denoted as yi.
  • The feature subset xi is added to the output of Ki−1(), and the sum is then fed into Ki().
  • The equations below summarize this procedure:

yi = xi, for i = 1;
yi = Ki(xi), for i = 2;
yi = Ki(xi + yi−1), for 2 < i ≤ s.

Due to the combinatorial explosion effect, the output of the Res2Net module contains features with different numbers and combinations of receptive field sizes/scales.

  • To better fuse information at different scales, all splits are concatenated and passed through a 1×1 convolution. The split-and-concatenation strategy forces the convolutions to process features more effectively.
  • To reduce the number of parameters, the convolution is omitted for the first split, which can also be regarded as a form of feature reuse (see the sketch after this list).
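
To make the dataflow concrete, here is a minimal PyTorch sketch of the Res2Net bottleneck (my own illustration, not the official code; BatchNorm, ReLU, and the usual channel expansion are omitted for brevity):

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    """Minimal Res2Net bottleneck sketch (BN/ReLU/expansion omitted)."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.scale = scale
        width = channels // scale
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # One 3x3 convolution Ki per split, except the first split (feature reuse).
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1)
        )
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        # Evenly split the feature maps into s subsets x1..xs along channels.
        xs = torch.chunk(out, self.scale, dim=1)
        ys = [xs[0]]  # y1 = x1 (no convolution)
        for i, conv in enumerate(self.convs, start=1):
            # y2 = K2(x2); yi = Ki(xi + y(i-1)) for i > 2
            inp = xs[i] if i == 1 else xs[i] + ys[-1]
            ys.append(conv(inp))
        out = self.conv3(torch.cat(ys, dim=1))  # concatenate, then 1x1 fusion
        return out + identity                   # residual connection

# Quick shape check:
block = Res2NetBottleneck(channels=256, scale=4)
y = block(torch.randn(1, 256, 56, 56))
print(y.shape)  # torch.Size([1, 256, 56, 56])
```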

1.2. Integration with Modern Modules

The Res2Net module can be integrated with the cardinality dimension, as in ResNeXt (replacing convolutions with group convolutions), and with SE blocks, as in SENet.
  • The dimension cardinality indicates the number of groups within a filter. This dimension changes filters from single-branch to multi-branch.
  • The 3×3 convolution is replaced with the 3×3 group convolution (ResNeXt), where c indicates the number of groups.
  • The SE block (SENet) is added right before the residual connection of the Res2Net module (a minimal sketch follows).
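
A minimal sketch of these two modifications (my own illustration; the SEBlock below follows the standard SENet design, and the integration points are assumptions based on the description above):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation gating, as in SENet (minimal sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = x.mean(dim=(2, 3))                    # squeeze: global average pooling
        w = self.fc(w).view(x.size(0), -1, 1, 1)  # excitation: per-channel weights
        return x * w                              # reweight the feature maps

# Relative to the Res2Net block sketch above, the two changes would be:
#   1) cardinality c: give each 3x3 conv groups=c, i.e.
#      nn.Conv2d(width, width, 3, padding=1, groups=c, bias=False)
#   2) SE before the residual: out = se(out); return out + identity
```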

2. Results

2.1. ImageNet

Top-1 and Top-5 test error on the ImageNet dataset.

Compared with these strong baselines, models integrated with the Res2Net module still have consistent performance gains.

Top-1 and Top-5 test error (%) of deeper networks on the ImageNet dataset.

The proposed module, with the additional scale dimension, can be integrated into deeper models to achieve better performance.

Top-1 and Top-5 test error (%) of Res2Net-50 with different scales on the ImageNet dataset. Parameter w is the width of filters, and s is the number of scales.

Performance increases as the scale s increases.
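
As a back-of-the-envelope check (my own arithmetic, not a table from the paper), the width/scale pairs evaluated above keep the per-block parameter count roughly constant, so the gains come from the added scales rather than extra capacity; the 256-channel setting here is an assumption for illustration:

```python
# Approximate parameter count of one Res2Net bottleneck for the (w, s)
# pairs evaluated above, assuming 256 input/output channels.
C = 256
for w, s in [(48, 2), (26, 4), (14, 8)]:
    conv1x1 = C * (w * s) * 2      # 1x1 convs into and out of the block
    conv3x3 = (s - 1) * 9 * w * w  # one 3x3 conv per split except the first
    print(f"{w}w x {s}s: ~{conv1x1 + conv3x3:,} params")
# 48w x 2s: ~69,888; 26w x 4s: ~71,500; 14w x 8s: ~69,692
```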

2.2. CIFAR-100

Top-1 test error (%) and model size on the CIFAR-100 dataset. Parameter c indicates the value of cardinality, and w is the width of filters.

The proposed method surpasses the baseline and other methods with fewer parameters.

When the recently proposed SE block is integrated into the structure, the proposed method still outperforms the ResNeXt-29, 8c×64w-SE baseline with fewer parameters.

Test precision on the CIFAR-100 dataset with regard to the model size, by changing cardinality (ResNeXt-29), depth (ResNeXt), and scale (Res2Net-29).
  • For the case of scale s = 2, model capacity is increased only by adding more parameters of 1×1 filters. Thus, the performance at s = 2 is slightly worse than that of increasing the cardinality.

For s = 3, 4, the combinatorial effects of the hierarchical residual-like structure produce a rich set of equivalent scales, resulting in significant performance gains.

  • However, the models with s = 5, 6 show limited additional gains, possibly because images in the CIFAR dataset are too small (32×32) to contain many scales.

Visualization of class activation mapping by Grad-CAM, using ResNet-50 and Res2Net-50 as backbone networks.

Thanks to its stronger multi-scale ability, Res2Net produces activation maps that tend to cover whole large objects such as ‘bulbul’, ‘mountain dog’, ‘ballpoint’, and ‘mosque’, while the activation maps of ResNet only cover parts of the objects.
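
For reference, Grad-CAM itself is easy to sketch in PyTorch (a minimal version of the general technique, not the authors' visualization code): it weights a target layer's activations by the spatial average of their gradients with respect to the class score.

```python
import torch

def grad_cam(model, target_layer, x, class_idx):
    """Minimal Grad-CAM sketch: returns a normalized heat map for one image."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    model(x)[0, class_idx].backward()  # gradient of the chosen class score
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # pooled gradients per channel
    cam = torch.relu((weights * acts[0]).sum(dim=1))   # weighted sum of activations
    return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]
```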

2.3. Other Visual Recognition Tasks

Other visual recognition tasks
  • Different frameworks are used for different tasks, with Res2Net replacing the original blocks in the backbone.

Res2Net-based models show consistent improvements over their counterparts on all datasets.
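
To try a Res2Net backbone directly, pretrained weights are available in common model zoos, e.g. via timm (assuming the package and this model name are available in your environment):

```python
import timm
import torch

# Load a pretrained Res2Net-50 (26w x 4s variant) from timm.
model = timm.create_model("res2net50_26w_4s", pretrained=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```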

Visualization of semantic segmentation results, using ResNet-101 and Res2Net-101 as backbone networks.

The Res2Net-based method tends to segment all parts of objects regardless of object size.

Examples of salient object detection results, using ResNet-50 and Res2Net-50 as backbone networks, respectively.

Some visual comparisons of salient object detection results on challenging examples are illustrated in the above figure.
