Review — Res2Net: A New Multi-scale Backbone Architecture
- Res2Net is proposed, which constructs hierarchical residual-like connections within one single residual block. It represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
- The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA.
1.1. Res2Net Block
- (a) Bottleneck: a basic building block in ResNet.
- (b) Res2Net: After the 1×1 convolution, the feature maps are evenly split into s feature map subsets, denoted by xi, where i is from 1 to s.
- Except for x1, each xi has a corresponding 3×3 convolution, denoted by Ki(). The output of Ki() is denoted as yi.
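- The feature subset xi is added to the output of Ki-1(), i.e., yi-1, and the sum is then fed into Ki().
- The equations for the above procedure are:
$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}$$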
Due to the combinatorial explosion effect, the output of the Res2Net module contains a different number and different combination of receptive field sizes/scales.
- To better fuse information at different scales, all splits are concatenated and passed through a 1×1 convolution. The split and concatenation strategy can enforce convolutions to process features more effectively.
- To reduce the number of parameters, the convolution is omitted for the first split, which can also be regarded as a form of feature reuse.
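To make the data flow in this block concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes toy 1-D feature maps stored as lists, the names `split_evenly`, `add_groups`, and `res2net_flow` are hypothetical, each Ki is passed in as an arbitrary callable standing in for the 3×3 convolution, and the final 1×1 convolution after concatenation is left out.

```python
from typing import Callable, List, Sequence

Channel = List[float]   # one toy 1-D feature map
Group = List[Channel]   # a channel subset x_i


def split_evenly(channels: Sequence[Channel], s: int) -> List[Group]:
    """Split the channel list into s equal-width subsets x_1..x_s."""
    if s < 1 or len(channels) % s != 0:
        raise ValueError("channel count must be divisible by s")
    w = len(channels) // s
    return [list(channels[i * w:(i + 1) * w]) for i in range(s)]


def add_groups(a: Group, b: Group) -> Group:
    """Element-wise sum of two groups with the same shape."""
    return [[u + v for u, v in zip(ca, cb)] for ca, cb in zip(a, b)]


def res2net_flow(
    channels: Sequence[Channel],
    s: int,
    convs: Sequence[Callable[[Group], Group]],
) -> Group:
    """Hierarchical flow: convs[0] plays K_2, convs[1] plays K_3, and so on."""
    if len(convs) != s - 1:
        raise ValueError("need exactly s - 1 convolutions (none for x_1)")
    xs = split_evenly(channels, s)
    ys = [xs[0]]                                   # y_1 = x_1, no convolution
    for i in range(1, s):
        # x_2 goes straight into K_2; later splits first add y_{i-1}
        inp = xs[i] if i == 1 else add_groups(xs[i], ys[-1])
        ys.append(convs[i - 1](inp))
    return [ch for y in ys for ch in y]            # concatenate all splits


def double(group: Group) -> Group:
    """Toy stand-in for a 3x3 convolution: scales every value by 2."""
    return [[2 * v for v in ch] for ch in group]


toy_input = [[1.0], [2.0], [3.0], [4.0]]           # 4 channels, s = 4, w = 1
print(res2net_flow(toy_input, 4, [double, double, double]))
```

With the doubling stand-in, the four output splits are 1, 4, 14, and 36: the last split has passed through three stacked "convolutions" of earlier information, while the first split is reused unchanged.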
1.2. Integration with Modern Modules
- The dimension cardinality indicates the number of groups within a filter. This dimension changes filters from single-branch to multi-branch.
- The 3×3 convolution is replaced with the 3×3 group convolution (ResNeXt), where c indicates the number of groups.
- An SE block (SENet) is added right before the residual connection of the Res2Net module.
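As a rough illustration of how scale s and cardinality c affect the 3×3 weights, the sketch below counts convolution weights only (no bias, no batch norm, no 1×1 layers). The function names are hypothetical, and the width and scale values in the usage lines are illustrative rather than taken from the tables in the paper.

```python
def conv3x3_params(c_in: int, c_out: int, groups: int = 1) -> int:
    """Weight count of a 3x3 (group) convolution: c_out * (c_in / groups) * 9."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return 9 * (c_in // groups) * c_out


def res2net_3x3_params(width: int, scale: int, groups: int = 1) -> int:
    """3x3 weights in one Res2Net block: s - 1 convolutions of width w each.

    The first split has no convolution, which is where the saving comes from.
    """
    return (scale - 1) * conv3x3_params(width, width, groups)


def single_3x3_params(channels: int, groups: int = 1) -> int:
    """3x3 weights of a plain bottleneck acting on all channels at once."""
    return conv3x3_params(channels, channels, groups)


w, s = 26, 4                                   # illustrative width and scale
print(res2net_3x3_params(w, s))                # hierarchical 3x3 convolutions
print(single_3x3_params(w * s))                # one 3x3 conv on w * s channels
print(res2net_3x3_params(32, 4, groups=8))     # with cardinality c = 8
```

Under these assumptions, splitting into s groups shrinks the 3×3 weight count well below that of a single convolution over the same number of channels, and group convolution divides each remaining convolution by c.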
Compared with these strong baselines, models integrated with the Res2Net module still have consistent performance gains.
The proposed module with additional dimension scale can be integrated with deeper models to achieve better performance.
Performance improves as the scale s increases.
The proposed method surpasses the baseline and other methods with fewer parameters.
With the recently proposed SE block integrated into the structure, the proposed method still outperforms the ResNeXt-29, 8c64w-SE baseline while using fewer parameters.
- For scale s = 2, model capacity increases only through the additional parameters of the 1×1 filters. Thus, the performance at s = 2 is slightly worse than that obtained by increasing cardinality.
For s = 3, 4, the combination effects of the hierarchical residual-like structure produce a rich set of equivalent scales, resulting in significant performance gains.
- However, the models with s = 5, 6 show limited performance gains, possibly because images in the CIFAR dataset are too small (32×32) to contain many scales.
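The "set of equivalent scales" can be made tangible with a small receptive-field count. This is a sketch under simplifying assumptions: stride 1, no dilation, sizes measured along one axis relative to the input of the splits, and each 3×3 convolution widening the receptive field by 2. The function name `receptive_fields` is hypothetical.

```python
from typing import List, Set


def receptive_fields(s: int, kernel: int = 3) -> List[Set[int]]:
    """For each output split y_1..y_s, return the set of receptive-field sizes."""
    grow = kernel - 1
    fields: List[Set[int]] = [{1}]        # y_1 = x_1: identity, size 1
    for i in range(2, s + 1):
        incoming = {1}                    # x_i itself
        if i > 2:
            incoming |= fields[-1]        # plus everything carried by y_{i-1}
        fields.append({rf + grow for rf in incoming})
    return fields


for s in range(2, 7):
    per_split = receptive_fields(s)
    all_sizes = sorted(set().union(*per_split))
    print(s, per_split, all_sizes)
```

For s = 2 the block only offers sizes 1 and 3, while s = 4 already mixes 1, 3, 5, and 7 within a single block. Whether larger s keeps helping depends on the data, as the CIFAR observation above suggests.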
Due to stronger multi-scale ability, the Res2Net has activation maps that tend to cover the whole object on big objects such as ‘bulbul’, ‘mountain dog’, ‘ballpoint’, and ‘mosque’, while activation maps of ResNet only cover parts of objects.
2.3. Other Visual Recognition Tasks
- Different frameworks are used for different tasks with Res2Net replacing the original blocks in the backbone.
The Res2Net-based model shows a consistent improvement over its counterparts on all datasets.
The Res2Net based method tends to segment all parts of objects regardless of object size.
Some visual comparisons of salient object detection results on challenging examples are illustrated in the above figure.
[2021 TPAMI] [Res2Net] Res2Net: A New Multi-scale Backbone Architecture