Review — SKNet: Selective Kernel Networks (Image Classification)
Attention Branch for Various Kernel Sizes, Outperforms SENet
In this story, Selective Kernel Networks (SKNet), by Nanjing University of Science and Technology, Momenta, Nanjing University, and Tsinghua University, is reviewed.
In the visual cortex, the receptive field (RF) sizes of neurons in the same area (e.g., V1 region) are different, which enables the neurons to collect multi-scale spatial information in the same processing stage.
RF sizes of neurons are not fixed but modulated by stimulus.
In this paper:
- A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches.
- Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer.
- Multiple SK units are stacked to a deep network termed Selective Kernel Networks (SKNets).
This is a paper in 2019 CVPR with over 200 citations. (Sik-Ho Tsang @ Medium)
Outline
- Selective Kernel Convolution
- SKNet: Network Architecture
- Experimental Results
- Ablation Study
- Analysis and Interpretation
1. Selective Kernel Convolution
- “Selective Kernel” (SK) convolution enables the neurons to adaptively adjust their RF sizes.
- Specifically, the SK convolution is implemented via three operators: Split, Fuse and Select. The figure above shows the two-branch case.
1.1. Split
- Two transformations, F̃ and F̂, are conducted on the input feature map X to output Ũ and Û, with kernel sizes 3×3 and 5×5, respectively.
- Both F̃ and F̂ are composed of efficient grouped/depthwise convolutions, Batch Normalization (BN) and ReLU in sequence.
- For further efficiency, the conventional convolution with a 5×5 kernel is replaced with a dilated convolution with a 3×3 kernel and dilation size 2, as sketched below.
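A minimal PyTorch sketch of the Split operator for the two-branch case is given below. The module and variable names (SKSplit, branch3x3, branch5x5) are my own assumptions, not the official implementation.

```python
# A minimal PyTorch sketch of the Split operator (two-branch case).
# Names (SKSplit, branch3x3, branch5x5) are assumptions, not the official code.
import torch
import torch.nn as nn

class SKSplit(nn.Module):
    """Two grouped-conv branches: a plain 3x3, and a 3x3 with dilation 2
    that stands in for the 5x5 kernel, each followed by BN and ReLU."""
    def __init__(self, channels, groups=32):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Dilated 3x3 conv (dilation=2) gives a 5x5 effective receptive field.
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=2, dilation=2,
                      groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.branch3x3(x), self.branch5x5(x)  # U-tilde, U-hat

# Usage: the channel count must be divisible by the group number.
u_tilde, u_hat = SKSplit(64, groups=32)(torch.randn(2, 64, 56, 56))
```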
1.2. Fuse
- First, the results from the multiple branches (two in the figure above) are fused via an element-wise summation: U = Ũ + Û.
- Global average pooling is applied to gather the global information. Specifically, the c-th element of s is calculated by shrinking U through its spatial dimensions H×W: s_c = F_gp(U_c) = 1/(H×W) · Σ_i Σ_j U_c(i, j).
- Then, s is compressed into a compact feature z by a simple fully connected (fc) layer, with a reduction of dimensionality for better efficiency: z = F_fc(s) = δ(β(W·s)), where δ is the ReLU activation, β is Batch Normalization (BN), and W ∈ R^(d×C).
- To study the impact of d on the efficiency of the model, a reduction ratio r is used to control its value: d = max(C/r, L), where L denotes the minimal value of d (L = 32 is a typical setting in the experiments). The Fuse step is sketched below.
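Below is a minimal sketch of the Fuse step under the definitions above; the class name SKFuse and the use of nn.Linear for the fc layer are my assumptions.

```python
# A minimal PyTorch sketch of the Fuse operator.
# The class name SKFuse and the nn.Linear-based fc layer are assumptions.
import torch
import torch.nn as nn

class SKFuse(nn.Module):
    """Element-wise sum of the branches, global average pooling to s,
    then z = ReLU(BN(W s)) with d = max(C / r, L)."""
    def __init__(self, channels, r=16, L=32):
        super().__init__()
        d = max(channels // r, L)
        self.reduce = nn.Sequential(
            nn.Linear(channels, d, bias=False),  # W in R^(d x C)
            nn.BatchNorm1d(d),                   # beta
            nn.ReLU(inplace=True),               # delta
        )

    def forward(self, u_tilde, u_hat):
        u = u_tilde + u_hat            # U = U-tilde + U-hat
        s = u.mean(dim=(2, 3))         # global average pooling -> (N, C)
        z = self.reduce(s)             # compact feature -> (N, d)
        return u, z

# Usage (batch size > 1 so that BatchNorm1d has statistics during training).
u, z = SKFuse(64)(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```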
1.3. Select
- A soft attention across channels is used to adaptively select different spatial scales of information, guided by the compact feature descriptor z. Specifically, a softmax operator is applied on the channel-wise digits: a_c = e^(A_c·z)/(e^(A_c·z) + e^(B_c·z)) and b_c = e^(B_c·z)/(e^(A_c·z) + e^(B_c·z)), where A, B ∈ R^(C×d) and a_c, b_c denote the soft attention for Ũ and Û, respectively.
- In the case of two branches, the matrix B is redundant because a_c + b_c = 1.
- The final feature map V is obtained through the attention weights on the various kernels: V_c = a_c·Ũ_c + b_c·Û_c.
- The above formulas are for the two-branch case; one can easily deduce the situations with more branches. The Select step is sketched below for a general number of branches M.
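A minimal sketch of the Select step, written for M branches so that it also covers the three-branch case in the next subsection; the class name SKSelect and the tensor layout are my own assumptions.

```python
# A minimal PyTorch sketch of the Select operator for M branches.
# The class name SKSelect and the tensor layout are assumptions, not the official code.
import torch
import torch.nn as nn

class SKSelect(nn.Module):
    """Channel-wise softmax attention across M branches, guided by z,
    followed by a weighted sum of the branch feature maps."""
    def __init__(self, channels, d, M=2):
        super().__init__()
        self.M, self.channels = M, channels
        # One (C x d) matrix per branch (A, B, ... in the paper), stacked into one fc layer.
        self.fc = nn.Linear(d, channels * M, bias=False)

    def forward(self, branches, z):
        # branches: list of M tensors of shape (N, C, H, W); z: (N, d)
        n = z.size(0)
        logits = self.fc(z).view(n, self.M, self.channels)   # (N, M, C)
        attn = torch.softmax(logits, dim=1)                   # across branches: a_c + b_c = 1
        attn = attn.view(n, self.M, self.channels, 1, 1)
        u = torch.stack(branches, dim=1)                      # (N, M, C, H, W)
        return (attn * u).sum(dim=1)                          # V_c = a_c*U-tilde_c + b_c*U-hat_c

# Usage: two branches with C = 64 and d = 32.
v = SKSelect(64, d=32)([torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)],
                       torch.randn(2, 32))
```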
1.4. Three-Branch Case
- The figure and link above illustrate the 3-branch case.
2. SKNet: Network Architecture
2.1. Network Architecture
- Similar to the ResNeXt, the proposed SKNet is mainly composed of a stack of repeated bottleneck blocks, which are termed “SK units”.
- Each SK unit consists of a sequence of 1×1 convolution, SK convolution and 1×1 convolution.
- In general, all the large kernel convolutions in the original bottleneck blocks in ResNeXt are replaced by the proposed SK convolutions.
- Compared with ResNeXt-50, SKNet-50 leads to only a 10% increase in the number of parameters and a 5% increase in computational cost (an SK unit is sketched below).
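For illustration, here is a hedged sketch of one SK unit, built around a compact SKConv that folds together the Split/Fuse/Select steps of Section 1; the class names, the shortcut projection, and the omission of stride handling are my assumptions, not the official implementation.

```python
# A minimal PyTorch sketch of one SK unit: 1x1 conv -> SK conv -> 1x1 conv, plus a residual.
# SKConv is a compact stand-in for Split/Fuse/Select; names and the shortcut are assumptions.
import torch
import torch.nn as nn

class SKConv(nn.Module):
    def __init__(self, c, M=2, G=32, r=16, L=32):
        super().__init__()
        d = max(c // r, L)
        self.M, self.c = M, c
        # Split: branch m uses a 3x3 grouped conv with dilation m+1 (effective kernel 3, 5, ...).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, c, 3, padding=m + 1, dilation=m + 1, groups=G, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for m in range(M)])
        # Fuse: fc reduction to d; Select: fc expansion to M*c attention logits.
        self.reduce = nn.Sequential(nn.Linear(c, d, bias=False),
                                    nn.BatchNorm1d(d), nn.ReLU(inplace=True))
        self.expand = nn.Linear(d, c * M, bias=False)

    def forward(self, x):
        u = torch.stack([b(x) for b in self.branches], dim=1)        # (N, M, C, H, W)
        z = self.reduce(u.sum(dim=1).mean(dim=(2, 3)))                # fuse: sum + GAP + fc
        attn = torch.softmax(self.expand(z).view(-1, self.M, self.c), dim=1)
        return (attn.unsqueeze(-1).unsqueeze(-1) * u).sum(dim=1)      # select

class SKUnit(nn.Module):
    """Bottleneck: 1x1 conv -> SK conv -> 1x1 conv, with a residual connection."""
    def __init__(self, in_c, mid_c, out_c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_c, mid_c, 1, bias=False), nn.BatchNorm2d(mid_c), nn.ReLU(inplace=True),
            SKConv(mid_c),
            nn.Conv2d(mid_c, out_c, 1, bias=False), nn.BatchNorm2d(out_c))
        self.shortcut = (nn.Identity() if in_c == out_c else nn.Sequential(
            nn.Conv2d(in_c, out_c, 1, bias=False), nn.BatchNorm2d(out_c)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

# Usage: y = SKUnit(256, 128, 256)(torch.randn(2, 256, 56, 56))
```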
2.2. Hyperparameters & SKNet Variants
- Three important hyper-parameters determine the final setting of the SK convolutions.
- Number of paths M determines the number of choices of different kernels to be aggregated.
- Group number G controls the cardinality of each path.
- Reduction ratio r controls the number of parameters in the Fuse operator.
- One typical setting of SK[M, G, r] is SK[2, 32, 16].
- The 50-layer SKNet-50 has four stages with {3, 4, 6, 3} SK units, respectively.
- SKNet-26 has {2, 2, 2, 2} SK units and SKNet-101 has {3, 4, 23, 3} SK units (a quick summary of these layouts is sketched after this list).
- SK convolutions can be applied to other lightweight networks, e.g., MobileNet, ShuffleNet, in which 3×3 depthwise convolutions are extensively used.
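As a quick sanity check on the layer counts above, the variants can be summarized as plain data; the names SK_SETTING, STAGE_UNITS and depth below are my own, while the numbers are taken from the text.

```python
# Stage layouts of the SKNet variants and the typical SK[M, G, r] setting.
# Dictionary/function names are assumptions; the numbers come from the paper.
SK_SETTING = {"M": 2, "G": 32, "r": 16}   # paths, cardinality per path, reduction ratio

STAGE_UNITS = {
    "SKNet-26":  [2, 2, 2, 2],
    "SKNet-50":  [3, 4, 6, 3],
    "SKNet-101": [3, 4, 23, 3],
}

def depth(units):
    # Each bottleneck SK unit holds 3 weighted layers (1x1, SK, 1x1),
    # plus the stem convolution and the final fc classifier.
    return 3 * sum(units) + 2

assert all(depth(u) == int(name.split("-")[1]) for name, u in STAGE_UNITS.items())
```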
3. Experimental Results
3.1. ImageNet
- SKNets consistently improve performance over the state-of-the-art attention-based CNNs under similar budgets.
- Remarkably, SKNet-50 outperforms ResNeXt-101 by more than an absolute 0.32%, although ResNeXt-101 is 60% larger in parameters and 80% larger in computation.
- With comparable or lower complexity than the InceptionNets, SKNets achieve more than an absolute 1.5% performance gain, which demonstrates the superiority of adaptive aggregation of multiple kernels.
- Using slightly fewer parameters, SKNets obtain 0.3~0.4% gains over their SENet counterparts in both the 224×224 and 320×320 evaluations.
3.2. Selective Kernel vs. Depth/Width/Cardinality on ImageNet
- For fair comparison, the complexity of ResNeXt is also increased by changing its depth, width and cardinality, to match the complexity of SKNets.
- However, the improvement is marginal when going deeper (0.19% from ResNeXt-50 to ResNeXt-53) or wider (0.1% from ResNeXt-50 to ResNeXt-50 wider), or with slightly more cardinality (0.23% from ResNeXt-50 (32×4d) to ResNeXt-50 (36×4d)).
- In contrast, SKNet-50 obtains a 1.44% absolute improvement over the baseline ResNeXt-50, which indicates that SK convolution is very efficient.
3.3. Performance With Respect to the Number of Parameters
- It is seen that SKNets utilize parameters more efficiently than these models. For instance, achieving ~20.2% top-1 error, SKNet-101 needs 22% fewer parameters than DPN-98.
3.4. Lightweight Models
- SK convolutions not only boost the accuracy of baselines significantly but also perform better than SE.
- This indicates the great potential of the SK convolutions in applications on low-end devices.
3.5. CIFAR
4. Ablation Study
4.1. The Dilation D and Group Number G
- To study their effects, the two-branch case is used, with the first kernel branch of SKNet-50 fixed to a 3×3 filter with dilation D=1 and group number G=32.
- The optimal settings for the other branch are those with an effective kernel size of 5×5, which shows that it is beneficial to use different kernel sizes.
4.2. Combination of Different Kernels
- K5 and K7 denote 5×5 and 7×7 kernels, implemented as dilated 3×3 filters (dilation 2 and 3, respectively).
- When the number of paths M increases, in general the recognition error decreases.
- Whether M = 2 or M = 3, SK attention-based aggregation of multiple paths always achieves a lower top-1 error than the simple aggregation method.
- Using SK attention, the performance gain of the model from M = 2 to M = 3 is marginal (the top-1 error decreases from 20.79% to 20.76%). For better trade-off between performance and efficiency, M = 2 is preferred.
5. Analysis and Interpretation
- The above figures show the attention values in all channels for two randomly sampled images in SK 3_4.
- In most channels, when the target object enlarges, the attention weight for the large kernel (5×5) increases, which suggests that the RF sizes of the neurons are adaptively enlarged, as expected.
- Another surprising pattern is found about the role of adaptive selection across depth.
- The larger the target object is, the more attention will be assigned to larger kernels by the Selective Kernel mechanism in low and middle level stages (e.g., SK 2_3, SK 3_4). However, at much higher layers (e.g., SK 5_3), all scale information is getting lost and such a pattern disappears.
- The importance of kernel 5×5 consistently and simultaneously increases when the scale of targets grows.
- In the early parts of the network, appropriate kernel sizes can be selected according to the semantic awareness of the objects’ sizes, which efficiently adjusts the RF sizes of these neurons.
- However, such a pattern does not exist in the very high layers like SK 5_3, since for the high-level representation, “scale” is partially encoded in the feature vector, and the kernel size matters less than in lower layers.
References
[2019 CVPR] [SKNet]
Selective Kernel Networks
[SKNet: 3-Branch Case]
[Attention Mechanisms in CV] SKNet, an Evolved Version of SENet
Image Classification
1989–1998: [LeNet]
2012–2014: [AlexNet & CaffeNet] [Dropout] [Maxout] [NIN] [ZFNet] [SPPNet]
2015: [VGGNet] [Highway] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2]
2016: [SqueezeNet] [Inception-v3] [ResNet] [Pre-Activation ResNet] [RiR] [Stochastic Depth] [WRN] [Trimps-Soushen]
2017: [Inception-v4] [Xception] [MobileNetV1] [Shake-Shake] [Cutout] [FractalNet] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [IGCNet / IGCV1] [Deep Roots]
2018: [RoR] [DMRNet / DFN-MR] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [ENAS] [PNASNet] [ShuffleNet V2] [BAM] [CBAM] [MorphNet] [NetAdapt] [mixup] [DropBlock] [Group Norm (GN)]
2019: [ResNet-38] [AmoebaNet] [ESPNetv2] [MnasNet] [Single-Path NAS] [DARTS] [ProxylessNAS] [MobileNetV3] [FBNet] [ShakeDrop] [CutMix] [MixConv] [EfficientNet] [ABN] [SKNet]
2020: [Random Erasing (RE)] [SAOL]