Review — SKNet: Selective Kernel Networks (Image Classification)

Attention Over Branches With Various Kernel Sizes, Outperforming SENet

In this story, Selective Kernel Networks (SKNet), by Nanjing University of Science and Technology, Momenta, Nanjing University, and Tsinghua University, is reviewed.

  • In the Selective Kernel (SK) unit, multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer.
  • Multiple SK units are stacked into a deep network termed Selective Kernel Networks (SKNets).

Outline

  1. Selective Kernel Convolution
  2. SKNet: Network Architecture
  3. Experimental Results
  4. Ablation Study
  5. Analysis and Interpretation

1. Selective Kernel Convolution

Selective Kernel Convolution
  • Specifically, the SK convolution is implemented via three operators: Split, Fuse and Select. The figure above shows a two-branch case.

1.1. Split

  • Two transformations, F̃: X → Ũ and F̂: X → Û, are conducted on the input feature map X, with kernel sizes 3×3 and 5×5, respectively.
  • Both F̃ and F̂ are composed of efficient grouped/depthwise convolutions, Batch Normalization (BN) and ReLU in sequence.
  • For further efficiency, the conventional convolution with the 5×5 kernel is replaced with a dilated convolution with a 3×3 kernel and dilation 2 (see the sketch below).
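
Below is a minimal PyTorch sketch of the Split operator under these settings (the name SplitBranches and the groups=32 default are my assumptions, not the paper's code): two grouped 3×3 convolutions, one with dilation 2 to realize the 5×5 receptive field, each followed by BN and ReLU.

```python
import torch.nn as nn

class SplitBranches(nn.Module):
    """Split: run the input through two parallel grouped convolutions
    with different effective kernel sizes (3x3, and 5x5 realized as a
    3x3 convolution with dilation 2), each followed by BN and ReLU."""
    def __init__(self, channels, groups=32):
        super().__init__()
        def branch(dilation):
            return nn.Sequential(
                # padding == dilation keeps the spatial size unchanged
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=dilation, dilation=dilation,
                          groups=groups, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.f_tilde = branch(dilation=1)  # effective 3x3 kernel
        self.f_hat = branch(dilation=2)    # effective 5x5 kernel

    def forward(self, x):
        return self.f_tilde(x), self.f_hat(x)  # U-tilde, U-hat
```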

1.2. Fuse

  • First, the results from the multiple branches (two in the figure above) are fused via an element-wise summation: U = Ũ + Û.
  • Global average pooling then embeds U into channel-wise statistics s, and a fully connected layer (with BN and ReLU δ) compresses s into a compact feature z of dimension d: z = F_fc(s) = δ(B(Ws)), where W ∈ R^(d×C).
  • To study the impact of d on the efficiency of the model, a reduction ratio r is used to control its value: d = max(C/r, L), where L denotes the minimal value of d (L = 32 is a typical setting). See the sketch below.
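
A minimal sketch of the Fuse operator under these definitions (FuseOperator is my naming): the branch outputs are summed, global average pooling produces s, and a Linear-BN-ReLU stack implements z = δ(B(Ws)) with d = max(C/r, L).

```python
import torch.nn as nn

class FuseOperator(nn.Module):
    """Fuse: element-wise sum of the branches, global average pooling
    to channel statistics s, then an FC-BN-ReLU compression to the
    compact feature z, whose dimension is d = max(C / r, L)."""
    def __init__(self, channels, r=16, L=32):
        super().__init__()
        d = max(channels // r, L)
        self.fc = nn.Sequential(
            nn.Linear(channels, d, bias=False),
            nn.BatchNorm1d(d),
            nn.ReLU(inplace=True),
        )
        self.d = d

    def forward(self, u_tilde, u_hat):
        u = u_tilde + u_hat     # element-wise summation: U = U~ + U^
        s = u.mean(dim=(2, 3))  # global average pooling -> (N, C)
        return self.fc(s)       # compact feature z -> (N, d)
```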

1.3. Select

  • A soft attention across channels is used to adaptively select different spatial scales of information, guided by the compact feature descriptor z. Specifically, a softmax operator is applied on the channel-wise digits: a_c = e^(A_c z) / (e^(A_c z) + e^(B_c z)) and b_c = e^(B_c z) / (e^(A_c z) + e^(B_c z)), where A, B ∈ R^(C×d), and a, b denote the soft attention vectors for Ũ and Û.
  • The final feature map V is obtained through the attention weights on the various kernels: V_c = a_c · Ũ_c + b_c · Û_c, where a_c + b_c = 1 (see the sketch below).
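
A minimal sketch of the Select operator for the two-branch case (SelectOperator, A and B are my namings, following the equations above):

```python
import torch
import torch.nn as nn

class SelectOperator(nn.Module):
    """Select: channel-wise softmax over the two branches, guided by z,
    then a weighted sum of the branch feature maps."""
    def __init__(self, channels, d):
        super().__init__()
        self.A = nn.Linear(d, channels, bias=False)  # logits A_c z for U~
        self.B = nn.Linear(d, channels, bias=False)  # logits B_c z for U^

    def forward(self, z, u_tilde, u_hat):
        logits = torch.stack([self.A(z), self.B(z)], dim=1)  # (N, 2, C)
        attn = torch.softmax(logits, dim=1)  # per channel: a_c + b_c = 1
        a = attn[:, 0, :, None, None]        # (N, C, 1, 1)
        b = attn[:, 1, :, None, None]
        return a * u_tilde + b * u_hat       # V_c = a_c*U~_c + b_c*U^_c
```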

1.4. Three-Branch Case

Image from "[Attention Mechanisms in CV] SKNet, an Evolution of SENet": https://www.jianshu.com/p/20552b8da40d

  • In the three-branch case, the softmax in Select is simply taken over three channel-wise attention vectors, one per kernel, so the weights still sum to 1 for each channel.
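
Putting Split, Fuse and Select together, here is a hedged sketch of a complete SK convolution for M branches (the name SKConv, and extending the dilation trick so that branch i uses dilation i+1, are my assumptions following the two-branch design above):

```python
import torch
import torch.nn as nn

class SKConv(nn.Module):
    """Sketch of a full SK convolution with M branches: branch i is a
    grouped 3x3 convolution with dilation i+1 (effective kernels 3x3,
    5x5, 7x7, ...); Fuse and Select follow the equations above."""
    def __init__(self, channels, M=2, groups=32, r=16, L=32):
        super().__init__()
        d = max(channels // r, L)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=i + 1,
                          dilation=i + 1, groups=groups, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for i in range(M)
        )
        self.fc = nn.Sequential(
            nn.Linear(channels, d, bias=False),
            nn.BatchNorm1d(d),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.Linear(d, channels * M, bias=False)
        self.M, self.C = M, channels

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], 1)  # (N,M,C,H,W)
        z = self.fc(feats.sum(1).mean(dim=(2, 3)))             # Fuse
        a = torch.softmax(self.attn(z).view(-1, self.M, self.C), dim=1)
        return (feats * a[..., None, None]).sum(1)             # Select -> V
```

With M = 2 this reduces to the two-branch case above; the softmax is taken across the M branches for each channel.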

2. SKNet: Network Architecture

2.1. Network Architecture

Left: ResNeXt-50 with a 32×4d template, Middle: SENet-50 based on the ResNeXt-50 backbone, Right: SKNet-50
  • Each SK unit consists of a sequence of 1×1 convolution, SK convolution and 1×1 convolution (see the sketch after this list).
  • In general, all the large-kernel convolutions in the original bottleneck blocks of ResNeXt are replaced by the proposed SK convolutions.
  • SKNet-50 leads to only a 10% increase in the number of parameters and a 5% increase in computational cost compared with ResNeXt-50.
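
As a rough illustration of this block structure (a sketch, not the reference implementation; SKUnit is my naming, it reuses the SKConv sketch from Section 1.4, and strided/projection-shortcut variants are omitted):

```python
import torch.nn as nn

class SKUnit(nn.Module):
    """Sketch of an SK unit: 1x1 conv -> SK conv -> 1x1 conv, with a
    residual connection (identity shortcut; in == out channels)."""
    def __init__(self, channels, mid_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            SKConv(mid_channels),  # replaces the large-kernel grouped conv
            nn.Conv2d(mid_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))
```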

2.2. Hyperparameters & SKNet Variants

  • Three important hyper-parameters are determined in SKNet.
  • Number of paths M determines the number of choices of different kernels to be aggregated.
  • Group number G controls the cardinality of each path.
  • Reduction ratio r controls the number of parameters in the Fuse operator.
  • One typical setting of SK[M, G, r] is SK[2, 32, 16].
  • The 50-layer SKNet-50 has four stages with {3, 4, 6, 3} SK units, respectively.
  • SKNet-26 has {2, 2, 2, 2} SK units, and SKNet-101 has {3, 4, 23, 3} SK units (summarized in the configuration sketch after this list).
  • SK convolutions can be applied to other lightweight networks, e.g., MobileNet, ShuffleNet, in which 3×3 depthwise convolutions are extensively used.
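
As a compact, hypothetical summary of these variants (the dictionary name is mine; the values follow the settings above):

```python
# SK[M, G, r] hyper-parameters plus SK units per stage
SKNET_CONFIGS = {
    "SKNet-26":  {"M": 2, "G": 32, "r": 16, "stages": (2, 2, 2, 2)},
    "SKNet-50":  {"M": 2, "G": 32, "r": 16, "stages": (3, 4, 6, 3)},
    "SKNet-101": {"M": 2, "G": 32, "r": 16, "stages": (3, 4, 23, 3)},
}
```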

3. Experimental Results

3.1. ImageNet

Comparisons to the state-of-the-arts under roughly identical complexity on ImageNet validation set
  • Remarkably, SKNet-50 outperforms ResNeXt-101 by more than an absolute 0.32%, although ResNeXt-101 has 60% more parameters and 80% more computation.
  • With comparable or lower complexity than InceptionNets, SKNets achieve more than an absolute 1.5% performance gain, which demonstrates the superiority of adaptively aggregating multiple kernels.

3.2. Selective Kernel vs. Depth/Width/Cardinality on ImageNet

Comparisons on ImageNet validation set when the computational cost of model with more depth/width/cardinality is increased to match that of SKNet
  • In contrast to SKNet-50's gain, the improvement is marginal when simply going deeper (0.19% from ResNeXt-50 to ResNeXt-53) or wider (0.1% from ResNeXt-50 to ResNeXt-50 wider), or with slightly more cardinality (0.23% from ResNeXt-50 (32×4d) to ResNeXt-50 (36×4d)).

3.3. Performance With Respect to the Number of Parameters

Top-1 Error (%) vs Number of Parameters on ImageNet validation set

3.4. Lightweight Models

Top-1 errors (%) for Lightweight Models on ImageNet validation set
  • The consistent gains shown above indicate the great potential of SK convolutions in applications on low-end devices.

3.5. CIFAR

Top-1 errors (%, average of 10 runs) on CIFAR. SENet-29 and SKNet-29 are both based on ResNeXt-29, 16×32d

4. Ablation Study

4.1. The Dilation D and Group Number G

Results of SKNet-50 with different settings in the second branch, while the setting of the first kernel is fixed
  • The optimal settings for the second branch are those whose resulting effective kernel size is 5×5 (e.g., a 3×3 kernel with dilation 2), which shows that it is beneficial to use different kernel sizes in the two branches.

4.2. Combination of Different Kernels

Results of SKNet-50 with different combinations of multiple kernels
  • When the number of paths M increases, in general the recognition error decreases.
  • Whether M = 2 or M = 3, SK attention-based aggregation of multiple paths always achieves lower top-1 error than the simple aggregation method.
  • Using SK attention, the performance gain of the model from M = 2 to M = 3 is marginal (the top-1 error decreases from 20.79% to 20.76%). For better trade-off between performance and efficiency, M = 2 is preferred.

5. Analysis and Interpretation

Attention results for two randomly sampled images with three differently sized targets
Average results over all image instances in the ImageNet validation set
  • The larger the target object is, the more attention is assigned to larger kernels by the Selective Kernel mechanism in low- and middle-level stages (e.g., SK 2_3, SK 3_4). However, at much higher layers (e.g., SK 5_3), all scale information is lost and such a pattern disappears.
Average mean attention difference (mean attention value of kernel 5×5 minus that of kernel 3×3) on SK units of SKNet-50, for each of 1,000 categories using all validation samples on ImageNet

