Review — SKNet: Selective Kernel Networks (Image Classification)

Attention Branch for Various Kernel Sizes, Outperforms SENet

Sik-Ho Tsang
8 min read · Feb 15, 2021

In this story, Selective Kernel Networks, SKNet, by Nanjing University of Science and Technology, Momenta, Nanjing University, and Tsinghua University, is reviewed.

In the visual cortex, the receptive field (RF) sizes of neurons in the same area (e.g., V1 region) are different, which enables the neurons to collect multi-scale spatial information in the same processing stage.

RF sizes of neurons are not fixed but modulated by stimulus.

In this paper:

  • A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches.
  • Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer.
  • Multiple SK units are stacked into a deep network termed Selective Kernel Networks (SKNets).

This is a paper in 2019 CVPR with over 200 citations. (Sik-Ho Tsang @ Medium)


  1. Selective Kernel Convolution
  2. SKNet: Network Architecture
  3. Experimental Results
  4. Ablation Study
  5. Analysis and Interpretation

1. Selective Kernel Convolution

Selective Kernel Convolution
  • “Selective Kernel” (SK) convolution enables the neurons to adaptively adjust their RF sizes.
  • Specifically, the SK convolution is implemented via three operators: Split, Fuse and Select. The figure above shows the two-branch case.

1.1. Split

  • Two transformations ~F and ^F are conducted on input feature map X, to output ~U and ^U, with kernel sizes 3 and 5, respectively.
  • Both ~F and ^F are composed of efficient grouped/depthwise convolutions, Batch Normalization (BN) and ReLU in sequence.
  • For further efficiency, the conventional convolution with a 5×5 kernel is replaced with the dilated convolution with a 3×3 kernel and dilation size 2.
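
The replacement of the 5×5 kernel by a dilated 3×3 kernel can be checked with a little arithmetic. The sketch below uses the standard effective-extent formula for dilated convolutions, k + (k − 1)(dilation − 1); the function names are illustrative and are not taken from the paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (dilation - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter_slice(kernel_size: int) -> int:
    """Weights in one 2-D k x k filter slice; dilation adds no weights."""
    return kernel_size * kernel_size


# A 3x3 kernel with dilation 2 spans the same 5x5 window as a plain 5x5 kernel.
print(effective_kernel_size(3, 2))                               # 5
# It does so with 9 weights per slice instead of 25.
print(weights_per_filter_slice(3), weights_per_filter_slice(5))  # 9 25
```

The dilated branch therefore sees a 5×5 neighbourhood while keeping the parameter count of a 3×3 convolution, which is the efficiency argument made above.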

1.2. Fuse

  • First, the results from multiple (two in the above figure) branches are fused via an element-wise summation: U = ~U + ^U.
  • Global average pooling is then applied to embed the global information as channel-wise statistics s. Specifically, the c-th element of s is calculated by shrinking U_c over the spatial dimensions H×W: s_c = F_gp(U_c) = (1 / (H×W)) Σ_i Σ_j U_c(i, j), with i = 1..H and j = 1..W.
  • Then, s is compressed into a compact feature z by a simple fully connected (fc) layer, which reduces the dimensionality for better efficiency: z = F_fc(s) = δ(β(Ws)),
  • where δ is the ReLU, β is Batch Normalization (BN), and W is a d×C weight matrix, so z has d elements.
  • To study the impact of d on the efficiency of the model, a reduction ratio r is used to control its value: d = max(C/r, L),
  • where L denotes the minimal value of d (L = 32 is a typical setting in the experiments).
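
The Fuse steps above can be sketched in plain Python on nested lists. This is a minimal illustration, not the authors' implementation: the feature maps are toy values, the learned fc layer and BN are left out, and the use of integer division for C/r is an assumption.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def fuse_sum(u_tilde: FeatureMap, u_hat: FeatureMap) -> FeatureMap:
    """Element-wise summation of the two branch outputs."""
    return [
        [[t + h for t, h in zip(row_t, row_h)] for row_t, row_h in zip(ch_t, ch_h)]
        for ch_t, ch_h in zip(u_tilde, u_hat)
    ]


def global_average_pool(u: FeatureMap) -> List[float]:
    """Shrink each channel over its H x W positions to a single scalar s_c."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def reduced_dim(channels: int, r: int = 16, min_dim: int = 32) -> int:
    """Size d of the compact feature z, assuming d = max(C // r, L)."""
    return max(channels // r, min_dim)


# Toy inputs (hypothetical): 2 channels, 2x2 spatial size.
u_tilde = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
u_hat = [[[1.0, 0.0], [1.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]

s = global_average_pool(fuse_sum(u_tilde, u_hat))
print(s)                                        # [3.0, 2.0]
print(reduced_dim(256), reduced_dim(2048))      # 32 128
```

With r = 16 and L = 32, the floor L only takes effect for layers with fewer than 512 channels; wider layers get d = C/16.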

1.3. Select

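  • A soft attention across channels is used to adaptively select different spatial scales of information, guided by the compact feature descriptor z. Specifically, a softmax operator is applied to the channel-wise digits: a_c = e^(A_c z) / (e^(A_c z) + e^(B_c z)) and b_c = e^(B_c z) / (e^(A_c z) + e^(B_c z)), where A and B are C×d matrices, A_c and B_c are their c-th rows, and a and b are the soft attention vectors for ~U and ^U.
  • In the case of two branches, the matrix B is redundant because a_c + b_c = 1.
  • The final feature map V is obtained by applying the attention weights to the outputs of the different kernels: V_c = a_c · ~U_c + b_c · ^U_c.
  • The above formula is for the two-branch case; situations with more branches can be deduced easily.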
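
As a rough sketch of the two-branch Select step, the snippet below computes per-channel softmax weights from z and blends the two branch outputs. The matrices A and B, the vector z, and the feature maps are hypothetical toy values; in the real network A and B are learned.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def attention_weights(A: Matrix, B: Matrix, z: Vector) -> Tuple[Vector, Vector]:
    """Per-channel softmax over the two logits A_c . z and B_c . z."""
    a, b = [], []
    for row_a, row_b in zip(A, B):
        logit_a = sum(w * x for w, x in zip(row_a, z))
        logit_b = sum(w * x for w, x in zip(row_b, z))
        peak = max(logit_a, logit_b)  # subtract the max for numerical stability
        exp_a, exp_b = math.exp(logit_a - peak), math.exp(logit_b - peak)
        a.append(exp_a / (exp_a + exp_b))
        b.append(exp_b / (exp_a + exp_b))
    return a, b


def select(u_tilde: FeatureMap, u_hat: FeatureMap, a: Vector, b: Vector) -> FeatureMap:
    """V_c = a_c * ~U_c + b_c * ^U_c at every spatial position of channel c."""
    return [
        [[a_c * t + b_c * h for t, h in zip(row_t, row_h)]
         for row_t, row_h in zip(ch_t, ch_h)]
        for ch_t, ch_h, a_c, b_c in zip(u_tilde, u_hat, a, b)
    ]


# Toy values (hypothetical): C = 2 channels, d = 2, 1x1 spatial size.
z = [1.0, 0.0]
A = [[0.0, 0.0], [2.0, 0.0]]
B = [[0.0, 0.0], [0.0, 0.0]]
a, b = attention_weights(A, B, z)
V = select([[[2.0]], [[2.0]]], [[[4.0]], [[4.0]]], a, b)
print(a, b)  # channel 0 is split evenly, channel 1 favours the first branch
print(V)
```

Because a_c + b_c = 1 for every channel, each output channel is a convex combination of the two branch outputs.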

1.4. Three-Branch Case

Image from 【CV中的Attention机制】SKNet-SENet的进化版 (“Attention Mechanisms in CV: SKNet, an Evolved Version of SENet”)
  • The figure and link above illustrate the 3-branch case.
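
For more than two branches, the same idea generalizes to a softmax over M branch logits per channel. The sketch below assumes that generalization and uses one scalar per channel in place of a full feature map; all numbers are hypothetical.

```python
import math
from typing import List


def branch_attention(logits: List[List[float]]) -> List[List[float]]:
    """Softmax over branches: logits[m][c] -> weights[m][c], summing to 1 over m."""
    num_branches, num_channels = len(logits), len(logits[0])
    weights = [[0.0] * num_channels for _ in range(num_branches)]
    for c in range(num_channels):
        column = [logits[m][c] for m in range(num_branches)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for m in range(num_branches):
            weights[m][c] = exps[m] / total
    return weights


def aggregate(weights: List[List[float]], values: List[List[float]]) -> List[float]:
    """V_c = sum over branches m of weights[m][c] * values[m][c]."""
    num_channels = len(weights[0])
    return [
        sum(w[c] * v[c] for w, v in zip(weights, values))
        for c in range(num_channels)
    ]


# Three branches, two channels (hypothetical logits and branch outputs).
logits = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
values = [[3.0, 1.0], [6.0, 1.0], [9.0, 1.0]]
weights = branch_attention(logits)
fused = aggregate(weights, values)
print(weights)
print(fused)
```

Equal logits in channel 0 give each branch a weight of one third, so the fused value is the plain average of the three branch outputs.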

2. SKNet: Network Architecture

2.1. Network Architecture

Left: ResNeXt-50 with a 32×4d template, Middle: SENet-50 based on the ResNeXt-50 backbone, Right: SKNet-50
  • Similar to the ResNeXt, the proposed SKNet is mainly composed of a stack of repeated bottleneck blocks, which are termed “SK units”.
  • Each SK unit consists of a sequence of 1×1 convolution, SK convolution and 1×1 convolution.
  • In general, all the large kernel convolutions in the original bottleneck blocks in ResNeXt are replaced by the proposed SK convolutions.
  • SKNet-50 leads to only a 10% increase in the number of parameters and a 5% increase in computational cost, compared with ResNeXt-50.

2.2. Hyperparameters & SKNet Variants

  • Three important hyper-parameters determine the final setting of the SK convolutions.
  • Number of paths M determines the number of choices of different kernels to be aggregated.
  • Group number G controls the cardinality of each path.
  • Reduction ratio r controls the number of parameters in the fuse operator.
  • One typical setting of SK[M, G, r] is SK[2, 32, 16].
  • 50-layer SKNet-50 has four stages with {3,4,6,3} SK units, respectively.
  • SKNet-26 has {2,2,2,2} SK units, and SKNet-101 has {3,4,23,3} SK units.
  • SK convolutions can be applied to other lightweight networks, e.g., MobileNet, ShuffleNet, in which 3×3 depthwise convolutions are extensively used.
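
The settings above can be captured in a small configuration sketch. The class and dictionary names are illustrative only. The depth check assumes the usual ResNet-style counting, which the review does not state explicitly: three weighted layers per SK unit (1×1 convolution, SK convolution, 1×1 convolution) plus one stem convolution and one final fc layer.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SKConfig:
    """SK[M, G, r]: number of paths, group number, reduction ratio."""
    M: int = 2
    G: int = 32
    r: int = 16


# SK units per stage, as listed above.
STAGES: Dict[str, Tuple[int, int, int, int]] = {
    "SKNet-26": (2, 2, 2, 2),
    "SKNet-50": (3, 4, 6, 3),
    "SKNet-101": (3, 4, 23, 3),
}


def depth(units_per_stage: Tuple[int, ...]) -> int:
    """Assumed layer count: 3 layers per SK unit + stem conv + final fc."""
    return 3 * sum(units_per_stage) + 2


typical = SKConfig()  # SK[2, 32, 16]
for name, stages in STAGES.items():
    print(name, stages, depth(stages))
```

Under that counting assumption, the three stage layouts give 26, 50 and 101 layers, matching the model names.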

3. Experimental Results

3.1. ImageNet

Comparisons to the state of the art under roughly identical complexity on the ImageNet validation set
  • SKNets consistently improve performance over the state-of-the-art attention-based CNNs under similar budgets.
  • Remarkably, SKNet-50 outperforms ResNeXt-101 by more than 0.32% absolute, although ResNeXt-101 is 60% larger in parameters and 80% larger in computation.
  • With comparable or lower complexity than InceptionNets, SKNets achieve an absolute performance gain of more than 1.5%, which demonstrates the superiority of adaptive aggregation for multiple kernels.

Using slightly fewer parameters, SKNets obtain 0.3~0.4% gains over their SENet counterparts in both 224×224 and 320×320 evaluations.

3.2. Selective Kernel vs. Depth/Width/Cardinality on ImageNet

Comparisons on ImageNet validation set when the computational cost of model with more depth/width/cardinality is increased to match that of SKNet
  • For fair comparison, the complexity of ResNeXt is also increased by changing its depth, width and cardinality, to match the complexity of SKNets.
  • However, the improvement is marginal when going deeper (0.19% from ResNeXt-50 to ResNeXt-53) or wider (0.1% from ResNeXt-50 to ResNeXt-50 wider), or with slightly more cardinality (0.23% from ResNeXt-50 (32×4d) to ResNeXt-50 (36×4d)).

In contrast, SKNet-50 obtains 1.44% absolute improvement over the baseline ResNeXt-50, which indicates that SK convolution is very efficient.

3.3. Performance With Respect to the Number of Parameters

Top-1 Error (%) vs Number of Parameters on ImageNet validation set
  • SKNets utilize parameters more efficiently than these models. For instance, at a top-1 error of 20.2%, SKNet-101 needs 22% fewer parameters than DPN-98.

3.4. Lightweight Models

Top-1 errors (%) for Lightweight Models on ImageNet validation set
  • SK convolutions not only boost the accuracy of baselines significantly but also perform better than SE.
  • This indicates the great potential of the SK convolutions in applications on low-end devices.

3.5. CIFAR

Top-1 errors (%, average of 10 runs) on CIFAR. SENet-29 and SKNet-29 are all based on ResNeXt-29, 16×32d
  • Notably, SKNet-29 achieves performance better than or comparable to ResNeXt-29, 16×64d with 60% fewer parameters, and it consistently outperforms SENet-29 on both CIFAR-10 and CIFAR-100 with 22% fewer parameters.

4. Ablation Study

4.1. The Dilation D and Group Number G

Results of SKNet-50 with different settings in the second branch, while the setting of the first kernel is fixed
  • To study their effects, a two-branch case is used, with the first kernel branch of SKNet-50 fixed to a 3×3 filter with dilation D=1 and group number G=32.
  • The optimal settings for the other branch are those with kernel size 5×5, which shows that using different kernel sizes is beneficial.

4.2. Combination of Different Kernels

Results of SKNet-50 with different combinations of multiple kernels
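  • K3 is the standard 3×3 grouped convolution; K5 and K7 are 3×3 convolutions with dilation 2 and 3, approximating 5×5 and 7×7 kernels, respectively.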
  • When the number of paths M increases, in general the recognition error decreases.
  • Whether M = 2 or 3, SK attention-based aggregation of multiple paths always achieves lower top-1 error than the simple aggregation method.
  • Using SK attention, the performance gain of the model from M = 2 to M = 3 is marginal (the top-1 error decreases from 20.79% to 20.76%). For better trade-off between performance and efficiency, M = 2 is preferred.

5. Analysis and Interpretation

Attention results for two randomly sampled images with three differently sized targets
  • The above figures show the attention values in all channels for two randomly sampled images in SK 3_4.

In most channels, as the target object is enlarged, the attention weight for the large kernel (5×5) increases. This suggests that the RF sizes of the neurons adaptively grow, which agrees with the expectation.

Average results over all image instances in the ImageNet validation set
  • Another surprising pattern concerns the role of adaptive selection across depth.
  • The larger the target object is, the more attention is assigned to larger kernels by the Selective Kernel mechanism in low and middle level stages (e.g., SK 2_3, SK 3_4). However, at much higher layers (e.g., SK 5_3), all scale information is lost and the pattern disappears.
Average mean attention difference (mean attention value of kernel 5×5 minus that of kernel 3×3) on SK units of SKNet-50, for each of 1,000 categories using all validation samples on ImageNet
  • The importance of kernel 5×5 consistently and simultaneously increases when the scale of targets grows.
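
The "mean attention difference" used in the figure above can be written down directly from its definition: the mean attention value of the 5×5 kernel minus that of the 3×3 kernel, taken over the channels of one SK unit. The sketch below uses made-up attention vectors purely for illustration; they are not values from the paper.

```python
from typing import List


def mean_attention_difference(attn_5x5: List[float], attn_3x3: List[float]) -> float:
    """Mean attention on the 5x5 kernel minus mean attention on the 3x3 kernel."""
    mean_5x5 = sum(attn_5x5) / len(attn_5x5)
    mean_3x3 = sum(attn_3x3) / len(attn_3x3)
    return mean_5x5 - mean_3x3


# Hypothetical per-channel attention for one SK unit (each pair sums to 1).
attn_5x5 = [0.75, 0.5, 0.625]
attn_3x3 = [0.25, 0.5, 0.375]
diff = mean_attention_difference(attn_5x5, attn_3x3)
print(diff)  # positive: this unit leans towards the larger kernel
```

A positive value means the unit puts more weight on the 5×5 branch on average, and a negative value means it favours the 3×3 branch.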

In the early parts of the network, appropriate kernel sizes can be selected according to the semantic awareness of objects’ sizes, which efficiently adjusts the RF sizes of these neurons.

However, such a pattern does not exist in the very high layers like SK 5_3, since in the high-level representation, “scale” is partially encoded in the feature vector, and the kernel size matters less than in lower layers.


