Review — SKNet: Selective Kernel Networks (Image Classification)

Attention Over Branches With Different Kernel Sizes, Outperforming SENet

In the visual cortex, the receptive field (RF) sizes of neurons in the same area (e.g., V1 region) are different, which enables the neurons to collect multi-scale spatial information in the same processing stage.

RF sizes of neurons are not fixed but modulated by stimulus.


1. Selective Kernel Convolution

Selective Kernel Convolution
Image from "[Attention Mechanisms in CV] SKNet: an evolution of SENet"
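The figure's Split–Fuse–Select operator can be sketched in a few lines. Below is a simplified two-branch NumPy version of the Fuse and Select steps only (the branch convolutions themselves are assumed to have already produced `u1` and `u2`); the weight names and the reduced dimension `d` are illustrative, and the real SK unit uses batch-normalized grouped convolutions in each branch:

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax across the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sk_select(u1, u2, w_reduce, w_a, w_b):
    """Fuse-and-Select step of a Selective Kernel unit (sketch).
    u1, u2:    (C, H, W) outputs of the 3x3 and 5x5 branches.
    w_reduce:  (d, C) squeeze FC layer.
    w_a, w_b:  (C, d) per-branch FC layers producing attention logits.
    """
    u = u1 + u2                              # Fuse: element-wise sum of branches
    s = u.mean(axis=(1, 2))                  # global average pooling -> (C,)
    z = np.maximum(w_reduce @ s, 0.0)        # compact feature with ReLU -> (d,)
    logits = np.stack([w_a @ z, w_b @ z])    # (2, C): one logit row per kernel
    attn = softmax(logits, axis=0)           # soft attention across kernel sizes
    # Select: channel-wise weighted sum of the two branches.
    v = attn[0][:, None, None] * u1 + attn[1][:, None, None] * u2
    return v, attn
```

Because the softmax is taken across the two kernels, each channel's attention weights sum to 1, so the unit interpolates between the small-RF and large-RF branches per channel.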

2. SKNet: Network Architecture

Left: ResNeXt-50 with a 32×4d template, Middle: SENet-50 based on the ResNeXt-50 backbone, Right: SKNet-50

3. Experimental Results

Comparisons to the state of the art under roughly identical complexity on the ImageNet validation set

Using slightly fewer parameters, SKNets obtain 0.3%–0.4% gains over their SENet counterparts in both 224×224 and 320×320 evaluations.

Comparisons on the ImageNet validation set when the computational cost of models with more depth/width/cardinality is increased to match that of SKNet

In contrast, SKNet-50 obtains a 1.44% absolute improvement over the baseline ResNeXt-50, which indicates that SK convolution is very efficient.

Top-1 Error (%) vs Number of Parameters on ImageNet validation set
Top-1 errors (%) for Lightweight Models on ImageNet validation set
Top-1 errors (%, average of 10 runs) on CIFAR. SENet-29 and SKNet-29 are both based on ResNeXt-29, 16×32d

4. Ablation Study

Results of SKNet-50 with different settings in the second branch, while the setting of the first kernel is fixed
Results of SKNet-50 with different combinations of multiple kernels

5. Analysis and Interpretation

Attention results for two randomly sampled images with three differently sized targets

In most channels, as the target object enlarges, the attention weight on the large kernel (5×5) increases. This suggests that the RF sizes of the neurons adaptively grow larger, which agrees with the expectation.

Average results over all image instances in the ImageNet validation set
Average mean attention difference (mean attention value of kernel 5×5 minus that of kernel 3×3) on SK units of SKNet-50, for each of 1,000 categories using all validation samples on ImageNet
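The per-category statistic plotted above, i.e., the mean attention of the 5×5 kernel minus that of the 3×3 kernel, is simple to reproduce. A small sketch (the array names and shapes are assumptions for illustration):

```python
import numpy as np

def mean_attention_difference(attn_5x5, attn_3x3):
    """Mean attention difference for one category at one SK unit.
    attn_5x5, attn_3x3: (num_instances, num_channels) softmax attention
    weights; for each instance/channel pair the weights sum to 1 across
    kernels. A positive result means the unit favors the larger 5x5 kernel.
    """
    return float((attn_5x5 - attn_3x3).mean())

# Toy example: attention uniformly skewed toward the 5x5 kernel
# yields a positive difference (approximately 0.2 here).
a5 = np.full((10, 4), 0.6)   # 10 instances, 4 channels
a3 = 1.0 - a5                # complementary weight on the 3x3 kernel
diff = mean_attention_difference(a5, a3)
```

Under the paper's observation, this difference tends to be larger for categories whose objects occupy more of the image, at least in the lower and middle SK units.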

In the early parts of the network, appropriate kernel sizes can be selected according to the semantic awareness of objects' sizes, which efficiently adjusts the RF sizes of these neurons.

However, such a pattern does not exist in very high layers like SK 5_3, since for high-level representations "scale" is already partially encoded in the feature vector, and the kernel size matters less than it does in lower layers.


