Review — BagNet: Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet (Image Classification)

BagNet: Bag-of-Feature (BoF) Models Using ResNet, Better Interpretability

Sik-Ho Tsang
4 min readDec 17, 2021

In this story, Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet, (BagNet), by Eberhard Karls University of Tübingen, Werner Reichardt Centre for Integrative Neuroscience, and Bernstein Center for Computational Neuroscience, is briefly reviewed. In this paper:

  • BagNet is proposed, which is based on ResNet-50, by classifying an image based on the occurrences of small local image features.
  • This strategy is closely related to the bag-of-feature (BoF) models where BoF is a very popular technique before the onset of deep learning.
  • Given the simplicity of BoF models, a bit of accuracy is traded for better interpretability, with applications: Diagnosing failure cases, benchmarking diagnostic tools, or serving as interpretable parts of a computer vision pipeline.

This is a paper in 2019 ICLR with over 300 citations. (Sik-Ho Tsang @ Medium)


  1. Deep Bag-of-features Model (BagNet)
  2. Experimental Results

1. Deep Bag-of-features Model (BagNet)

Deep Bag-of-features Models (BagNets): Network Architecture & Performance

1.1. Network Architecture

  • (A) BagNet: First, a 2048 dimensional feature representation is inferred from each image patch of size q×q pixels using multiple stacked ResNet blocks and apply a linear classifier to infer the class evidence for each patch (heatmaps). One logit heatmap per class is obtained.
  • These heatmaps are averaged across space and passed through a softmax to get the final class probabilities.
The BagNet architecture is almost equivalent to the ResNet-50 architectures
  • The structure differs from ResNets only in the replacement of many 3×3 by 1×1 convolutions, thereby limiting the receptive field size of the topmost convolutional layer to q×q pixels.
  • The resulting architecture as BagNet-q and test q∈[9, 17, 33].
  • The word linear here refers to the combination of a linear spatial aggregation (a simple average) and a linear classifier on top of the aggregated features.

1.2. Performance

  • (B) Top-5 ImageNet performance: It outperforms AlexNet with such small receptive field for each sub-network though it is still not as good as VGG-16.
  • The constraint on local features makes it straight-forward to analyse how exactly each part of the image influences the classification
  • (c) Correlation with logits of VGG-16: The correlation is higher when the q is larger.

2. Experimental Results

2.1. ImageNet

  • BagNets are directly trained on ImageNet. Surprisingly, patch sizes as small as 17×17 pixels suffice to reach AlexNet performance (80.5% top-5 performance) while patches sizes 33×33 pixels suffice to reach close to 87.6%.
  • Across all receptive field sizes BagNets reach around 155 images/s for BagNets compared to 570 images/s for ResNet-50. The difference in runtime can be attributed to the reduced amount of downsampling in BagNets.

2.2. Explaining Decisions

The most important contribution is the explanation of the network.

Heatmaps showing the class evidence extracted from of each part of the image. The spatial sum over the evidence is the total class evidence
  • Clearly, most evidence lies around the shapes of objects (e.g. the crip or the paddles) or certain predictive image features like the glowing borders of the pumpkin.
  • Also, for animals eyes or legs are important. It’s also notable that background features (like the forest in the deer image) are pretty much ignored by the BagNets.
Most informative image patches for BagNets.
  • Top subrow are the patches that caused the highest logit outputs for the given class across all validation images with that label.
  • Bottom subrow are patches with a different label (highlighting errors).

This visualisation yields many insights: For example, book jackets are identified mainly by the text on the cover, leading to confusion with other text on t-shirts or websites.

2.3. Explaining Misclassifications

Images misclassified by BagNet-33 and VGG-16 with heatmaps for true and predicted label and the most predictive image patches. Class probability reported for BagNet-33 (left) and VGG (right)
  • The above figure analyse misclassified images by both BagNet-33 and VGG-16. In the first example the ground-truth class “cleaver” was confused with “granny smith” because of the green cucumber at the top of the image.

The letters in the “miniskirt” image are very salient, thus leading to the “book jacket” prediction.

  • (There are many other analyses in the paper. Please feel free to read the paper.)



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.