Brief Review — SNIP: Single-Shot Network Pruning Based on Connection Sensitivity

One of the Popular Model Pruning Papers

Sik-Ho Tsang
5 min read · Sep 26, 2024

SNIP: Single-Shot Network Pruning Based on Connection Sensitivity
SNIP, by University of Oxford
2019 ICLR, Over 1200 Citations (Sik-Ho Tsang @ Medium)


  • Single-Shot Network Pruning (SNIP) is proposed, which prunes a given network once at initialization prior to training.
  • To achieve this, a saliency criterion based on connection sensitivity is introduced that identifies structurally important connections in the network for the given task.
  • This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations.

Outline

  1. Network Pruning Preliminaries
  2. Single-Shot Network Pruning (SNIP)
  3. Results

1. Network Pruning Preliminaries

  • In network pruning, given a large reference neural network, the goal is to learn a much smaller subnetwork that mimics the performance of the reference network.
  • Given a dataset D = {(xi, yi)}, i = 1, …, n, and a desired sparsity level k (i.e., the number of non-zero weights), neural network pruning can be written as the following constrained optimization problem:

min_w L(w; D) = min_w (1/n) Σi l(w; (xi, yi)),  s.t.  w ∈ R^m,  ||w||0 ≤ k

  • Here, l(.) is the standard loss function (e.g., cross-entropy loss), w is the set of parameters of the neural network, n is the number of training examples, m is the total number of parameters, and ||.||0 is the standard L0 norm.
  • The conventional approach to optimizing the above problem is to add sparsity-enforcing penalty terms.
  • Others use saliency-based methods. Popular criteria include the magnitude of the weights (weights below a certain threshold are considered redundant) and the Hessian of the loss with respect to the weights (the higher the Hessian value, the more important the parameter).

Despite being popular, both of these criteria depend on the scale of the weights; they therefore require pretraining and are very sensitive to architectural choices.

Furthermore, pruning and the optimization steps are alternated many times throughout training, resulting in highly expensive prune-retrain cycles.
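As a concrete illustration of the magnitude criterion above, here is a minimal sketch (my own, not from the paper; the keep ratio is an assumed hyperparameter). In the classical pipeline, such a mask would be computed on a pretrained network and followed by retraining, which is exactly the cost SNIP aims to avoid.

```python
import torch

def magnitude_mask(weight: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep only the largest-magnitude fraction of weights; mark the rest redundant."""
    k = max(1, int(keep_ratio * weight.numel()))                     # weights to keep
    threshold = torch.topk(weight.abs().flatten(), k).values.min()   # k-th largest |w|
    return (weight.abs() >= threshold).float()                       # 1 = keep, 0 = prune
```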

2. Single-Shot Network Pruning (SNIP)

Auxiliary indicator variables c ∈ {0, 1}^m are introduced, representing the connectivity of the parameters w. Now, given the sparsity level k, the above problem can be correspondingly modified as:

min_{c,w} L(c ⊙ w; D) = min_{c,w} (1/n) Σi l(c ⊙ w; (xi, yi)),  s.t.  w ∈ R^m,  c ∈ {0, 1}^m,  ||c||0 ≤ k

  • where ⊙ denotes the Hadamard product.

Since the weight of each connection (w) is now separated from whether the connection is present or not (c), the importance of each connection can be determined by measuring its effect on the loss function: cj = 1 means that connection j is active, and cj = 0 means that connection j is pruned.

  • Precisely, the effect of removing connection j can be measured by:

ΔLj(w; D) = L(1 ⊙ w; D) − L((1 − ej) ⊙ w; D)

  • where ej is the indicator vector of element j (i.e., zeros everywhere except at index j, where it is one) and 1 is the all-ones vector of dimension m.
  • By relaxing the binary constraint on the indicator variables c, ΔLj can be approximated by the derivative of L with respect to cj, denoted gj(w; D). Hence, the effect of connection j on the loss can be written as:

ΔLj(w; D) ≈ gj(w; D) = ∂L(c ⊙ w; D)/∂cj |c=1 = lim δ→0 [L((c + δ·ej) ⊙ w; D) − L(c ⊙ w; D)] / δ |c=1

The magnitude of the derivatives gj is taken as the saliency criterion. If the magnitude of the derivative is high (regardless of its sign), it essentially means that the connection cj has a considerable effect on the loss (either positive or negative), so it has to be preserved to allow learning on wj.

  • Therefore, the connection sensitivity is defined as the normalized magnitude of the derivatives:

sj = |gj(w; D)| / Σ_{i=1..m} |gi(w; D)|

  • Once the sensitivities are computed, only the top-k connections are retained. Precisely, the indicator variables c are set as follows:

cj = 1[sj − ~sk ≥ 0],  ∀ j ∈ {1, …, m}

  • where ~sk is the k-th largest element in the vector of sensitivities s and 1[.] is the indicator function.
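As a toy illustration with assumed numbers (not from the paper): for m = 5 connections with sensitivities s = (0.05, 0.40, 0.10, 0.30, 0.15) and k = 2, the 2nd-largest sensitivity is ~sk = 0.30, so c = (0, 1, 0, 1, 0) and only connections 2 and 4 are retained.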
Algorithm 1: Single-Shot Network Pruning (SNIP)

In the experiments, the authors show that using only a single mini-batch with a reasonable number of training examples can lead to effective pruning, as shown in Algorithm 1 above.
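Below is a minimal PyTorch sketch of this single-shot procedure (my own illustration, not the authors' code; the model, data, and sparsity value are assumptions). It relies on the chain-rule identity that, at c = 1, gj = ∂L(c ⊙ w; D)/∂cj = wj · ∂L/∂wj, so one ordinary forward/backward pass on a single mini-batch suffices to compute all sensitivities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def snip_prune_masks(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                     sparsity: float = 0.95) -> dict:
    """Return {parameter name: binary mask} computed from one mini-batch at initialization."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)              # loss on the single mini-batch
    loss.backward()

    # Connection sensitivity s_j = |g_j| / sum_i |g_i|, with g_j = w_j * dL/dw_j at c = 1.
    scores = {name: (p.grad * p).abs()
              for name, p in model.named_parameters()
              if p.grad is not None and p.dim() > 1}  # prune weight matrices, not biases
    total = sum(s.sum() for s in scores.values())     # normalization constant
    all_scores = torch.cat([s.flatten() for s in scores.values()]) / total

    keep = max(1, int((1.0 - sparsity) * all_scores.numel()))   # number of connections kept
    threshold = torch.topk(all_scores, keep).values.min()       # k-th largest sensitivity
    return {name: (s / total >= threshold).float() for name, s in scores.items()}

# Example usage with a throwaway MLP and random MNIST-shaped data (assumed for illustration):
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 300), nn.ReLU(),
                    nn.Linear(300, 100), nn.ReLU(), nn.Linear(100, 10))
x, y = torch.randn(128, 1, 28, 28), torch.randint(0, 10, (128,))
masks = snip_prune_masks(net, x, y, sparsity=0.95)
# Training then proceeds as usual while each weight tensor stays multiplied by its mask.
```

Note that in the paper the initial weights are sampled with variance scaling initialization before the sensitivities are computed.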

3. Results

3.1. Pruning Performance

Pruning Performance
  • Two standard networks are considered for pruning: LeNet-300-100 and LeNet-5-Caffe. LeNet-300-100 consists of three fully-connected (fc) layers with 267k parameters, and LeNet-5-Caffe consists of two convolutional (conv) layers and two fc layers with 431k parameters.
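For reference, a sketch of LeNet-300-100 as described above; this PyTorch definition is an assumption on my part, but it matches the stated parameter count.

```python
import torch.nn as nn

# Three fully-connected layers on 28x28 inputs, 10 output classes.
lenet_300_100 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)
# Parameters: 784*300 + 300 + 300*100 + 100 + 100*10 + 10 = 266,610 ≈ 267k.
```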

The pruned sparse LeNet-300-100 achieves performance similar to the reference network (sparsity ~k = 0), with only negligible loss at ~k = 90%. For LeNet-5-Caffe, the performance degradation is nearly invisible.

3.2. SOTA Comparisons

SOTA Comparisons
  • Direct comparisons are difficult, as the experimental settings of prior methods differ.

SNIP achieves errors that are comparable to the reference model, degrading by approximately 0.7% and 0.3% while pruning 98% and 99% of the parameters in LeNet-300-100 and LeNet-5-Caffe, respectively.

Compared to the above approaches, SNIP adds almost no cost, since it requires neither pretraining nor prune-retrain cycles.

3.3. Modern Models

Modern Models such as AlexNet, VGGNet, WRN, LSTM and GRU

Overall, the SNIP approach prunes a substantial number of parameters in a variety of network models with minimal or no loss in accuracy (< 1%).

  • To the best of the authors’ knowledge, they are the first to demonstrate pruning of convolutional, residual, and recurrent networks to extreme sparsity levels without requiring additional hyperparameters or modifying the pruning procedure.

3.4. Visualizations

Visualizations

Important connections seem to reconstruct either the complete image (MNIST) or the silhouette (Fashion-MNIST) of the input class.

  • (Different weight initializations also affect the results. Please feel free to read the paper directly if interested.)

