# Brief Review — SNIP: Single-Shot Network Pruning Based on Connection Sensitivity

## One of the Popular Model Pruning Papers

SNIP: Single-Shot Network Pruning Based on Connection Sensitivity, by University of Oxford

SNIP, 2019 ICLR, Over 1200 Citations (Sik-Ho Tsang @ Medium)


- **Single-Shot Network Pruning (SNIP)** is proposed, which **prunes a given network once at initialization prior to training.**
- To achieve this, **a saliency criterion based on connection sensitivity is introduced** that identifies structurally important connections in the network for the given task.
- **This eliminates the need for both pretraining and the complex pruning schedule**, while making it robust to architecture variations.

# Outline

1. **Network Pruning Preliminaries**
2. **Single-Shot Network Pruning (SNIP)**
3. **Results**

# 1. Network Pruning Preliminaries

- In network pruning, given a large reference neural network, the goal is to learn a much smaller subnetwork that mimics the performance of the reference network.
- Given **a dataset** *D* of {*xi*, *yi*} and **a desired sparsity level** *k* (i.e., the number of non-zero weights), neural network pruning can be written as the following constrained optimization problem:
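
Transcribed in LaTeX (following the paper's notation; *n* is the number of training examples), this constrained problem is roughly:

```
\min_{w} \; L(w; D) = \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(w; (x_i, y_i)\big),
\quad \text{s.t.} \quad w \in \mathbb{R}^m, \; \|w\|_0 \le k
```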

- Here, *l*(·) is the **standard loss function** (e.g., cross-entropy loss), *w* is **the set of parameters of the neural network**, *m* is the **total number of parameters**, and ||·||0 is the **standard L0 norm**.
- The conventional approach to optimizing the above problem is by **adding sparsity enforcing penalty terms.**
- Some works use **saliency based methods**. Popular criteria include the magnitude of the weights, i.e., **weights below a certain threshold are redundant**, and the Hessian of the loss with respect to the weights, i.e., the **higher the value of the Hessian, the higher the importance of the parameters.**

Despite being popular, both of these criteria depend on the scale of the weights, and in turn require pretraining and are very sensitive to architectural choices. Furthermore, pruning and the optimization steps are alternated many times throughout training, resulting in highly expensive prune-retrain cycles.

# 2. Single-Shot Network Pruning (SNIP)

- Auxiliary indicator variables *c* ∈ {0, 1}^*m* are introduced, which represent the connectivity of parameters *w*. Now, given the sparsity level *k*, the above equation can be correspondingly modified as below:
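
Transcribed in LaTeX (following the paper's formulation), this modified problem is roughly:

```
\min_{c, w} \; L(c \odot w; D),
\quad \text{s.t.} \quad w \in \mathbb{R}^m, \; c \in \{0, 1\}^m, \; \|c\|_0 \le k
```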

- where ⊙ denotes the Hadamard product.

The weight of the connection (*w*) is separated from whether the connection is present or not (*c*): *cj* = 1 means the connection *j* is active and *cj* = 0 means the connection *j* is pruned. Thanks to this separation, we may be able to determine the importance of each connection by measuring its effect on the loss function.

- Precisely, the effect of removing connection *j* can be measured by:
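
In LaTeX form (my transcription of the paper's notation), this difference is:

```
\Delta L_j(w; D) = L(\mathbf{1} \odot w; D) - L\big((\mathbf{1} - e_j) \odot w; D\big)
```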

- where *ej* is the **indicator vector of element** *j* (i.e., zeros everywhere except at the index *j*, where it is one) and **1** is **the vector of ones of dimension** *m*.
- By relaxing the binary constraint on the indicator variables *c*, Δ*Lj* can be approximated by the derivative of *L* with respect to *cj*, which is denoted as *gj*(*w*; *D*). Hence, **the effect of connection** *j* **on the loss** can be written as:
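
Transcribed in LaTeX, this approximation reads:

```
\Delta L_j(w; D) \approx g_j(w; D)
= \left. \frac{\partial L(c \odot w; D)}{\partial c_j} \right|_{c = \mathbf{1}}
```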

The magnitude of the derivatives *gj* is taken as the **saliency criterion**: if the magnitude of the derivative is high (regardless of the sign), it essentially means that the connection *cj* has a considerable effect on the loss (either positive or negative), and it has to be preserved to allow learning on *wj*.

- Therefore, the **connection sensitivity** is defined as **the normalized magnitude of the derivatives:**
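
In LaTeX (transcribed; the sum runs over all *m* connections):

```
s_j = \frac{|g_j(w; D)|}{\sum_{i=1}^{m} |g_i(w; D)|}
```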

- **Once the sensitivity is computed, only the top-***k* **connections are retained.** Precisely, the indicator variables *c* are set as follows:
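
Transcribed in LaTeX:

```
c_j = \mathbb{1}\left[\, s_j - \tilde{s}_k \ge 0 \,\right], \quad \forall j \in \{1, \dots, m\}
```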

- where ~*sk* is the *k*-th largest element in the vector *s* and 1[·] is the indicator function.

In the experiments, the authors show that using only one mini-batch of a reasonable number of training examples can lead to effective pruning, as described in Algorithm 1 of the paper.
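
As a rough illustration, here is a minimal PyTorch-style sketch of the single-shot procedure (not the authors' implementation; `model`, `loss_fn`, `inputs`, `targets`, and `sparsity` are placeholder names). It relies on the fact that the derivative of the loss with respect to *cj*, evaluated at *c* = **1**, equals the weight multiplied by its gradient:

```python
# Minimal sketch of single-shot pruning via connection sensitivity (SNIP-style).
# Assumptions: a PyTorch model with Conv2d/Linear layers, a loss function,
# and a single mini-batch (inputs, targets) sampled from the training set.
import torch
import torch.nn as nn


def snip_masks(model, loss_fn, inputs, targets, sparsity):
    """Return per-layer keep masks computed from one mini-batch at initialization."""
    # Weights considered for pruning (conv and fully-connected layers here).
    weights = [m.weight for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]

    # One forward/backward pass on the single mini-batch.
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, weights)

    # g_j = dL/dc_j at c = 1 equals (dL/dw_j) * w_j, so the saliency is
    # |grad * weight|, normalized over all prunable connections.
    scores = torch.cat([(g * w).abs().flatten() for g, w in zip(grads, weights)])
    scores = scores / scores.sum()

    # Keep the top-k connections, where k is the number of weights to retain.
    k = int((1.0 - sparsity) * scores.numel())
    threshold = torch.topk(scores, k, sorted=True).values[-1]

    masks, offset = [], 0
    for w in weights:
        n = w.numel()
        masks.append((scores[offset:offset + n].view_as(w) >= threshold).float())
        offset += n
    return masks
```

After the masks are obtained, they stay fixed and the surviving sparse network is trained in the standard way, e.g. by multiplying each weight tensor by its mask in the forward pass.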

# 3. Results

## 3.1. Pruning Performance

**Two standard networks are considered for pruning, LeNet-300–100 and LeNet-5-Caffe.** LeNet-300–100 consists of three fully-connected (fc) layers with 267k parameters, and LeNet-5-Caffe consists of two convolutional (conv) layers and two fc layers with 431k parameters.

The pruned sparse LeNet-300–100 achieves performance similar to the reference (sparsity level ~*k* = 0), with only a negligible loss at ~*k* = 90. For LeNet-5-Caffe, the performance degradation is nearly invisible.

## 3.2. SOTA Comparisons

- It is difficult to make direct comparisons with prior pruning approaches.

SNIP achieves errors that are comparable to the reference model, degrading by approximately 0.7% and 0.3% while pruning 98% and 99% of the parameters in LeNet-300–100 and LeNet-5-Caffe, respectively. Compared to the above approaches, SNIP seems to cost almost nothing.

## 3.3. Modern Models

Overall, the SNIP approach prunes a substantial amount of parameters in a variety of network models with minimal or no loss in accuracy (< 1%).

- To the best of the authors' knowledge, **they are the first to demonstrate pruning of convolutional, residual and recurrent networks to extreme sparsities without requiring additional hyperparameters or modifying the pruning procedure.**

## 3.4. Visualizations

Important connections seem to reconstruct either the complete image (MNIST) or the silhouettes (Fashion-MNIST) of the input class.

- (**Different weight initializations** also affect the results. Please feel free to read the paper directly if interested.)