# Review — AdderNet: Do We Really Need Multiplications in Deep Learning? (Image Classification)

## Using Addition Instead of Multiplication for Convolution, Lower Latency Than the Conventional CNN

In this story, **AdderNet: Do We Really Need Multiplications in Deep Learning?** (AdderNet), by Peking University, Huawei Noah’s Ark Lab, and The University of Sydney, is reviewed.

In this paper:

- **AdderNet**, using additions, takes the *l*1-norm distance between the filters and the input features as the output response.
- Compared with multiplications, **additions are much cheaper and reduce the computation costs.**

This is a paper in **2020 CVPR** with over **20 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **AdderNet Convolution**
2. **Other Concerning Issues: BN, Derivatives, Learning Rate**
3. **Experimental Results**

# 1. AdderNet Convolution

## 1.1. Generalized Filters

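- Generally, the output feature *Y* indicates the similarity between the filter and the input feature:

$$Y(m,n,t)=\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} S\big(X(m+i,n+j,k),\,F(i,j,k,t)\big)$$

- where *S* is the similarity measure, *X* is the input feature, *F* is the filter with kernel size *d* and *c_in* input channels, and *t* indexes the output channel.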
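The generalized filter described above can be sketched as a sliding-window sum of *S*(*x*, *f*) over every filter position. This is a minimal pure-Python sketch, assuming stride 1, no padding and a single output channel; `filter_response`, `cross_correlation` and `neg_l1` are hypothetical names chosen for illustration, not from the paper.

```python
from typing import Callable, List

Tensor3 = List[List[List[float]]]  # indexed as [row][col][channel]


def filter_response(X: Tensor3, F: Tensor3,
                    S: Callable[[float, float], float]) -> List[List[float]]:
    """Slide one d x d x c filter F over X and sum S(x, f) at each position.

    Assumptions for this sketch: stride 1, no padding, one output channel.
    """
    d = len(F)            # kernel size
    c = len(F[0][0])      # number of input channels
    H, W = len(X), len(X[0])
    out = []
    for m in range(H - d + 1):
        row = []
        for n in range(W - d + 1):
            total = 0.0
            for i in range(d):
                for j in range(d):
                    for k in range(c):
                        total += S(X[m + i][n + j][k], F[i][j][k])
            row.append(total)
        out.append(row)
    return out


def cross_correlation(x: float, f: float) -> float:
    return x * f          # multiplication-based similarity (standard CNN)


def neg_l1(x: float, f: float) -> float:
    return -abs(x - f)    # addition/subtraction-based similarity (adder filter)


# 3x3 single-channel input and a 2x2 single-channel filter
X = [[[1.0], [2.0], [0.0]],
     [[0.0], [1.0], [2.0]],
     [[3.0], [0.0], [1.0]]]
F = [[[1.0], [2.0]],
     [[0.0], [1.0]]]

conv_out = filter_response(X, F, cross_correlation)   # [[6.0, 4.0], [2.0, 6.0]]
adder_out = filter_response(X, F, neg_l1)             # [[0.0, -5.0], [-6.0, 0.0]]
```

Only the choice of `S` changes between the two calls; the loop structure, and hence the number of operations, stays the same.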

## 1.2. Standard Convolution Using Multiplication

- If **cross-correlation** is taken as the metric of distance, **multiplication** is used, i.e. *S*(*x*, *y*) = *x* × *y*. It becomes **convolution**.

## 1.3. AdderNet Convolution Using Addition

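- If **addition** is used, the *l*1 distance between the filter and the input feature is calculated:

$$Y(m,n,t)=-\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \left|X(m+i,n+j,k)-F(i,j,k,t)\right|$$

- With the help of the *l*1 distance, similarity between the filters and features can be efficiently computed.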

Addition is computationally much cheaper than multiplication.

Intuitively, the above equation has a connection with **template matching** in computer vision, which aims to find the parts of an image that match the template.
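To make the template-matching intuition concrete, here is a small sketch of an adder filter on a single-channel image. It assumes stride 1 and no padding; `adder_response`, `image` and `template` are illustrative names and toy values, not taken from the paper. The response is the negative *l*1 distance, so it is 0 where the patch equals the filter and negative everywhere else.

```python
def adder_response(X, F):
    """Y[m][n] = -sum_{i,j} |X[m+i][n+j] - F[i][j]| (single channel, stride 1)."""
    d = len(F)
    H, W = len(X), len(X[0])
    return [[-sum(abs(X[m + i][n + j] - F[i][j])
                  for i in range(d) for j in range(d))
             for n in range(W - d + 1)]
            for m in range(H - d + 1)]


# Toy image that contains the template exactly once, at offset (1, 1).
image = [[0, 0, 0, 0],
         [0, 5, 9, 0],
         [0, 7, 3, 0],
         [0, 0, 0, 0]]
template = [[5, 9],
            [7, 3]]

Y = adder_response(image, template)

# The best match is the position with the largest (least negative) response.
positions = [(m, n) for m in range(len(Y)) for n in range(len(Y[0]))]
best = max(positions, key=lambda p: Y[p[0]][p[1]])   # (1, 1)
```

Note that only subtractions, absolute values and additions are used; no multiplication appears in the filter itself.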

# 2. Other Concerning Issues: BN, Derivatives, Learning Rate

## 2.1. Batch Normalization (BN)

- After addition, **batch normalization (BN)** is used to normalize *Y* to an appropriate range, so that all the **activation functions** used in conventional CNNs can then be used in the proposed AdderNets.
- Although the BN layer involves multiplications, its computational cost is significantly lower than that of the convolutional layers, so it can be neglected.
- (Will there be any BN just using addition in the future?)
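A quick way to see why BN matters here: a negative *l*1 distance is never positive, so a ReLU applied directly to the adder outputs would zero out everything. The sketch below assumes a plain per-channel BN using batch statistics, with the usual scale `gamma` and shift `beta`; the input values are made up for illustration.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of activations to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(values):
    return [max(0.0, v) for v in values]


# Hypothetical adder-filter outputs: all are <= 0 by construction.
adder_outputs = [-23.0, -22.0, -19.0, -20.0, 0.0, -20.0, -17.0, -18.0, -21.0]

relu_without_bn = relu(adder_outputs)             # every entry is 0.0
relu_with_bn = relu(batch_norm(adder_outputs))    # some entries stay positive
```

After normalization the activations are centered around zero, so standard activation functions behave as they do in a conventional CNN.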

## 2.2. Derivatives

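- The derivative of the *l*1-norm is a sign function, which only takes the values +1, 0 or -1 and is not good for gradient descent. Thus, the derivative of the *l*2-norm is considered:

$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i,n+j,k) - F(i,j,k,t)$$

- By utilizing the full-precision gradient, the filters can be updated precisely.
- To prevent gradients from exploding, the gradient of *X* is clipped to [-1, 1].
- Then the partial derivative of output features *Y* with respect to the input features *X* is calculated as:

$$\frac{\partial Y(m,n,t)}{\partial X(m+i,n+j,k)} = \mathrm{HT}\big(F(i,j,k,t) - X(m+i,n+j,k)\big)$$

- where *HT* is the HardTanh function:

$$\mathrm{HT}(x) = \begin{cases} x, & -1 < x < 1 \\ 1, & x \ge 1 \\ -1, & x \le -1 \end{cases}$$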
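The three gradient rules discussed above can be compared on a single input/filter pair. This is a sketch of the local derivatives only (in training, the chain rule would multiply them by the upstream gradient); the function names are hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)


def hard_tanh(v):
    """Clip v to the range [-1, 1]."""
    return max(-1.0, min(1.0, v))


def grad_filter_sign(x, f):
    """Exact derivative of -|x - f| w.r.t. f: only the sign, no magnitude."""
    return float(sign(x - f))


def grad_filter_full_precision(x, f):
    """Full-precision gradient used instead: keeps the magnitude x - f."""
    return x - f


def grad_input_clipped(x, f):
    """Gradient w.r.t. the input x, clipped to [-1, 1] with HardTanh."""
    return hard_tanh(f - x)


# Far apart: the full-precision gradient keeps the magnitude, the sign does not.
far = (grad_filter_sign(0.5, 3.0),
       grad_filter_full_precision(0.5, 3.0),
       grad_input_clipped(0.5, 3.0))        # (-1.0, -2.5, 1.0)

# Close together: the input gradient is small enough to pass through unclipped.
near = (grad_filter_sign(0.5, 0.75),
        grad_filter_full_precision(0.5, 0.75),
        grad_input_clipped(0.5, 0.75))      # (-1.0, -0.25, 0.25)
```

The sign gradient is identical in both cases, which illustrates why it carries less information for updating the filters.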

## 2.3. Adaptive Learning Rate

- As shown in the paper's table of gradient norms, the norms of the filter gradients in AdderNets are much smaller than those in CNNs, which could slow down the update of filters in AdderNets.
- An adaptive learning rate for different layers in AdderNets is used:

$$\Delta F_l = \gamma \times \alpha_l \times \Delta L(F_l)$$

- where *γ* is a global learning rate of the whole neural network (e.g. for adder and BN layers), Δ*L*(*Fl*) is the gradient of the filter in layer *l*, and *αl* is its corresponding local learning rate.
- The local learning rate can therefore be defined as:

$$\alpha_l = \frac{\eta \sqrt{k}}{\lVert \Delta L(F_l) \rVert_2}$$

- where *k* denotes the number of elements in *Fl*, and *η* is a hyper-parameter to control the learning rate of adder filters.
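The effect of the local learning rate can be checked numerically. The sketch below assumes the local rate is *η*√*k* divided by the *l*2-norm of the layer's gradient, and that the update is the product of global rate, local rate and gradient; the function names and the values of `gamma` and `eta` are illustrative only, and a zero gradient is not handled.

```python
import math


def local_learning_rate(grad, eta):
    """alpha_l = eta * sqrt(k) / ||grad||_2, with k the number of filter elements."""
    k = len(grad)
    norm = math.sqrt(sum(g * g for g in grad))
    return eta * math.sqrt(k) / norm


def adder_filter_update(grad, gamma, eta):
    """Delta F_l = gamma * alpha_l * grad (flattened filter gradient)."""
    alpha = local_learning_rate(grad, eta)
    return [gamma * alpha * g for g in grad]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


gamma, eta = 0.1, 0.1                       # illustrative values
small_grad = [0.003, -0.004, 0.0, 0.0]      # tiny gradient, as in an adder layer
large_grad = [3.0, -4.0, 0.0, 0.0]          # same direction, 1000x larger

step_small = adder_filter_update(small_grad, gamma, eta)
step_large = adder_filter_update(large_grad, gamma, eta)
# Both steps have norm gamma * eta * sqrt(k) = 0.1 * 0.1 * 2 = 0.02.
```

Because the gradient is divided by its own norm, the step size per layer no longer depends on how small the raw gradient is, only on *γ*, *η* and the filter size *k*.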

# 3. Experimental Results

## 3.1. MNIST

- LeNet-5-BN is trained.
- **CNN** achieves a **99.4% accuracy** with **435K multiplications** and **435K additions**.
- By replacing the multiplications in convolution with additions, the proposed **AdderNet** achieves a **99.4% accuracy**, which is the same as that of CNNs, with **870K additions** and **almost no multiplication**.
- In fact, the theoretical latency of multiplications in CPUs is also larger than that of additions and subtractions.
- For example, in the VIA Nano 2000 series, the latency of float multiplication and addition is 4 and 2, respectively. The **AdderNet** using the LeNet-5 model will have **1.7M latency** while **CNN** will have **2.6M latency** on this CPU.
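The latency figures follow from simple arithmetic on the operation counts and per-operation latencies quoted above. In this sketch, "almost no multiplication" is approximated as zero multiplications, which is an assumption made for the calculation.

```python
MUL_LATENCY = 4   # float multiplication latency quoted for the VIA Nano 2000 series
ADD_LATENCY = 2   # float addition latency quoted for the same CPU


def theoretical_latency(num_mul, num_add):
    """Total latency if every operation is counted at its quoted latency."""
    return num_mul * MUL_LATENCY + num_add * ADD_LATENCY


cnn_latency = theoretical_latency(num_mul=435_000, num_add=435_000)   # 2_610_000, about 2.6M
adder_latency = theoretical_latency(num_mul=0, num_add=870_000)       # 1_740_000, about 1.7M
```

Although both networks perform the same total number of operations (870K), trading multiplications for additions reduces the theoretical latency by about one third.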

## 3.2. CIFAR

- **Binary neural networks (BNN)**, which can use **XNOR** operations to replace multiplications, are also used for comparison.
- **For the VGG-small model, AdderNets achieve nearly the same results (93.72% in CIFAR-10 and 72.64% in CIFAR-100) as CNNs (93.80% in CIFAR-10 and 72.73% in CIFAR-100) with no multiplication.**
- Although the model size of BNN is much smaller than those of AdderNet and CNN, its accuracies are much lower (89.80% in CIFAR-10 and 65.41% in CIFAR-100).
- As for ResNet-20, CNNs achieve the highest accuracy (i.e. 92.25% in CIFAR-10 and 68.14% in CIFAR-100) but with a large number of multiplications (41.17M).
- **The proposed AdderNets achieve a 91.84% accuracy in CIFAR-10 and a 67.60% accuracy in CIFAR-100 without multiplications, which is comparable with CNNs.**
- In contrast, the BNNs only achieve 84.87% and 54.14% accuracies in CIFAR-10 and CIFAR-100.
- **The results in ResNet-32 also suggest that the proposed AdderNets can achieve similar results to conventional CNNs.**

## 3.3. ImageNet

- CNN achieves a 69.8% top-1 accuracy and an 89.1% top-5 accuracy in ResNet-18. However, there are 1.8G multiplications.
- **AdderNet achieves a 66.8% top-1 accuracy and an 87.4% top-5 accuracy in ResNet-18, which demonstrates that the adder filters can extract useful information from images.**
- Although the BNN can achieve high speed-up and compression ratio, it achieves only a 51.2% top-1 accuracy and a 73.2% top-5 accuracy in ResNet-18.
- Similar results are observed for the deeper ResNet-50.

## 3.4. Visualization Results

- A LeNet++ is trained on the MNIST dataset, which has six convolutional layers and a fully-connected layer for extracting powerful 3D features.
- The numbers of neurons in the six convolutional layers and the fully-connected layer are 32, 32, 64, 64, 128, 128, and 2, respectively.
- AdderNets utilize the *l*1-norm to distinguish different classes. The features tend to be clustered towards different class centers.
- The visualization results demonstrate that the proposed AdderNets have a discrimination ability similar to that of CNNs for classifying images.

- The filters of the proposed AdderNets still share some similar patterns with convolution filters.
- The visualization experiments further demonstrate that the filters of AdderNets can effectively extract useful information from the input images and features.

- The distribution of weights with AdderNets is close to a Laplace distribution while that with CNNs looks more like a Gaussian distribution. In fact, the prior distribution of the *l*1-norm is the Laplace distribution.
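The link between the norm and the prior can be written out in one line. The following is the standard textbook relation rather than a derivation from the paper: taking the negative log of a zero-mean Laplace density gives an *l*1 penalty on the weight, while a zero-mean Gaussian gives an *l*2 penalty. The scale parameters *b* and *σ* are symbols introduced here for illustration.

```latex
% Laplace prior  ->  l1 penalty
p(w) = \frac{1}{2b}\exp\!\left(-\frac{|w|}{b}\right)
\quad\Longrightarrow\quad
-\log p(w) = \frac{|w|}{b} + \log(2b)

% Gaussian prior  ->  l2 penalty
p(w) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{w^{2}}{2\sigma^{2}}\right)
\quad\Longrightarrow\quad
-\log p(w) = \frac{w^{2}}{2\sigma^{2}} + \frac{1}{2}\log\!\left(2\pi\sigma^{2}\right)
```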

## 3.5. Ablation Study

- The AdderNets using adaptive learning rate (ALR) and increased learning rate (ILR) achieve 97.99% and 97.72% accuracy with the **sign gradient**, which is **much lower than the accuracy of CNN (99.40%).**
- Therefore, the authors propose the full-precision gradient to precisely update the weights in AdderNets.
- As a result, the AdderNet with ILR achieves a 98.99% accuracy using the **full-precision gradient**. By using the **adaptive learning rate (ALR)**, the **AdderNet can achieve a 99.40% accuracy**, which demonstrates the effectiveness of the proposed ALR method.

## Reference

[2020 CVPR] [AdderNet]

AdderNet: Do We Really Need Multiplications in Deep Learning?

## Image Classification

- **1989–1998**: [LeNet]
- **2012–2014**: [AlexNet & CaffeNet] [Dropout] [Maxout] [NIN] [ZFNet] [SPPNet] [Distillation]
- **2015**: [VGGNet] [Highway] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2]
- **2016**: [SqueezeNet] [Inception-v3] [ResNet] [Pre-Activation ResNet] [RiR] [Stochastic Depth] [WRN] [Trimps-Soushen]
- **2017**: [Inception-v4] [Xception] [MobileNetV1] [Shake-Shake] [Cutout] [FractalNet] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [IGCNet / IGCV1] [Deep Roots]
- **2018**: [RoR] [DMRNet / DFN-MR] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [ENAS] [PNASNet] [ShuffleNet V2] [BAM] [CBAM] [MorphNet] [NetAdapt] [mixup] [DropBlock] [Group Norm (GN)]
- **2019**: [ResNet-38] [AmoebaNet] [ESPNetv2] [MnasNet] [Single-Path NAS] [DARTS] [ProxylessNAS] [MobileNetV3] [FBNet] [ShakeDrop] [CutMix] [MixConv] [EfficientNet] [ABN] [SKNet] [CB Loss]
- **2020**: [Random Erasing (RE)] [SAOL] [AdderNet]