# Review — RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

## RepMLP, Formulate Conv to FC, Absorb **BN**, Merge Branches into One

RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, RepMLP, by Tsinghua University, MEGVII Technology, and Aberystwyth University, 2022 CVPR, Over 50 Citations (Sik-Ho Tsang @ Medium)

Image Classification, MLP



- **RepMLP** is proposed, which is composed of **a series of fully-connected (FC) layers.**
- **A structural re-parameterization technique** is used to **add local prior into an FC** to make it powerful for image recognition.
- Specifically, **convolutional layers** are constructed inside a RepMLP **during training** and **merged into the FC for inference.**

# Outline

1. **Conversion of Conv to MLP/FC**
2. **RepMLP**
3. **Results**

# 1. Conversion of Conv to MLP/FC

- The aim of RepMLP is to **use MLP/FC only**, and **NOT to use Conv**, inside the block at inference time.

A training-time RepMLP is composed of three parts, termed Global Perceptron, Partition Perceptron and Local Perceptron.

A training-time RepMLP is converted into three FC layers for inference, where the key is a simple, platform-agnostic and differentiable method for merging a conv into an FC.

In brief, the authors want to convert conv to FC using a reshaping trick. Since an FC over the whole image needs a lot of parameters, patches are used so that a small FC suffices. BN is absorbed using a merging trick.

With patches, correlation between patches cannot be utilized, so Global Perceptron, using FC1 and FC2, is introduced.

Partition Perceptron is the FC3 for the partition maps (patches after Global Perceptron), and Local Perceptron is a set of convs with different receptive fields but treated as FC by the reshaping trick. (There should be more intuitions behind; please read the paper directly.)

## 1.1. Conv to FC Formulation

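- For **convolution**, an **input tensor *M*(*in*)** of shape (*N*, *C*, *H*, *W*) goes through a *K*×*K* conv:

*M*(*out*) = CONV(*M*(*in*), *F*, *p*) (1)

- where *M*(*out*) is the output tensor, *F* is the conv kernel, and *p* is the pad size.
- For a fully connected (**FC**) layer, let *P* and *Q* be the input and output dimensions, and *V*(*in*) and *V*(*out*) be the **input** and **output**, respectively. The **kernel** is *W*, of shape (*Q*, *P*), and the **matrix multiplication (MMUL)** is formulated as:

*V*(*out*) = MMUL(*V*(*in*), *W*) = *V*(*in*) · *W*ᵀ (2)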

- When an **FC is used in place of the conv in (1)**, its input is a feature map rather than the vectors of (2), so **RS** (short for **“reshape”**) is applied.
- The input is **first flattened** into *N* vectors of length *CHW*, which is *V*(*in*) = RS(*M*(*in*), (*N*, *CHW*)), **multiplied by the kernel** *W*(*OHW*, *CHW*), then the output *V*(*out*)(*N*, *OHW*) is **reshaped back** into *M*(*out*)(*N*, *O*, *H*, *W*). (Reshaping is cost-free.) For better readability, RS is omitted:

*M*(*out*) = MMUL(*M*(*in*), *W*)
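The flatten, multiply, reshape-back steps above can be sketched in a few lines. This is only an illustration, assuming NumPy and small made-up sizes; the variable names are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W, O = 2, 3, 4, 4, 5            # illustrative sizes only

m_in = rng.standard_normal((N, C, H, W))             # feature map M(in)
w_fc = rng.standard_normal((O * H * W, C * H * W))   # FC kernel W(OHW, CHW)

v_in = m_in.reshape(N, C * H * W)        # RS: flatten into N vectors of length CHW
v_out = v_in @ w_fc.T                    # MMUL: V(out) = V(in) . W^T
m_out = v_out.reshape(N, O, H, W)        # RS: back to a feature map M(out)

# Parameter count of this FC versus a 3x3 conv with the same channels
fc_params = w_fc.size                    # C * O * H^2 * W^2
conv3x3_params = O * C * 3 * 3
print(m_out.shape, fc_params, conv3x3_params)
```

Even at this toy resolution the FC kernel is far larger than the conv kernel, which is the parameter problem the partitioning in the next section addresses.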

Such an FC cannot take advantage of the locality of images, as it computes each output point according to every input point, unaware of the positional information.

- Yet, FC in the above-mentioned manner is **NOT used** because of not only **the lack of local prior** but also the **huge number of parameters**.
- To reduce the parameters, **Global Perceptron** and **Partition Perceptron** are proposed.

The above equations are to introduce the concept of converting conv into FC.

## 1.2. Global Perceptron

- Global Perceptron **splits up the feature map** so that **different partitions can share parameters.** Every *h*×*w* block (e.g., 7×7) is treated as a partition.
- The **input *M*** of **size (*N*, *C*, *H*, *W*)** is **first reshaped** into **(*N*, *C*, *H*/*h*, *h*, *W*/*w*, *w*)**. Note that this operation is cost-free as it does not move data in memory.
- Then, the **order** of axes is **re-arranged as (*N*, *H*/*h*, *W*/*w*, *C*, *h*, *w*)**, which moves the data in memory efficiently. For example, it requires only one function call (permute) in PyTorch.
- Then, the reordered tensor is **reshaped** (which is cost-free again) as (*NHW*/*hw*, *C*, *h*, *w*) (noted as a **partition map** in the figure). In this way, the number of parameters required is reduced from *COH*²*W*² to *COh*²*w*².
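The reshape, re-order, reshape sequence above can be tried out as follows. This is a sketch in NumPy rather than PyTorch (an assumption of this example), with `transpose` plus `ascontiguousarray` standing in for the single `permute` call; sizes and names are illustrative.

```python
import numpy as np

N, C, H, W = 2, 3, 14, 14
h = w = 7                                   # partition size
x = np.arange(N * C * H * W, dtype=np.float64).reshape(N, C, H, W)

# 1) cost-free reshape into (N, C, H/h, h, W/w, w)
x6 = x.reshape(N, C, H // h, h, W // w, w)
# 2) re-order axes to (N, H/h, W/w, C, h, w); this step moves data in memory
x_perm = np.ascontiguousarray(x6.transpose(0, 2, 4, 1, 3, 5))
# 3) cost-free reshape into the partition map (N*H*W/(h*w), C, h, w)
parts = x_perm.reshape(N * (H // h) * (W // w), C, h, w)

# Partition (n, i, j) is the (i, j)-th h x w block of image n
n, i, j = 1, 1, 0
idx = (n * (H // h) + i) * (W // w) + j
assert np.array_equal(parts[idx], x[n, :, i * h:(i + 1) * h, j * w:(j + 1) * w])

# FC parameters with O = C output channels: whole map versus one partition
O = C
full_fc_params = C * O * H**2 * W**2
part_fc_params = C * O * h**2 * w**2
print(parts.shape, full_fc_params // part_fc_params)
```

With a 14×14 map split into 7×7 partitions, the shared FC is smaller by a factor of (*HW*/*hw*)², which is the reduction from *COH*²*W*² to *COh*²*w*² stated above.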

However, splitting breaks the correlations among different partitions of the same channel.

**To add correlations** onto each partition, Global Perceptron 1) uses **average pooling** to obtain a pixel for each partition, 2) feeds it through **BN** and a **two-layer MLP (FC1 & FC2)**, then 3) **reshapes** and **adds it onto the partition map**.
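A rough sketch of these three steps follows. Several details are assumptions of this example and not stated in the text above: NumPy is used, BN is omitted, a ReLU sits between FC1 and FC2, the pooled pixels of all partitions of one image are flattened together before the MLP (so information can mix across partitions), and the hidden width `hidden` is a made-up value.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 2, 4, 14, 14
h = w = 7
hp, wp = H // h, W // w          # number of partitions along each axis
hidden = 8                       # hypothetical hidden width of the MLP

x = rng.standard_normal((N, C, H, W))
x6 = x.reshape(N, C, hp, h, wp, w)

# 1) average pooling: one pixel per partition and channel
pooled = x6.mean(axis=(3, 5))                        # (N, C, hp, wp)

# 2) two-layer MLP (FC1 -> ReLU -> FC2); BN omitted in this sketch
d = C * hp * wp
fc1_w, fc1_b = rng.standard_normal((hidden, d)), np.zeros(hidden)
fc2_w, fc2_b = rng.standard_normal((d, hidden)), np.zeros(d)
v = pooled.reshape(N, d)
v = np.maximum(v @ fc1_w.T + fc1_b, 0.0) @ fc2_w.T + fc2_b

# 3) reshape and add onto every pixel of the matching partition (broadcasting)
y6 = x6 + v.reshape(N, C, hp, 1, wp, 1)

def to_partition_map(t6):
    # (N, C, hp, h, wp, w) -> (N*hp*wp, C, h, w)
    return t6.transpose(0, 2, 4, 1, 3, 5).reshape(N * hp * wp, C, h, w)

x_parts = to_partition_map(x6)
partition_map = to_partition_map(y6)
print(partition_map.shape)
```

Each partition receives one added value per channel, so the difference `partition_map - x_parts` is constant over the *h*×*w* positions of a partition.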

Finally, the partition map is fed into Partition Perceptron and Local Perceptron.

## 1.3. Partition Perceptron

- Partition Perceptron has an **FC** and a **BN** layer.
- **Parameters of FC3** are further **reduced**, inspired by **groupwise conv** (**Xception**). With *g* as the **number of groups**, the groupwise conv is formulated as:

*M*(*out*) = gCONV(*M*(*in*), *F*, *g*, *p*), with *F* of shape (*O*, *C*/*g*, *K*, *K*)

- Similarly, **the kernel of groupwise FC** is *W* of size *Q*×*P*/*g*, which has ***g*× fewer parameters**.

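The implementation is composed of three steps: 1) reshaping *V*(*in*) as a “feature map” with spatial size of 1×1; 2) performing 1×1 conv with *g* groups; 3) reshaping the output “feature map” into *V*(*out*). The groupwise matrix multiplication (gMMUL) is formulated as:

*M*′ = RS(*V*(*in*), (*N*, *P*, 1, 1)), *F*′ = RS(*W*, (*Q*, *P*/*g*, 1, 1)), gMMUL(*V*(*in*), *W*, *g*) = RS(gCONV(*M*′, *F*′, *g*, 0), (*N*, *Q*))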
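To make the groupwise FC concrete, here is a small sketch, assuming NumPy. Instead of calling a grouped 1×1 conv, it computes the same per-group products with `einsum` and checks them against an equivalent block-diagonal dense kernel. The function name `gmmul` and all sizes are illustrative.

```python
import numpy as np

def gmmul(v_in, w, g):
    """Groupwise matrix multiplication: v_in (N, P), w (Q, P // g) -> (N, Q)."""
    n, p = v_in.shape
    q = w.shape[0]
    v_g = v_in.reshape(n, g, p // g)            # split the inputs into g groups
    w_g = w.reshape(g, q // g, p // g)          # each group owns Q/g outputs
    out = np.einsum("ngp,gqp->ngq", v_g, w_g)   # per-group V . W^T
    return out.reshape(n, q)

rng = np.random.default_rng(0)
N, P, Q, g = 3, 8, 12, 4
v_in = rng.standard_normal((N, P))
w = rng.standard_normal((Q, P // g))            # g times fewer parameters than (Q, P)

# Equivalent dense FC kernel: block-diagonal, zero outside each group's block
w_dense = np.zeros((Q, P))
for k in range(g):
    rows = slice(k * (Q // g), (k + 1) * (Q // g))
    cols = slice(k * (P // g), (k + 1) * (P // g))
    w_dense[rows, cols] = w[rows]

assert np.allclose(gmmul(v_in, w, g), v_in @ w_dense.T)
print(w.size, w_dense.size)
```

The dense kernel stores *g* times as many entries, all but 1/*g* of them zero, which is where the *g*× saving comes from.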

## 1.4. Local Perceptron

Local Perceptron feeds the partition map through several conv layers. A BN follows every conv.

- The **number of groups *g*** should be the same as in the Partition Perceptron. The **outputs of all the conv branches** and Partition Perceptron are **added up** as the **final output**.

## 1.5. Merging Conv into FC

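- Here is to show how to **merge a conv into FC**. With the FC kernel *W*(1) and conv kernel *F*, we want a kernel *W*′ such that:

MMUL(*M*(*in*), *W*′) = MMUL(*M*(*in*), *W*(1)) + CONV(*M*(*in*), *F*, *p*)

- The **additivity** of MMUL ensures that:

MMUL(*M*(*in*), *W*(1)) + MMUL(*M*(*in*), *W*(2)) = MMUL(*M*(*in*), *W*(1) + *W*(2))

- Also, **conv** can be viewed as a **sparse FC that shares parameters among spatial positions.** So we can **merge *F* into *W*(1)** as long as we **manage to construct *W*(*F*,*p*)** of the same shape as *W*(1) which satisfies:

MMUL(*M*(*in*), *W*(*F*,*p*)) = CONV(*M*(*in*), *F*, *p*)

- Thus, **for any input *M*(*in*)**, conv kernel *F* and padding *p*, **there exists an FC kernel *W*(*F*,*p*)** such that:

*M*(*out*) = CONV(*M*(*in*), *F*, *p*) = MMUL(*M*(*in*), *W*(*F*,*p*))

- With the **formulation in Section 1.1**, we can have:

*V*(*out*) = *V*(*in*) · *W*(*F*,*p*)ᵀ

- With the associative law, an **identity matrix *I* can be inserted** as below:

*V*(*out*) = *V*(*in*) · (*I* · *W*(*F*,*p*)ᵀ)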

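In short, the equivalent FC kernel of a conv kernel is the result of convolution on an identity matrix with proper reshaping:

*W*(*F*,*p*) = RS(CONV(*M*(*I*), *F*, *p*), (*Chw*, *Ohw*))ᵀ, where *M*(*I*) = RS(*I*, (*Chw*, *C*, *h*, *w*))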
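The "convolve an identity matrix, then reshape" recipe can be checked numerically. The sketch below assumes NumPy, a stride-1 conv without groups, and padding that keeps the spatial size; `conv2d` is a small helper written for this example and `conv_to_fc` is an illustrative name, not the authors' code.

```python
import numpy as np

def conv2d(m, f, p):
    """Stride-1 cross-correlation: m (N, C, H, W), f (O, C, K, K) -> (N, O, H', W')."""
    n, c, h, w = m.shape
    o, _, k, _ = f.shape
    mp = np.pad(m, ((0, 0), (0, 0), (p, p), (p, p)))
    ho, wo = h + 2 * p - k + 1, w + 2 * p - k + 1
    out = np.zeros((n, o, ho, wo))
    for i in range(k):
        for j in range(k):
            patch = mp[:, :, i:i + ho, j:j + wo]              # shifted input window
            out += np.einsum("nchw,oc->nohw", patch, f[:, :, i, j])
    return out

def conv_to_fc(f, p, c, h, w):
    """Equivalent FC kernel of shape (O*h*w, C*h*w) for a size-preserving conv."""
    o = f.shape[0]
    m_i = np.eye(c * h * w).reshape(c * h * w, c, h, w)       # identity as feature maps
    return conv2d(m_i, f, p).reshape(c * h * w, o * h * w).T

rng = np.random.default_rng(0)
N, C, O, h, w, K = 2, 3, 4, 5, 5, 3
p = K // 2                                                    # keeps h x w unchanged
m_in = rng.standard_normal((N, C, h, w))
f = rng.standard_normal((O, C, K, K))
w1 = rng.standard_normal((O * h * w, C * h * w))              # an existing FC kernel

# The FC built from the conv reproduces the conv output
w_conv = conv_to_fc(f, p, C, h, w)
fc_of_conv = (m_in.reshape(N, -1) @ w_conv.T).reshape(N, O, h, w)
assert np.allclose(fc_of_conv, conv2d(m_in, f, p))

# Additivity: one merged kernel replaces the FC branch plus the conv branch
w_merged = w1 + w_conv
lhs = m_in.reshape(N, -1) @ w_merged.T
rhs = m_in.reshape(N, -1) @ w1.T + conv2d(m_in, f, p).reshape(N, -1)
assert np.allclose(lhs, rhs)
```

Row *j* of the convolved identity is the conv response to a one-hot input at position *j*, which is why the reshaped result is the transpose of the FC kernel.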

# 2. **RepMLP**

## 2.1. Converting RepMLP into Three FC Layers

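**BN layers** are eliminated by equivalently **fusing them into the preceding conv layers and FC3.** For output channel *i*, with BN's accumulated mean μ, standard deviation σ, learned scale γ and bias β, the **new kernel *F*′ and new bias *b*′** can be constructed as:

*F*′ᵢ = (γᵢ/σᵢ) *F*ᵢ, *b*′ᵢ = βᵢ − μᵢγᵢ/σᵢ

so that

(γᵢ/σᵢ)(CONV(*M*, *F*, *p*)ᵢ − μᵢ) + βᵢ = CONV(*M*, *F*′, *p*)ᵢ + *b*′ᵢ

- where the **left** side is the **original** computation flow of a conv-BN, and the **right** is the **constructed conv with bias**.
- **The 1D BN and FC3 of Partition Perceptron are fused** in a similar way into ***Ŵ***. **Every conv**, converted via Eq. 15, can also be **added into *Ŵ***.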
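The BN fusion can be verified with a short numerical check. This sketch assumes NumPy, inference-mode BN with σ = sqrt(var + eps), and an FC followed by a 1D BN; the function name `fuse_bn_into_fc` and the sizes are illustrative.

```python
import numpy as np

def fuse_bn_into_fc(w, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-mode BN that follows an FC into the FC kernel and a bias."""
    sigma = np.sqrt(var + eps)
    scale = gamma / sigma                       # one factor per output unit
    return w * scale[:, None], beta - mean * scale

rng = np.random.default_rng(0)
N, P, Q = 4, 6, 5
v = rng.standard_normal((N, P))
w = rng.standard_normal((Q, P))
gamma, beta = rng.standard_normal(Q), rng.standard_normal(Q)
mean, var = rng.standard_normal(Q), rng.uniform(0.5, 2.0, Q)

# Left side: FC followed by BN; right side: fused FC with bias
bn_out = gamma * (v @ w.T - mean) / np.sqrt(var + 1e-5) + beta
w_f, b_f = fuse_bn_into_fc(w, gamma, beta, mean, var)
assert np.allclose(bn_out, v @ w_f.T + b_f)
```

For a conv kernel of shape (*O*, *C*, *K*, *K*) the same per-output-channel factor applies, broadcast as `scale[:, None, None, None]`.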

Finally, a single FC kernel and a single bias vector are obtained, which will be used to parameterize the inference-time FC3.

- The **BN in Global Perceptron** is also removed: since it precedes FC1, it can be **absorbed by FC1.**

## 2.2. RepMLP-ResNet

- RepMLP bottleneck **further performs *r*× channel reduction** before RepMLP and *r*× channel expansion afterwards via 3×3 conv.
- In the middle of the bottleneck, RepMLP is used.

# 3. Results

## 3.1. Ablation Study

*r* = 2 or 4 and *g* = 4 or 8 are used for the better trade-off.

**RepMLP should be combined with traditional conv for the best performance**, as using it in all the four stages delivers lower accuracy than c2+c3+c4 and c3+c4.

Finally, RepMLP is used in c3+c4.

## 3.2. SOTA Comparisons on ImageNet

Compared to the traditional ConvNets with comparable numbers of parameters, the FLOPs of RepMLP-Res50 are much lower and the speed is faster.

## 3.3. Face Recognition

- For the RepMLP counterpart, FaceResNet is modified by replacing the stride-1 bottlenecks of c2, c3, c4 (i.e., the last two bottlenecks of c2 and the last blocks of c3, c4) by **RepMLP Bottlenecks** with *h*=*w*=6, *r*=2, *g*=4.

RepMLP-FaceRes outperforms in both accuracy and speed. Compared to MobileFaceNet, RepMLP-FaceRes shows 4.91% higher accuracy and runs 8% faster.

## 3.4. Cityscapes

- An ImageNet-pretrained backbone is used, with PSPNet as the framework.
- **PSPNet with RepMLP-Res50-*g*4/8 outperforms the Res-50 backbone by 2.21% in mIoU**. Though it has more parameters, **the FLOPs are lower and the speed is faster.**

## 3.5. RepMLP-ResNet for High Speed (Appendix)

**Only 1×1 conv** is used for **8× channel reduction/expansion** before/after RepMLP.

ResNet with RepMLP Light Block achieves almost the same accuracy as the original ResNet-50 with 30% lower FLOPs and 55% faster speed.