Review — RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

RepMLP, Formulate Conv to FC, Absorb BN, Merge Branches into One

Sik-Ho Tsang
8 min read · Feb 27


RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition,
RepMLP, by Tsinghua University, MEGVII Technology, and Aberystwyth University,
2022 CVPR, Over 50 Citations (Sik-Ho Tsang @ Medium)
Image Classification, MLP


  • RepMLP is proposed, which is composed of a series of fully-connected (FC) layers.
  • A structural re-parameterization technique is used to add a local prior to an FC, which makes it powerful for image recognition.
  • Specifically, convolutional layers are constructed inside a RepMLP during training and are merged into the FC for inference.


  1. Conversion of Conv to MLP/FC
  2. RepMLP
  3. Results

1. Conversion of Conv to MLP/FC

Sketch of a RepMLP.
  • The aim of RepMLP is to use only MLP/FC layers, and NOT conv layers, inside the block at inference time.

A training-time RepMLP is composed of three parts termed as Global Perceptron, Partition Perceptron and Local Perceptron.

A training-time RepMLP is converted into three FC layers for inference, where the key is a simple, platform-agnostic and differentiable method for merging a conv into an FC.

In brief, the authors convert a conv into an FC using a reshaping trick. Because an FC over the whole image needs a huge number of parameters, the feature map is split into patches so that a small FC can be used. BN is absorbed using a merging trick.

Once the feature map is split into patches, the correlation between patches can no longer be utilized, so the Global Perceptron, built from FC1 and FC2, is introduced.

Partition Perceptron is FC3, applied to the partition maps (the patches after Global Perceptron), and Local Perceptron is a set of convs with different receptive fields that are converted into an FC by the reshaping trick.

(There are more intuitions behind these designs; please read the paper directly.)

1.1. Conv to FC Formulation

  • For convolution, an input tensor M(in) going through a K×K conv is formulated as:
        M(out) = CONV(M(in), F, p)    (1)
  • where M(out) is the output tensor, F is the conv kernel, and p is the pad size.
  • For a fully-connected (FC) layer, let P and Q be the input and output dimensions, and V(in) and V(out) be the input and output, respectively. The kernel is W, of size Q×P, and the matrix multiplication (MMUL) is formulated as:
        V(out) = MMUL(V(in), W) = V(in) · Wᵀ    (2)
  • When an FC takes a feature map M(in) as in (1) instead of a matrix V(in) as in (2), a reshape (RS) is applied.
  • The input is first flattened into N vectors of length CHW, i.e., V(in)=RS(M(in),(N,CHW)), then multiplied by the kernel W(OHW,CHW), and the output V(out)(N,OHW) is reshaped back into M(out)(N,O,H,W). (Reshaping is cost-free.) For readability, RS is omitted:
        M(out) = MMUL(M(in), W)
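
The flatten, multiply, reshape steps above can be sketched in NumPy. This is only an illustration under stated assumptions, not the authors' code: the tensor sizes and the random kernel are made-up values, and `fc_on_feature_map` is a hypothetical helper name.

```python
import numpy as np

def fc_on_feature_map(m_in, w, out_channels):
    """Apply an FC to a feature map: flatten, multiply by W^T, reshape back."""
    n, c, h, wd = m_in.shape
    v_in = m_in.reshape(n, c * h * wd)             # RS(M_in, (N, CHW))
    v_out = v_in @ w.T                             # W has shape (OHW, CHW)
    return v_out.reshape(n, out_channels, h, wd)   # RS(V_out, (N, O, H, W))

# Illustrative sizes (assumptions, not values from the paper).
N, C, O, H, W_ = 2, 3, 4, 5, 5
rng = np.random.default_rng(0)
m_in = rng.standard_normal((N, C, H, W_))
w = rng.standard_normal((O * H * W_, C * H * W_))

m_out = fc_on_feature_map(m_in, w, O)
num_params = w.size                                # C * O * H^2 * W^2
```

Even for this tiny 5×5 map the kernel already holds C·O·H²·W² weights, which is the parameter blow-up that the partitioning in the next subsections is meant to avoid.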

Such an FC cannot take advantage of the locality of images as it computes each output point according to every input point, unaware of the positional information.

  • Yet, an FC used in this manner is NOT adopted, because of both the lack of a local prior and the huge number of parameters.
  • To reduce the parameters, Global Perceptron and Partition Perceptron are proposed.

The equations above only introduce the concept of converting a conv into an FC.

1.2. Global Perceptron

Upper Part of RepMLP (Left: Training, Right: Inference)
  • Global Perceptron splits up the feature map so that different partitions can share parameters. Every 7×7 block is treated as a partition.
  • The input M of size (N,C,H,W) is first reshaped into (N,C,H/h,h,W/w,w). Note that this operation is cost-free as it does not move data in memory.
  • Then, the order of axes is re-arranged as (N,H/h,W/w,C,h,w), which moves the data in memory efficiently. For example, it requires only one function call (permute) in PyTorch.
  • Then, the reordered tensor is reshaped (which is cost-free again) as (NHW/hw,C,h,w) (noted as a partition map in the figure). In this way, the number of parameters required is reduced from COH²W² to COh²w².
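
The reshape, re-order, reshape sequence above can be sketched as follows. This is a minimal NumPy version with made-up sizes; `to_partition_map` is a hypothetical name, and NumPy's `transpose` stands in for the PyTorch `permute` call mentioned above.

```python
import numpy as np

def to_partition_map(m, h, w):
    """Split (N, C, H, W) into non-overlapping h x w partitions."""
    n, c, H, W = m.shape
    assert H % h == 0 and W % w == 0, "partition size must divide the feature map"
    x = m.reshape(n, c, H // h, h, W // w, w)   # view only, no data movement
    x = x.transpose(0, 2, 4, 1, 3, 5)           # axes -> (N, H/h, W/w, C, h, w)
    # This reshape copies in NumPy, since the transposed view is non-contiguous.
    return x.reshape(n * (H // h) * (W // w), c, h, w)

m = np.arange(2 * 3 * 14 * 14, dtype=np.float64).reshape(2, 3, 14, 14)
parts = to_partition_map(m, 7, 7)               # shape (8, 3, 7, 7)

# Parameter count of a dense FC before and after partitioning (C = O = 3 here).
C = O = 3
full_params = C * O * (14 * 14) ** 2            # C*O*H^2*W^2
part_params = C * O * (7 * 7) ** 2              # C*O*h^2*w^2
```

With a 14×14 map and 7×7 partitions, the FC kernel shrinks by a factor of (HW/hw)² = 16.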

However, splitting breaks the correlations among different partitions of the same channel.

  • To add correlations onto each partition, Global Perceptron 1) uses average pooling to obtain a pixel for each partition, 2) feeds it through BN and a two-layer MLP (FC1 & FC2), then 3) reshapes and adds it onto the partition map.
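
A rough sketch of these three steps is given below. It makes several assumptions that are not stated in this review: BN is left out for brevity, a ReLU is placed between FC1 and FC2, the hidden width of 8 is arbitrary, and `global_perceptron` is a hypothetical name.

```python
import numpy as np

def global_perceptron(m, h, w, fc1, fc2):
    """Add per-partition global context to a feature map (BN omitted)."""
    n, c, H, W = m.shape
    hp, wp = H // h, W // w
    x = m.reshape(n, c, hp, h, wp, w)
    pooled = x.mean(axis=(3, 5))               # one pixel per partition: (N, C, H/h, W/w)
    v = pooled.reshape(n, c * hp * wp)
    v = np.maximum(v @ fc1.T, 0.0) @ fc2.T     # FC1 -> ReLU (assumed) -> FC2
    v = v.reshape(n, c, hp, 1, wp, 1)          # broadcasts over each h x w partition
    x = x + v
    # Re-arrange into the partition map (N*H/h*W/w, C, h, w).
    return x.transpose(0, 2, 4, 1, 3, 5).reshape(n * hp * wp, c, h, w)

rng = np.random.default_rng(0)
m = rng.standard_normal((2, 3, 14, 14))
d = 3 * 2 * 2                                  # C * (H/h) * (W/w)
fc1 = rng.standard_normal((8, d))              # hidden width 8 is arbitrary
fc2 = rng.standard_normal((d, 8))
out = global_perceptron(m, 7, 7, fc1, fc2)     # shape (8, 3, 7, 7)
```

Each partition receives one scalar per channel, computed from the pooled values of all partitions, which is how information crosses partition boundaries.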

Finally, the partition map is fed into Partition Perceptron and Local Perceptron.

1.3. Partition Perceptron

Partition Perceptron (Training)
  • Partition Perceptron has an FC and a BN layer.
  • Inspired by groupwise conv (Xception), the parameters of FC3 are further reduced. With g as the number of groups, the groupwise conv is formulated as:
        M(out) = gCONV(M(in), F, g, p), where F has size O × C/g × K × K
  • Similarly, the kernel of groupwise FC is W of size Q×P/g, which has g× fewer parameters.

The implementation is composed of three steps: 1) reshaping V(in) as a “feature map” with spatial size of 1×1; 2) performing 1×1 conv with g groups; 3) reshaping the output “feature map” into V(out). The groupwise matrix multiplication (gMMUL) is formulated as:
        V(out) = gMMUL(V(in), W, g) = RS(gCONV(RS(V(in), (N,P,1,1)), RS(W, (Q,P/g,1,1)), g, 0), (N,Q))
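
The groupwise FC can also be sketched without a conv call, as g independent small matrix multiplications. This is an illustrative NumPy version, assuming the usual group-conv layout in which group j owns input features [j·P/g, (j+1)·P/g) and output features [j·Q/g, (j+1)·Q/g). The names `gmmul` and `dense_equivalent` and all sizes are hypothetical.

```python
import numpy as np

def gmmul(v_in, w, g):
    """Groupwise FC: v_in is (N, P), w is (Q, P/g); returns (N, Q)."""
    n, p = v_in.shape
    q = w.shape[0]
    assert p % g == 0 and q % g == 0 and w.shape[1] == p // g
    x = v_in.reshape(n, g, p // g)             # split inputs into g groups
    k = w.reshape(g, q // g, p // g)           # each group owns Q/g output rows
    out = np.einsum("ngi,goi->ngo", x, k)      # independent matmul per group
    return out.reshape(n, q)

def dense_equivalent(w, g):
    """Block-diagonal (Q, P) kernel that computes the same thing as gmmul."""
    q, pg = w.shape
    dense = np.zeros((q, pg * g))
    for j in range(g):
        rows = slice(j * q // g, (j + 1) * q // g)
        dense[rows, j * pg:(j + 1) * pg] = w[rows]
    return dense

rng = np.random.default_rng(0)
v = rng.standard_normal((2, 8))                # N=2, P=8
w = rng.standard_normal((12, 2))               # Q=12, P/g=2 with g=4
grouped = gmmul(v, w, 4)
dense = v @ dense_equivalent(w, 4).T
```

The grouped kernel stores Q·P/g weights, while its block-diagonal dense equivalent stores Q·P entries, most of them zero.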

1.4. Local Perceptron

Local Perceptron

Local Perceptron feeds the partition map through several conv layers. A BN follows every conv.

  • The number of groups g should be the same as in the Partition Perceptron. The outputs of all the conv branches and the Partition Perceptron are added up as the final output.

1.5. Merging Conv into FC

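  • This section shows how to merge a conv into an FC. With the FC kernel W(1) and the conv kernel F, we want a merged kernel W’ such that:
        MMUL(M(in), W’) = MMUL(M(in), W(1)) + CONV(M(in), F, p)
  • The additivity of MMUL ensures that:
        MMUL(M(in), W(1)) + MMUL(M(in), W(2)) = MMUL(M(in), W(1) + W(2))
  • Also, a conv can be viewed as a sparse FC that shares parameters among spatial positions. So F can be merged into W(1) as long as we manage to construct W(F,p) of the same shape as W(1), which satisfies:
        MMUL(M(in), W(F,p)) = CONV(M(in), F, p)
  • Thus, for any input M(in), conv kernel F and padding p, there exists an FC kernel W(F,p) such that:
        M(out) = CONV(M(in), F, p) = MMUL(M(in), W(F,p))
  • With the formulation in Section 1.1, we have:
        V(out) = V(in) · W(F,p)ᵀ
  • With the associative law, an identity matrix I of size CHW×CHW can be inserted:
        V(out) = V(in) · (I · W(F,p)ᵀ)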

In short, the equivalent FC kernel of a conv kernel is the result of convolution on an identity matrix with proper reshaping:
        W(F,p) = RS(CONV(RS(I, (CHW,C,H,W)), F, p), (CHW,OHW))ᵀ
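
The identity-matrix construction can be checked numerically. The sketch below is a NumPy illustration under stated assumptions, not the authors' implementation: `conv2d` is a small hand-written stride-1 conv (cross-correlation, as in deep-learning frameworks) with no groups, padding is K//2 so the resolution is kept, and all sizes and random kernels are made up.

```python
import numpy as np

def conv2d(m, f, p):
    """Stride-1 cross-correlation with zero padding p; m is (N,C,H,W), f is (O,C,K,K)."""
    n, c, h, w = m.shape
    o, _, k, _ = f.shape
    mp = np.pad(m, ((0, 0), (0, 0), (p, p), (p, p)))
    ho, wo = h + 2 * p - k + 1, w + 2 * p - k + 1
    out = np.zeros((n, o, ho, wo))
    for i in range(k):
        for j in range(k):
            patch = mp[:, :, i:i + ho, j:j + wo]               # shifted input window
            out += np.einsum("nchw,oc->nohw", patch, f[:, :, i, j])
    return out

def conv_to_fc_kernel(f, p, c, h, w):
    """W(F,p): convolve the reshaped identity matrix, then reshape and transpose."""
    o = f.shape[0]
    m_i = np.eye(c * h * w).reshape(c * h * w, c, h, w)        # identity as CHW feature maps
    return conv2d(m_i, f, p).reshape(c * h * w, o * h * w).T   # shape (OHW, CHW)

rng = np.random.default_rng(0)
N, C, O, H, W_, K = 2, 2, 3, 4, 4, 3
m_in = rng.standard_normal((N, C, H, W_))
f = rng.standard_normal((O, C, K, K))

w_fp = conv_to_fc_kernel(f, K // 2, C, H, W_)
conv_out = conv2d(m_in, f, K // 2)
fc_out = (m_in.reshape(N, -1) @ w_fp.T).reshape(N, O, H, W_)

# Merging: a stand-in FC kernel W(1) plus W(F,p) replaces the two branches.
w1 = rng.standard_normal(w_fp.shape)
two_branch = (m_in.reshape(N, -1) @ w1.T).reshape(N, O, H, W_) + conv_out
merged_out = (m_in.reshape(N, -1) @ (w1 + w_fp).T).reshape(N, O, H, W_)
```

Here `fc_out` should match `conv_out`, and `merged_out` should match `two_branch`, up to floating-point error. The construction only relies on the conv being linear, so any differentiable conv implementation can be used in place of `conv2d`.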

2. RepMLP

2.1. Converting RepMLP into Three FC Layers

Inference Time RepMLP
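  • BN layers are eliminated by equivalently fusing them into the preceding conv layers and FC3. With μ, σ, γ, β being the mean, standard deviation, scale and bias of the BN, the new kernel F’ and new bias b’ are constructed for each output channel i as:
        F’(i,:,:,:) = (γ(i)/σ(i)) · F(i,:,:,:),   b’(i) = β(i) − μ(i)·γ(i)/σ(i)
    so that:
        (γ(i)/σ(i)) · (CONV(M, F, p)(:,i,:,:) − μ(i)) + β(i) = CONV(M, F’, p)(:,i,:,:) + b’(i)
  • where the left side is the original computation flow of a conv-BN, and the right is the constructed conv with bias.
  • The 1D BN and FC3 of Partition Perceptron are fused in a similar way into Ŵ.
  • Every conv kernel, converted into an FC kernel via Eq. 15, can also be added into Ŵ.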
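
The same fusion applied to an FC followed by a 1D BN can be sketched as below. This is an illustration with made-up sizes and statistics: `fuse_fc_bn` is a hypothetical name, and σ is assumed to be sqrt(var + eps) with a small eps, as in common frameworks.

```python
import numpy as np

def fuse_fc_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-mode BN that follows a bias-free FC into (W', b')."""
    scale = gamma / np.sqrt(var + eps)         # gamma_i / sigma_i per output feature
    return w * scale[:, None], beta - mean * scale

rng = np.random.default_rng(0)
N, P, Q = 4, 6, 5
v_in = rng.standard_normal((N, P))
w = rng.standard_normal((Q, P))
gamma, beta = rng.standard_normal(Q), rng.standard_normal(Q)
mean, var = rng.standard_normal(Q), rng.uniform(0.5, 2.0, Q)

# Original flow: FC, then BN with fixed (inference-time) statistics.
fc_bn_out = (v_in @ w.T - mean) / np.sqrt(var + 1e-5) * gamma + beta

# Fused flow: a single FC with bias.
w_fused, b_fused = fuse_fc_bn(w, gamma, beta, mean, var)
fused_out = v_in @ w_fused.T + b_fused
```

The fusion is exact only with fixed BN statistics, which is why it is applied when converting the trained model for inference.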

Finally, a single FC kernel and a single bias vector are obtained, which will be used to parameterize the inference-time FC3.

  • The BN in Global Perceptron is also removed, since it can be absorbed by FC1.

2.2. RepMLP-ResNet

Sketch of a RepMLP Bottleneck.
  • RepMLP bottleneck further performs r× channel reduction before RepMLP and r× channel expansion afterwards via 3×3 conv.
  • In the middle of bottleneck, RepMLP is used.

3. Results

3.1. Ablation Study

Results with 224×224 input and different r and g, in c4 only. The speed is in examples/second.

r = 2 or 4 and g = 4 or 8 are used for the better trade-off.

Using RepMLP in different stages of ResNet-50 with 224×224 input. The speed is in examples/second.
  • RepMLP should be combined with traditional conv for the best performance, as using it in all the four stages delivers lower accuracy than c2+c3+c4 and c3+c4.

Finally, RepMLP is used in c3+c4.

3.2. SOTA Comparisons on ImageNet

Comparisons with traditional ConvNets on ImageNet all trained with the identical settings.

Compared to the traditional ConvNets with comparable numbers of parameters, RepMLP-Res50 has much lower FLOPs and runs faster.

3.3. Face Recognition

Results of face recognition on MS1M-V2 and MegaFace. The speed is in examples/second.
  • For the RepMLP counterpart, FaceResNet is modified by replacing the stride-1 bottlenecks of c2,c3,c4 (i.e., the last two bottlenecks of c2 and the last blocks of c3,c4) by RepMLP Bottlenecks with h=w=6; r=2; g=4.

RepMLP-FaceRes outperforms the compared baselines in both accuracy and speed. Compared to MobileFaceNet, RepMLP-FaceRes shows 4.91% higher accuracy and runs 8% faster.

3.4. Cityscapes

Semantic segmentation on Cityscapes tested on the validation subset. The speed is in examples/second.
  • Using ImageNet pretrained backbone and PSPNet as framework.
  • PSPNet with RepMLP-Res50-g4/8 outperforms the Res-50 backbone by 2.21% in mIoU. Though it has more parameters, its FLOPs are lower and its speed is faster.

3.5. RepMLP-ResNet for High Speed (Appendix)

The original bottleneck, RepMLP Bottleneck and RepMLP Light Block.
  • Only 1×1 conv is used for 8× channel reduction/expansion before/after RepMLP.
ResNet-50 with different blocks in c3 and c4. The speed is in examples/second.

ResNet with RepMLP Light Block achieves almost the same accuracy as the original ResNet-50 with 30% lower FLOPs and 55% faster speed.


