# Review — CMT: Convolutional Neural Networks Meet Vision Transformers

## CMT, CNN Meets Vision Transformer

---

CMT: Convolutional Neural Networks Meet Vision Transformers, by University of Sydney, and Huawei Noah's Ark Lab

CMT, 2022 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)


A new **Transformer-based hybrid network** is proposed, taking advantage of **Transformers** to **capture long-range dependencies** and of **CNNs** to **extract local information**.

- A family of models, CMTs, is constructed, which offers a **much better trade-off between accuracy and efficiency**, as shown above.

# Outline

1. **CMT: Architecture**
2. **CMT: Complexity, Scaling, and Variants**
3. **Results**

# 1. CMT: Architecture

The proposed **CMT block** consists of a **local perception unit (LPU)**, a **lightweight multi-head self-attention (LMHSA)** module, and an **inverted residual feed-forward network (IRFFN)**.

## 1.1. Local Perception Unit (LPU)

- The **absolute positional encoding** used in previous Transformers, initially designed to leverage the order of tokens, **damages translation-invariance.**

To alleviate this limitation, the **LPU** is proposed to **extract local information using depth-wise convolution** (as in MobileNetV1), which is defined as:

LPU(*X*) = DWConv(*X*) + *X*
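As a rough sketch (not the authors' code), the LPU is simply a residual depth-wise convolution. Below is a minimal NumPy version, assuming a 3×3 kernel with zero padding; all shapes and names are illustrative:

```python
import numpy as np

def depthwise_conv3x3(x, w):
    """Depth-wise 3x3 convolution with zero padding.
    x: (H, W, C) feature map, w: (3, 3, C) one kernel per channel."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # Each channel is convolved independently with its own kernel.
            out[i, j] = np.einsum("klc,klc->c", xp[i:i+3, j:j+3], w)
    return out

def lpu(x, w):
    """Local Perception Unit: LPU(X) = DWConv(X) + X."""
    return depthwise_conv3x3(x, w) + x

x = np.random.randn(8, 8, 16)
w = np.random.randn(3, 3, 16)
y = lpu(x, w)
print(y.shape)  # (8, 8, 16): spatial size and channel count are preserved
```

Because the convolution is residual, an all-zero kernel reduces the LPU to the identity, which makes the shortcut easy to sanity-check.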

## 1.2. Lightweight Multi-Head Self-Attention (LMHSA)

- In the **original self-attention** module, attention is computed as:

Attn(*Q*, *K*, *V*) = Softmax(*QK*ᵀ / √*dk*) *V*

To mitigate the computation overhead, a *k*×*k* **depth-wise convolution with stride *k*** (MobileNetV1) is used to **reduce the spatial size of *K* and *V*** before the attention operation. Also, a **relative position bias *B*** (similar to Shaw NAACL'18) **is added** to each self-attention module:

LightAttn(*Q*, *K*, *V*) = Softmax(*QK′*ᵀ / √*dk* + *B*) *V′*, where *K′* = DWConv(*K*) and *V′* = DWConv(*V*).
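A hedged single-head sketch of this idea in NumPy: average pooling stands in for the learned stride-*k* depth-wise convolution (an assumption for brevity), the projections are random, and the relative position bias is a zero placeholder. The point is only the shape arithmetic: queries come from all *n* tokens, but keys and values come from *n*/*k*² tokens.

```python
import numpy as np

def lmhsa_single_head(x, k=2, dk=16):
    """Lightweight self-attention sketch: K and V are computed from a
    spatially k-times-reduced map, so the attention matrix is
    (n x n/k^2) instead of (n x n). x: (H, W, C)."""
    H, W, C = x.shape
    n = H * W
    # Spatial reduction stand-in: (H, W, C) -> (H/k, W/k, C) by mean pooling.
    xr = x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))
    nr = (H // k) * (W // k)

    rng = np.random.default_rng(0)
    wq, wk, wv = (rng.standard_normal((C, dk)) for _ in range(3))
    q = x.reshape(n, C) @ wq        # (n, dk)
    kk = xr.reshape(nr, C) @ wk     # (n/k^2, dk)
    v = xr.reshape(nr, C) @ wv      # (n/k^2, dk)

    b = np.zeros((n, nr))           # relative position bias placeholder
    logits = q @ kk.T / np.sqrt(dk) + b
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v                 # (n, dk)

out = lmhsa_single_head(np.random.randn(8, 8, 32), k=2)
print(out.shape)  # (64, 16): 64 queries attend over only 16 keys
```

In the real module this is repeated for *h* heads and the outputs are concatenated and projected back to *d* channels.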

- There are *h* such heads, as in the original multi-head design.

## 1.3. Inverted Residual Feed-forward Network (IRFFN)

**The original FFN** uses two linear layers with a GELU in between:

FFN(*X*) = GELU(*XW*₁ + *b*₁)*W*₂ + *b*₂

The **IRFFN** is proposed, which consists of **an expansion layer followed by a depth-wise convolution (MobileNetV1) and a projection layer**. Specifically, **the location of the shortcut connection is also modified** for better performance:

IRFFN(*X*) = Conv(F(Conv(*X*))), where F(*X*) = DWConv(*X*) + *X*.

- The **depth-wise convolution (MobileNetV1)** is used to extract local information with **negligible extra computational cost.**
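A minimal NumPy sketch of the IRFFN structure, under our own simplifications (1×1 convolutions written as per-pixel linear layers, a 3×3 depth-wise kernel, tanh-approximated GELU); parameter names are illustrative:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def irffn(x, w_up, w_dw, w_down):
    """IRFFN(X) = Conv(F(Conv(X))), with F(X) = DWConv(X) + X.
    x: (H, W, C); w_up: (C, rC) expansion; w_dw: (3, 3, rC) depth-wise;
    w_down: (rC, C) projection. A rough sketch, not the paper's code."""
    h = gelu(x @ w_up)                  # 1x1 expansion conv == per-pixel linear
    H, W, Ce = h.shape
    hp = np.pad(h, ((1, 1), (1, 1), (0, 0)))
    dw = np.zeros_like(h)
    for i in range(H):
        for j in range(W):
            dw[i, j] = np.einsum("klc,klc->c", hp[i:i+3, j:j+3], w_dw)
    h = gelu(dw + h)                    # inner shortcut: F(X) = DWConv(X) + X
    return h @ w_down                   # 1x1 projection back to C channels

x = np.random.randn(8, 8, 16)
r = 4  # expansion ratio
y = irffn(x, np.random.randn(16, 16 * r), np.random.randn(3, 3, 16 * r),
          np.random.randn(16 * r, 16))
print(y.shape)  # (8, 8, 16)
```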

## 1.4. CMT Block

With the aforementioned three components, the **CMT block** can be formulated as:

*Yi* = LPU(*Xi*₋₁),
*Zi* = LMHSA(LN(*Yi*)) + *Yi*,
*Xi* = IRFFN(LN(*Zi*)) + *Zi*.

- where *Yi* and *Zi* denote the output features of the LPU and the LMHSA module for the *i*-th block, respectively, and LN denotes layer normalization.
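The residual wiring of one block can be sketched with stand-in sub-layers (simple stubs, purely to show the data flow, not the real modules):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Stand-ins for the real sub-layers: the point is the wiring, not the math.
lpu = lambda x: x + 0.1 * np.roll(x, 1, axis=0)                # local mixing stub
lmhsa = lambda x: 0.1 * x.mean(axis=0, keepdims=True) + 0 * x  # global mixing stub
irffn = lambda x: 0.1 * np.tanh(x)                             # channel MLP stub

def cmt_block(x_prev):
    y = lpu(x_prev)                  # Y_i = LPU(X_{i-1})
    z = lmhsa(layer_norm(y)) + y     # Z_i = LMHSA(LN(Y_i)) + Y_i
    return irffn(layer_norm(z)) + z  # X_i = IRFFN(LN(Z_i)) + Z_i

x = np.random.randn(64, 32)          # n tokens, d channels
print(cmt_block(x).shape)  # (64, 32)
```

Note that, unlike a standard Transformer block, the LPU is applied before normalization and carries its own residual connection internally.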

# 2. CMT: Complexity, Scaling, and Variants

## 2.1. Model Complexity

- The **computational complexity (FLOPs)** of a standard **Transformer** block involves *r*, the expansion ratio of the FFN, and *dk* and *dv*, the dimensions of the key and value, respectively. For **ViT**, where *d* = *dk* = *dv* and *r* = 4, the **cost** can be simplified to 12*nd*² + 2*n*²*d*.

- The FLOPs of the **CMT block** are lower: the quadratic attention term 2*n*²*d* shrinks to 2*n*²*d*/*k*², where *k* ≥ 1 is the reduction ratio in LMHSA.

Compared to the standard Transformer block, the **CMT block is more computation-friendly and better suited to processing feature maps at higher resolution (larger *n*)**.
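A quick back-of-the-envelope comparison, using our own simplified count (projections + FFN ≈ (4 + 2*r*)*nd*², attention 2*n*²*d*, with the attention term divided by *k*² for the CMT block and the small depth-wise-conv overhead ignored):

```python
def transformer_block_flops(n, d, r=4):
    # QKV + output projections (4 n d^2) + attention (2 n^2 d) + FFN (2 r n d^2)
    return (4 + 2 * r) * n * d**2 + 2 * n**2 * d

def cmt_block_flops(n, d, k=2, r=4):
    # Rough count: same projections/FFN, but K and V live on n/k^2 tokens,
    # so the attention term shrinks to 2 n^2 d / k^2. Depth-wise conv
    # overhead (roughly O(n d k^2)) is ignored here.
    return (4 + 2 * r) * n * d**2 + 2 * n**2 * d / k**2

n, d = 56 * 56, 64  # a high-resolution early stage
std = transformer_block_flops(n, d)
cmt = cmt_block_flops(n, d, k=8)
print(cmt / std)  # well below 1: the larger n is, the bigger the saving
```

For small *n* the two counts converge, since the quadratic term stops dominating; the advantage is specifically at high resolution.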

## 2.2. Scaling Strategy

Inspired by EfficientNet, a **compound coefficient *ϕ*** is used to uniformly scale the number of layers (depth), the dimension, and the input resolution.

**A constraint of *α*·*β*^(1.5)·*γ*² ≈ 2.5 is added** so that, for a given new *ϕ*, the total FLOPs will increase by approximately 2.5^*ϕ*. Empirically, ***α* = 1.2, *β* = 1.3, and *γ* = 1.15**.
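The constraint is easy to check numerically, and a scaled configuration follows the EfficientNet recipe (depth × *α*^*ϕ*, width × *β*^*ϕ*, resolution × *γ*^*ϕ*). A sketch, with the rounding choice being ours:

```python
alpha, beta, gamma = 1.2, 1.3, 1.15

# The constraint alpha * beta^1.5 * gamma^2 ~= 2.5 ties the three axes
# together so FLOPs grow by roughly 2.5^phi per unit of phi.
factor = alpha * beta**1.5 * gamma**2
print(factor)  # ~2.35, close to the 2.5 target

def scale(depth, dim, resolution, phi):
    """EfficientNet-style compound scaling (illustrative rounding)."""
    return (round(depth * alpha**phi),
            round(dim * beta**phi),
            round(resolution * gamma**phi))

print(scale(depth=16, dim=64, resolution=224, phi=1))
```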

## 2.3. CMT Variants

Based on **CMT-S**, **CMT-Ti, CMT-XS, and CMT-B** are built according to the proposed scaling strategy. The **input resolutions** are **160², 192², 224², and 256²** for CMT-Ti, CMT-XS, CMT-S, and CMT-B, respectively.

# 3. Results

## 3.1. Ablation Study

- ViT/DeiT can only generate a single-scale feature map, losing a lot of multi-scale information, which is crucial for dense prediction.

When DeiT is built with **4 stages** like CMT-S, i.e. DeiT-S-4Stage, **improvements** can be achieved.

All the **incremental improvements** show that the **stem, LPU, and IRFFN** also contribute to the improved performance.

If **all LNs are replaced by BNs**, the model **cannot converge** during training.

**Unidimensional scaling strategies** are **significantly inferior** to the **proposed compound scaling strategy**.

## 3.2. ImageNet

**CMT-S achieves 83.5% top-1 accuracy with 4.0B FLOPs**, which is **3.7% higher than the baseline model DeiT-S** and **2.0% higher than CPVT**, indicating the benefit of the CMT block for capturing both local and global information.

- Note that all previous Transformer-based models are still inferior to EfficientNet, which is obtained via a thorough architecture search. However, **CMT-S is 0.6% higher than EfficientNet-B4 with less computational cost**, which demonstrates the efficacy of the proposed hybrid structure.

## 3.3. Other Downstream Tasks

For **object detection** with RetinaNet as the basic framework, **CMT-S outperforms Twins-PCPVT-S by 1.3% mAP** and **Twins-SVT-S by 2.0% mAP**.

For **instance segmentation** with Mask R-CNN as the basic framework, **CMT-S surpasses Twins-PCPVT-S by 1.7% AP** and **Twins-SVT-S by 1.9% AP**.

**CMT-S outperforms other Transformer-based models on all datasets with fewer FLOPs**, and achieves **comparable performance to EfficientNet-B7 with 9× fewer FLOPs**, which demonstrates the superiority of the CMT architecture.