Review — CMT: Convolutional Neural Networks Meet Vision Transformers

CMT, CNN Meets Vision Transformer

Sik-Ho Tsang
5 min read · Sep 5


CMTs, better trade-off for accuracy and efficiency on (a) ImageNet, and (b) COCO

CMT: Convolutional Neural Networks Meet Vision Transformers, by University of Sydney and Huawei Noah’s Ark Lab
2022 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)


  • A new Transformer-based hybrid network is proposed that uses Transformers to capture long-range dependencies and CNNs to extract local information.
  • A family of models, CMTs, is constructed, which offers a much better trade-off between accuracy and efficiency, as shown above.


  1. CMT: Architecture
  2. CMT: Complexity, Scaling, and Variants
  3. Results

1. CMT: Architecture

(a) ResNet-50, (b) DeiT-S, (c) CMT-S

The proposed CMT block consists of a local perception unit (LPU), a lightweight multi-head self-attention (LMHSA) module, and an inverted residual feed-forward network (IRFFN).

1.1. Local Perception Unit (LPU)

  • The absolute positional encoding used in previous Transformers, initially designed to leverage the order of tokens, damages translation-invariance.

To alleviate these limitations, the LPU is proposed to extract local information using depth-wise convolution (MobileNetV1). It is defined as:

    LPU(X) = DWConv(X) + X
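In code, the LPU amounts to a depth-wise convolution whose output is added back to its input. Below is a minimal pure-Python sketch, assuming a 3×3 kernel with zero padding and a feature map stored as nested lists. The function names are illustrative, not from the paper.

```python
def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution, stride 1, zero padding.

    x:       feature map as nested lists, shape [C][H][W]
    kernels: one 3x3 kernel per channel, shape [C][3][3]
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):  # each channel is filtered independently
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:  # zero padding
                            acc += kernels[c][di + 1][dj + 1] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


def lpu(x, kernels):
    """Local perception unit sketch: depth-wise conv output plus the input."""
    y = depthwise_conv3x3(x, kernels)
    return [[[y[c][i][j] + x[c][i][j] for j in range(len(x[0][0]))]
             for i in range(len(x[0]))]
            for c in range(len(x))]


# Tiny example: 1 channel, 3x3 map, averaging kernel (a stand-in for learned weights).
x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]]
mean_kernel = [[[1.0 / 9.0] * 3 for _ in range(3)]]
out = lpu(x, mean_kernel)
print(out[0][1][1])  # centre value: 3x3 mean (5.0) plus the residual (5.0), about 10.0
```

Because the convolution only mixes neighbouring positions within each channel, the same kernel responds identically wherever a pattern appears, which is the translation-friendly behaviour that absolute positional encodings lack.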

1.2. Lightweight Multi-Head Self-Attention (LMHSA)

  • In the original self-attention module, attention is computed as:

        Attn(Q, K, V) = Softmax(QKᵀ / √dk) V

To mitigate the computation overhead, a k × k depth-wise convolution with stride k (MobileNetV1) is used to reduce the spatial size of K and V before the attention operation, i.e. K' = DWConv(K) and V' = DWConv(V), each with n/k² tokens instead of n. A relative position bias B (similar to Shaw NAACL’18) is also added to each self-attention module:

    LightAttn(Q, K, V) = Softmax(QK'ᵀ / √dk + B) V'

  • As in ViT, h such heads are used.
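The effect of the reduction is easiest to see in the shapes: Q keeps all n tokens while K and V shrink to n/k² tokens, so the attention map is n × n/k² rather than n × n. The following is a single-head sketch under simplifying assumptions: no learned Q/K/V projections, H and W divisible by k, and averaging kernels standing in for learned depth-wise weights. All function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def reduce_tokens(tokens, H, W, k, kernels):
    """Depth-wise k x k convolution with stride k over an H x W token grid.

    tokens:  H*W vectors in row-major order, each of length d
    kernels: one k x k kernel per channel, shape [d][k][k]
    Returns (H//k) * (W//k) vectors, i.e. n / k^2 tokens.
    Assumes H and W are divisible by k.
    """
    d = len(tokens[0])
    reduced = []
    for bi in range(0, H, k):
        for bj in range(0, W, k):
            reduced.append([
                sum(kernels[c][di][dj] * tokens[(bi + di) * W + (bj + dj)][c]
                    for di in range(k) for dj in range(k))
                for c in range(d)
            ])
    return reduced


def light_attention(Q, K, V, H, W, k, k_kernels, v_kernels, bias):
    """Single-head sketch of Softmax(Q K'^T / sqrt(d_k) + B) V'."""
    K_red = reduce_tokens(K, H, W, k, k_kernels)  # n/k^2 x d_k
    V_red = reduce_tokens(V, H, W, k, v_kernels)  # n/k^2 x d_v
    d_k = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k) + bias[i][j]
                  for j, key in enumerate(K_red)]
        probs = softmax(scores)
        out.append([sum(p * v[c] for p, v in zip(probs, V_red))
                    for c in range(len(V_red[0]))])
    return out


# Example: 4x4 grid (n = 16), d = 2, k = 2, so K and V shrink to 4 tokens.
H = W = 4
d, k = 2, 2
n = H * W
tokens = [[float(t), float(t % 3)] for t in range(n)]
avg = [[[1.0 / (k * k)] * k for _ in range(k)] for _ in range(d)]  # stand-in weights
zero_bias = [[0.0] * (n // (k * k)) for _ in range(n)]  # B has shape n x n/k^2
out = light_attention(tokens, tokens, tokens, H, W, k, avg, avg, zero_bias)
print(len(out), len(out[0]))                     # n rows, d_v columns
print(len(reduce_tokens(tokens, H, W, k, avg)))  # n / k^2 reduced tokens
```

The output still has one row per query token, so the module can be dropped in wherever standard self-attention is used. Only the key/value side is shortened.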

1.3. Inverted Residual Feed-forward Network (IRFFN)

  • The original FFN uses two linear layers with GELU in between:

        FFN(X) = GELU(XW1 + b1)W2 + b2

IRFFN is proposed, which consists of an expansion layer followed by a depth-wise convolution (MobileNetV1) and a projection layer. The location of the shortcut connection is also modified for better performance:

    IRFFN(X) = Conv(F(Conv(X))),  F(X) = DWConv(X) + X

  • The depth-wise convolution (MobileNetV1) is used to extract local information with negligible extra computational cost.
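The "negligible extra cost" claim can be checked with a rough multiply-accumulate count. The sketch below assumes a 3×3 depth-wise kernel, an expansion ratio r = 4, and 1×1 convolutions for the expansion and projection layers. The helper names and the token count n = 196 are illustrative only.

```python
def irffn_macs(n, d, r=4, kernel=3):
    """Rough multiply-accumulate counts for the three IRFFN layers.

    n: number of spatial positions (tokens), d: channel dimension.
    """
    expand = n * d * (r * d)                   # 1x1 conv: d -> r*d channels
    depthwise = kernel * kernel * n * (r * d)  # depth-wise conv on r*d channels
    project = n * (r * d) * d                  # 1x1 conv: r*d -> d channels
    return expand, depthwise, project


def depthwise_share(n, d, r=4, kernel=3):
    """Fraction of the IRFFN cost spent in the depth-wise convolution."""
    expand, depthwise, project = irffn_macs(n, d, r, kernel)
    return depthwise / (expand + depthwise + project)


for d in (64, 128, 256, 512):
    print(d, round(depthwise_share(n=196, d=d), 4))
```

Under these assumptions the two 1×1 layers cost 8nd² and the depth-wise layer costs 36nd, so the depth-wise share is 36 / (8d + 36): a few percent at d = 64, and below one percent at d = 512.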

1.4. CMT Block

With the aforementioned three components, the CMT block can be formulated as:

    Yi = LPU(Xi−1)
    Zi = LMHSA(LN(Yi)) + Yi
    Xi = IRFFN(LN(Zi)) + Zi

  • where Yi and Zi denote the output features of the LPU and the LMHSA module for the i-th block, respectively, and LN denotes layer normalization.
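The wiring of the block is a pre-norm residual pattern: LPU first, then LMHSA and IRFFN, each applied to a layer-normalized input and added back to it. The sketch below shows only that wiring. The three sub-modules are passed in as callables, `layer_norm` has no learned scale or shift, and the reshaping between feature maps and token sequences is left out. All of these are simplifications for illustration.

```python
import math


def layer_norm(tokens, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for vec in tokens:
        mean = sum(vec) / len(vec)
        var = sum((v - mean) ** 2 for v in vec) / len(vec)
        out.append([(v - mean) / math.sqrt(var + eps) for v in vec])
    return out


def add(a, b):
    """Element-wise sum of two token sequences of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cmt_block(x, lpu, lmhsa, irffn):
    y = lpu(x)                           # Y_i = LPU(X_{i-1})
    z = add(lmhsa(layer_norm(y)), y)     # Z_i = LMHSA(LN(Y_i)) + Y_i
    return add(irffn(layer_norm(z)), z)  # X_i = IRFFN(LN(Z_i)) + Z_i


# Stand-in sub-modules, only to exercise the wiring.
identity = lambda t: t
zeros = lambda t: [[0.0] * len(v) for v in t]

x = [[1.0, 2.0], [3.0, 4.0]]
print(cmt_block(x, identity, zeros, zeros) == x)  # residual paths pass x through unchanged
```

With the attention and feed-forward branches zeroed out, the block reduces to the LPU output, which shows that both branches are purely additive refinements on top of the shortcut path.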

2. CMT: Complexity, Scaling, and Variants

2.1. Model Complexity

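  • The computational complexity (FLOPs) of a standard Transformer block can be calculated as:

        O(MHSA) = 2nd(dk + dv) + n²(dk + dv)
        O(FFN) = 2nd²r

  • where n is the number of tokens, d is the embedding dimension, r is the expansion ratio of the FFN, and dk and dv are the dimensions of key and value, respectively.
  • ViT sets d = dk = dv and r = 4, so the cost simplifies to:

        O(Transformer block) = O(MHSA) + O(FFN) = 12nd² + 2n²d = 2nd(6d + n)

  • The FLOPs of the CMT block are:

        O(LPU) = 9nd
        O(LMHSA) = 2nd²(1 + 1/k²) + 2n²d/k²
        O(IRFFN) = 8nd² + 36nd
        O(CMT block) = O(LPU) + O(LMHSA) + O(IRFFN) = 10nd²(1 + 0.2/k²) + 2n²d/k² + 45nd

  • where k ≥ 1 is the reduction ratio in LMHSA.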

Compared to the standard Transformer block, the CMT block is cheaper to compute and can more easily process feature maps at higher resolution (larger n), because the quadratic n² term is divided by k².
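A small calculator makes the comparison concrete. The sketch assumes per-block costs of 12nd² + 2n²d for the standard Transformer block (with d = dk = dv and r = 4) and 9nd + 2nd²(1 + 1/k²) + 2n²d/k² + 8nd² + 36nd for the CMT block. The values n = 56 × 56, d = 64 and k = 8 are illustrative choices for a high-resolution early stage, not figures quoted in this review.

```python
def transformer_block_flops(n, d):
    """Standard block with d = d_k = d_v and FFN expansion ratio 4."""
    mhsa = 4 * n * d * d + 2 * n * n * d
    ffn = 8 * n * d * d
    return mhsa + ffn  # equals 2nd(6d + n)


def cmt_block_flops(n, d, k):
    """CMT block: LPU + LMHSA (K/V reduced by k^2) + IRFFN."""
    lpu = 9 * n * d
    lmhsa = 2 * n * d * d * (1 + 1 / k ** 2) + 2 * n * n * d / k ** 2
    irffn = 8 * n * d * d + 36 * n * d
    return lpu + lmhsa + irffn


n, d, k = 56 * 56, 64, 8  # illustrative values only
std = transformer_block_flops(n, d)
cmt = cmt_block_flops(n, d, k)
print(f"standard: {std / 1e9:.3f} GFLOPs, CMT: {cmt / 1e9:.3f} GFLOPs, ratio: {std / cmt:.2f}x")
```

With k = 1 the two formulas differ only by the 45nd contributed by the depth-wise convolutions, so the saving comes entirely from shrinking K and V. The gap widens as n grows, because the n² term dominates at high resolution.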

2.2. Scaling Strategy

Inspired by EfficientNet, a compound coefficient ϕ is used to uniformly scale the number of layers (depth), the dimensions, and the input resolution by factors of α^ϕ, β^ϕ, and γ^ϕ, respectively:

  • A constraint of α·β^(1.5)·γ² ≈ 2.5 is added so that for a given new ϕ, the total FLOPs will increase by approximately 2.5^ϕ.
  • α=1.2, β=1.3, and γ=1.15 are set empirically.
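The scaling rule is simple enough to evaluate directly. The sketch below plugs in the constants above and prints the depth, dimension, and resolution multipliers for a few values of ϕ. The ϕ values are arbitrary examples and the function names are illustrative.

```python
ALPHA, BETA, GAMMA = 1.2, 1.3, 1.15  # bases for depth, dimension, resolution


def scale_factors(phi):
    """Multipliers for depth, dimension and resolution at compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """FLOPs growth implied by the product alpha * beta^1.5 * gamma^2, raised to phi."""
    return (ALPHA * BETA ** 1.5 * GAMMA ** 2) ** phi


for phi in (-2, -1, 0, 1):
    depth, dim, res = scale_factors(phi)
    print(phi, round(depth, 3), round(dim, 3), round(res, 3), round(flops_multiplier(phi), 3))
```

With these constants the product α·β^(1.5)·γ² evaluates to about 2.35, so the "≈ 2.5" constraint is met only loosely and the per-step FLOPs growth is slightly below 2.5×.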

2.3. CMT Variants

CMT variants.

Based on CMT-S, CMT-Ti, CMT-XS and CMT-B are built according to the proposed scaling strategy. The input resolutions are 160², 192², 224², and 256² for CMT-Ti, CMT-XS, CMT-S, and CMT-B, respectively.
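As a sanity check, these resolutions can be reproduced from the resolution factor γ = 1.15 under two assumptions that are mine rather than stated here: that the four models correspond to ϕ = −2, −1, 0, 1 around the CMT-S base of 224, and that the scaled value is rounded to the nearest multiple of 32.

```python
def scaled_resolution(phi, base=224, gamma=1.15, multiple=32):
    """Scale the base resolution by gamma**phi, then snap to a multiple (assumed rounding rule)."""
    raw = base * gamma ** phi
    return int(round(raw / multiple)) * multiple


# Assumed phi per variant; hypothetical mapping for illustration.
variants = {"CMT-Ti": -2, "CMT-XS": -1, "CMT-S": 0, "CMT-B": 1}
resolutions = {name: scaled_resolution(phi) for name, phi in variants.items()}
print(resolutions)  # {'CMT-Ti': 160, 'CMT-XS': 192, 'CMT-S': 224, 'CMT-B': 256}
```

The match suggests the listed resolutions are consistent with the compound scaling rule, though the exact ϕ values and rounding used by the authors may differ.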

3. Results

3.1. Ablation Study

Stage-wise architecture.
  • ViT/DeiT can only generate a single-scale feature map, losing much of the multi-scale information that is crucial for dense prediction.

When DeiT is given 4 stages like CMT-S, i.e. DeiT-S-4Stage, improvements are achieved.

CMT block.

The incremental improvements show that the stem, LPU, and IRFFN each contribute to the improved performance.

  • CMT maintains the LN before LMHSA and IRFFN, and inserts BN after the convolutional layer.

If all LNs are replaced by BNs, the model cannot converge during training.

Scaling strategy.

Unidimensional scaling strategies are significantly inferior to the proposed compound scaling strategy.

3.2. ImageNet

ImageNet Results of CMT.

CMT-S achieves 83.5% top-1 accuracy with 4.0B FLOPs, which is 3.7% higher than the baseline model DeiT-S and 2.0% higher than CPVT, indicating the benefit of the CMT block for capturing both local and global information.

  • Note that all previous Transformer-based models are still inferior to EfficientNet, which is obtained via a thorough architecture search. However, CMT-S is 0.6% higher than EfficientNet-B4 at a lower computational cost, which demonstrates the efficacy of the proposed hybrid structure.

3.3. Other Downstream Tasks

Object detection results on COCO val2017.

For object detection with RetinaNet as the basic framework, CMT-S outperforms Twins-PCPVT-S by 1.3% mAP and Twins-SVT-S by 2.0% mAP.

Instance segmentation results on COCO val2017.

For instance segmentation with Mask R-CNN as the basic framework, CMT-S surpasses Twins-PCPVT-S by 1.7% AP and Twins-SVT-S by 1.9% AP.

Transfer Learning Results.

CMT-S outperforms other Transformer-based models on all datasets with fewer FLOPs, and achieves performance comparable to EfficientNet-B7 with 9× fewer FLOPs, which demonstrates the superiority of the CMT architecture.


