Review — CMT: Convolutional Neural Networks Meet Vision Transformers

CMT, CNN Meets Vision Transformer

5 min read · Sep 5, 2023

**CMTs, better trade-off for accuracy and efficiency on (a) ImageNet, and (b) COCO**

CMT: Convolutional Neural Networks Meet Vision Transformers
CMT, by University of Sydney, Huawei Noah’s Ark Lab
2022 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2023 [Vision Permutator (ViP)] [ConvMixer] [CrossFormer++]

A new Transformer-based hybrid network is proposed, which uses Transformers to capture long-range dependencies and CNNs to extract local information.
A family of models, CMTs, is constructed, which achieves a much better trade-off between accuracy and efficiency, as shown above.

Outline

CMT: Architecture
CMT: Complexity, Scaling, and Variants
Results

1. CMT: Architecture

**(a)** **ResNet-50, (b)** **DeiT-S, (c) CMT-S**

The proposed CMT block consists of a local perception unit (LPU), a lightweight multi-head self-attention (LMHSA) module, and an inverted residual feed-forward network (IRFFN).

1.1. Local Perception Unit (LPU)

The absolute positional encoding used in previous Transformers, initially designed to leverage the order of tokens, damages translation-invariance.

To alleviate this limitation, LPU is proposed to extract local information using a depth-wise convolution (MobileNetV1) with a residual connection, which is defined as:

LPU(X) = DWConv(X) + X
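As a rough illustration of the LPU, the sketch below applies a depth-wise convolution and adds the input back, i.e. LPU(X) = DWConv(X) + X. The 3 × 3 kernel size, the zero padding, the (H, W, C) layout, and the helper names `depthwise_conv3x3` and `lpu` are assumptions made for this example, not details stated in this review.

```python
import numpy as np

def depthwise_conv3x3(x, w):
    """Depth-wise 3x3 convolution with zero padding.

    x: (H, W, C) feature map; w: (3, 3, C), one 3x3 filter per channel.
    """
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding keeps H x W
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            # Each channel is filtered independently (no channel mixing).
            out += xp[i:i + h, j:j + wd, :] * w[i, j, :]
    return out

def lpu(x, w):
    """Local perception unit sketch: depth-wise conv plus identity shortcut."""
    return depthwise_conv3x3(x, w) + x
```

Because each channel is filtered independently, the cost grows linearly with the number of channels, which is why the unit is cheap compared with a dense convolution.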

1.2. Lightweight Multi-Head Self-Attention (LMHSA)

In the original self-attention module, the attention over queries Q, keys K, and values V is:

Attn(Q, K, V) = Softmax(QKᵀ / √dk) V

To mitigate the computation overhead, a k × k depth-wise convolution with stride k (MobileNetV1) is used to reduce the spatial size of K and V before the attention operation, i.e. K′ = DWConv(K) and V′ = DWConv(V). Also, a relative position bias B (similar to Shaw NAACL’18) is added to each self-attention module:

LightAttn(Q, K, V) = Softmax(QK′ᵀ / √dk + B) V′

As in ViT, h such heads are used.
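A minimal single-head sketch of the lightweight attention described above: K and V are shrunk by a k × k depth-wise convolution with stride k, then standard scaled dot-product attention is applied with an additive bias B. The function names, the (H, W, C) layout, the bias shape (n, n/k²), and the assumption that H and W are divisible by k are all illustrative choices, not taken from the paper's code.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def reduce_tokens(x, w, k):
    """k x k depth-wise conv with stride k (non-overlapping windows).

    x: (H, W, C); w: (k, k, C). Returns (H/k * W/k, C).
    Assumes H and W are divisible by k.
    """
    h, wd, c = x.shape
    patches = x.reshape(h // k, k, wd // k, k, c)
    out = np.einsum("iajbc,abc->ijc", patches, w)
    return out.reshape(-1, c)

def light_attention(q, k_map, v_map, w_k, w_v, bias, k):
    """q: (n, d) with n = H*W; k_map, v_map: (H, W, d); bias: (n, n / k^2)."""
    k_red = reduce_tokens(k_map, w_k, k)  # K' with n / k^2 tokens
    v_red = reduce_tokens(v_map, w_v, k)  # V' with n / k^2 tokens
    d_k = q.shape[-1]
    scores = q @ k_red.T / np.sqrt(d_k) + bias
    return softmax(scores) @ v_red
```

The attention matrix here has shape (n, n/k²) instead of (n, n), which is where the saving comes from.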

1.3. Inverted Residual Feed-forward Network (IRFFN)

The original FFN uses two linear layers with GELU in between:

FFN(X) = GELU(XW1 + b1)W2 + b2

IRFFN is proposed, which consists of an expansion layer followed by a depth-wise convolution (MobileNetV1) and a projection layer. The location of the shortcut connection is also modified for better performance (activation and normalization layers are omitted here):

IRFFN(X) = Conv(F(Conv(X))), where F(X) = DWConv(X) + X

The depth-wise convolution (MobileNetV1) is used to extract local information with negligible extra computational cost.
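The IRFFN structure can be sketched as below: a 1 × 1 expansion, a depth-wise convolution with a shortcut around it, and a 1 × 1 projection. This is a simplified sketch under stated assumptions: a single GELU after the expansion, no batch normalization, no biases, a 3 × 3 depth-wise kernel, and hypothetical names (`irffn`, `w_expand`, `w_dw`, `w_project`).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv3x3(x, w):
    """x: (H, W, C); w: (3, 3, C). Zero-padded depth-wise convolution."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd, :] * w[i, j, :]
    return out

def irffn(x, w_expand, w_dw, w_project):
    """x: (H, W, d); w_expand: (d, r*d); w_dw: (3, 3, r*d); w_project: (r*d, d)."""
    h = gelu(x @ w_expand)              # 1x1 conv == per-pixel linear layer
    h = depthwise_conv3x3(h, w_dw) + h  # F(X) = DWConv(X) + X
    return h @ w_project                # 1x1 projection back to d channels
```

With the depth-wise weights set to zero, this reduces to a bias-free two-layer FFN, which makes the relation to the original FFN explicit.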

1.4. CMT Block

With the aforementioned three components, the CMT block can be formulated as:

Yi = LPU(Xi−1)
Zi = LMHSA(LN(Yi)) + Yi
Xi = IRFFN(LN(Zi)) + Zi

where Yi and Zi denote the output features of LPU and LMHSA module for the i-th block, respectively. LN denotes the layer normalization.
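The wiring of one block, with pre-normalization and residual connections around LMHSA and IRFFN, can be sketched as follows. The three sub-modules are passed in as plain callables, and `layer_norm` omits the learnable scale and shift; both are simplifications assumed for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last (channel) axis; no learnable parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cmt_block(x_prev, lpu, lmhsa, irffn):
    """One CMT block given callables for its three components."""
    y = lpu(x_prev)                  # Y_i = LPU(X_{i-1})
    z = lmhsa(layer_norm(y)) + y     # Z_i = LMHSA(LN(Y_i)) + Y_i
    x = irffn(layer_norm(z)) + z     # X_i = IRFFN(LN(Z_i)) + Z_i
    return x
```

Passing the components as callables keeps the sketch independent of any particular attention or convolution implementation.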

2. CMT: Complexity, Scaling, and Variants

2.1. Model Complexity

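The computational complexity (FLOPs) of a standard Transformer block, for an input of n tokens with dimension d, can be calculated as:

O(MHSA) = 2nd(dk + dv) + n²(dk + dv)
O(FFN) = 2nd²r

where r is the expansion ratio of the FFN, and dk and dv are the dimensions of key and value, respectively.
ViT sets d = dk = dv and r = 4, so the cost can be simplified as:

O(Transformer block) = O(MHSA) + O(FFN) = 2nd(6d + n)

The FLOPs of the CMT block:

O(LPU) = 9nd
O(LMHSA) = 2nd²(1 + 1/k²) + 2n²d/k²
O(IRFFN) = 8nd² + 36nd
O(CMT block) = 10nd²(1 + 0.2/k²) + 2n²d/k² + 45nd

where k ≥ 1 is the reduction ratio in LMHSA.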

Compared to the standard Transformer block, the CMT block is cheaper to compute and handles higher-resolution feature maps (larger n) more easily, since the term quadratic in n shrinks by a factor of k².
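To get a feel for the comparison, the sketch below evaluates both cost models for a few token counts. The formulas are assumptions based on my reading of the paper: 2nd(dk + dv) + n²(dk + dv) + 2nd²r for the Transformer block with d = dk = dv, and 9nd for LPU, 2nd²(1 + 1/k²) + 2n²d/k² for LMHSA, and 2rnd² + 9rnd for IRFFN with 3 × 3 depth-wise kernels. The function names and the example values of n, d, and k are hypothetical.

```python
def transformer_block_flops(n, d, r=4):
    """MHSA + FFN cost, assuming d_k = d_v = d."""
    mhsa = 2 * n * d * (d + d) + n * n * (d + d)
    ffn = 2 * n * d * d * r
    return mhsa + ffn

def cmt_block_flops(n, d, k, r=4):
    """LPU + LMHSA + IRFFN cost, assuming 3x3 depth-wise kernels."""
    lpu = 9 * n * d
    lmhsa = 2 * n * d * d * (1 + 1 / k**2) + 2 * n * n * d / k**2
    irffn = 2 * r * n * d * d + 9 * r * n * d
    return lpu + lmhsa + irffn

if __name__ == "__main__":
    d = 64  # example channel dimension (hypothetical)
    for side in (14, 28, 56):
        n = side * side  # number of tokens for a side x side feature map
        ratio = cmt_block_flops(n, d, k=2) / transformer_block_flops(n, d)
        print(f"n={n:5d}  CMT/Transformer FLOPs ratio: {ratio:.2f}")
```

With k = 1 the CMT cost equals the Transformer cost plus the 45nd spent on depth-wise convolutions, so any saving comes from choosing k > 1.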

2.2. Scaling Strategy

Inspired by EfficientNet, a compound coefficient ϕ is used to uniformly scale the number of layers (depth), dimensions, and input resolution by α^ϕ, β^ϕ, and γ^ϕ, respectively, where α, β, γ ≥ 1.

A constraint of α·β^(1.5)·γ² ≈ 2.5 is added so that for a given new ϕ, the total FLOPs will approximately increase by 2.5^ϕ.
α = 1.2, β = 1.3, and γ = 1.15 are set empirically.
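The compound scaling rule can be explored numerically with a small sketch. It assumes depth, dimension, and resolution scale as α^ϕ, β^ϕ, and γ^ϕ, uses the empirical values quoted above, and takes 224 as the base resolution; the ϕ values in the loop are hypothetical and chosen only for illustration.

```python
ALPHA, BETA, GAMMA = 1.2, 1.3, 1.15  # empirical values quoted in the text

def scale_factors(phi):
    """Multipliers for (depth, dimension, resolution) at coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """Approximate FLOPs growth implied by the constraint term."""
    return (ALPHA * BETA ** 1.5 * GAMMA ** 2) ** phi

if __name__ == "__main__":
    base_resolution = 224  # assumed base model resolution
    for phi in (-2, -1, 0, 1):  # hypothetical coefficients
        depth, dim, res = scale_factors(phi)
        print(f"phi={phi:+d}  depth x{depth:.2f}  dim x{dim:.2f}  "
              f"resolution ~{base_resolution * res:.0f}  "
              f"FLOPs x{flops_multiplier(phi):.2f}")
```

Scaling all three factors with a single coefficient keeps the model family on one curve instead of tuning depth, width, and resolution separately.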

2.3. CMT Variants

Based on CMT-S, CMT-Ti, CMT-XS and CMT-B are built according to the proposed scaling strategy. The input resolutions are 160², 192², 224², and 256² for CMT-Ti, CMT-XS, CMT-S, and CMT-B, respectively.

3. Results

3.1. Ablation Study

ViT/DeiT can only generate a single-scale feature map, losing a lot of multi-scale information, which is crucial for dense prediction.

When DeiT is given 4 stages like CMT-S (i.e., DeiT-S-4Stage), accuracy improves.

The incremental improvements show that the stem, LPU, and IRFFN each contribute to the improved performance.

CMT maintains the LN before LMHSA and IRFFN, and inserts BN after the convolutional layer.

If all LNs are replaced by BNs, the model cannot converge during training.

Unidimensional scaling strategies are significantly inferior to the proposed compound scaling strategy.

3.2. ImageNet

CMT-S achieves 83.5% top-1 accuracy with 4.0B FLOPs, which is 3.7% higher than the baseline model DeiT-S and 2.0% higher than CPVT, indicating the benefit of the CMT block for capturing both local and global information.

Note that all previous Transformer-based models are still inferior to EfficientNet, which is obtained via a thorough architecture search. However, CMT-S is 0.6% higher than EfficientNet-B4 with less computational cost, which demonstrates the efficacy of the proposed hybrid structure.

3.3. Other Downstream Tasks

**Object detection results on COCO val2017.**

For object detection with RetinaNet as the basic framework, CMT-S outperforms Twins-PCPVT-S by 1.3% mAP and Twins-SVT-S by 2.0% mAP.

**Instance segmentation results on COCO val2017.**

For instance segmentation with Mask R-CNN as the basic framework, CMT-S surpasses Twins-PCPVT-S by 1.7% AP and Twins-SVT-S by 1.9% AP.

CMT-S outperforms other Transformer-based models on all datasets with fewer FLOPs, and achieves comparable performance against EfficientNet-B7 with 9× fewer FLOPs, which demonstrates the superiority of the CMT architecture.

Written by Sik-Ho Tsang
