Brief Review — MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

MobileViTv3: 4 Changes Made on MobileViTv1

Sik-Ho Tsang
Comparing Top-1 accuracies of MobileViTv3, ViT variants and hybrid models on ImageNet-1K dataset

MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features
MobileViTv3, by Micron Technology Inc.
2022 arXiv v2, Over 70 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2]
==== My Other Paper Readings Are Also Over Here ====

  • The fusion block inside the MobileViTv1 block creates scaling challenges and poses a complex learning task.
  • In this paper, the authors propose simple and effective changes to the fusion block, creating the MobileViTv3 block, which addresses the scaling challenge and simplifies the learning task.
  • (Unfortunately, the paper was rejected at ICLR 2023, according to OpenReview.)

Outline

  1. MobileViTv3
  2. Results

1. MobileViTv3

  • 4 changes are made in MobileViTv3 with respect to MobileViTv1.
  • 3 changes are made in MobileViTv3 with respect to MobileViTv2.

1.1. Replacing 3x3 convolutional layer with 1x1 convolutional layer in fusion block

  • First, the 1x1 convolution fuses local and global features independently of the other locations in the feature map, which simplifies the fusion block's learning task.
  • Second, it removes one of the major constraints on scaling the MobileViTv1 architecture. MobileViTv1 is scaled from XXS to S by changing the width of the network while keeping the depth constant. With a 3x3 convolution in the fusion block, changing the width (the number of input and output channels) causes a large increase in the number of parameters and FLOPs; a 1x1 convolution avoids this, as the sketch below illustrates.
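As a rough illustration of the second point, the parameter saving of a 1x1 fusion convolution can be computed directly. The channel width below is a hypothetical example, not a value from the paper:

```python
# Hypothetical fusion block: two branches of C channels each are
# concatenated to 2C channels, then projected back to C channels.
C = 96  # example channel width, not a value from the paper

params_3x3 = 3 * 3 * (2 * C) * C  # 3x3 fusion conv (MobileViTv1-style)
params_1x1 = 1 * 1 * (2 * C) * C  # 1x1 fusion conv (MobileViTv3-style)

print(params_3x3, params_1x1)   # 165888 vs 18432
print(params_3x3 / params_1x1)  # 9.0: a constant 9x ratio, so the absolute
                                # saving grows quadratically as width scales
```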

1.2. Local and Global features fusion

In the fusion layer of the MobileViTv3 block, the features from the local and global representation blocks are concatenated, instead of the input and global representation features.

  • This is because the local representation features are more closely related to the global representation features than the input features are.

1.3. Fusing input features

  • The input features are added to the output of the 1x1 convolutional layer in the fusion block.

Adding the input features to the fusion block's output introduces a residual connection into the new MobileViTv3 architecture.
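Putting the first three changes together, here is a minimal PyTorch sketch of the MobileViTv3 fusion step, assuming for simplicity that the input, local and global feature maps all share the same channel count (the class and tensor names are illustrative, not from the official code):

```python
import torch
import torch.nn as nn

class FusionBlockV3(nn.Module):
    """Minimal sketch of the MobileViTv3 fusion step (not the official code)."""
    def __init__(self, channels: int):
        super().__init__()
        # Change 1: a 1x1 conv (instead of 3x3) fuses features per location.
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x_input, x_local, x_global):
        # Change 2: concatenate local and global features (not input + global).
        fused = self.proj(torch.cat([x_local, x_global], dim=1))
        # Change 3: add the input features as a residual connection.
        return fused + x_input

# Usage with dummy feature maps of matching shape
x = torch.randn(1, 64, 32, 32)
block = FusionBlockV3(64)
out = block(x, x, x)  # placeholder tensors; shapes must match
print(out.shape)      # torch.Size([1, 64, 32, 32])
```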

1.4. Depthwise convolutional layer in local representation block

To further reduce the number of parameters, the 3x3 convolutional layer in the local representation block is replaced with a depthwise 3x3 convolutional layer.
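A depthwise 3x3 convolution applies a single 3x3 filter per channel (groups=channels in PyTorch), so its parameter count grows linearly rather than quadratically with width. A quick check, with an arbitrary example width:

```python
import torch.nn as nn

C = 64  # arbitrary example width, not a value from the paper

standard = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False)

n_std = sum(p.numel() for p in standard.parameters())   # 9 * C * C = 36864
n_dw = sum(p.numel() for p in depthwise.parameters())   # 9 * C     = 576
print(n_std, n_dw)  # 36864 576
```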

1.5. Scaling Up Building Blocks


The MobileViTv3-S, XS and XXS architectures are obtained by scaling the width of the network.
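As a rough sketch of width scaling, channel counts are multiplied while the depth stays unchanged (the per-stage channel list and multipliers below are purely hypothetical, not the paper's values):

```python
# Hypothetical per-stage channel widths (illustrative only)
base_channels = [16, 32, 64, 96, 128]

def scale_width(channels, multiplier):
    """Width scaling: multiply channel counts; network depth stays unchanged."""
    return [int(c * multiplier) for c in channels]

print(scale_width(base_channels, 0.5))  # a narrower (XXS-like) variant
print(scale_width(base_channels, 1.0))  # the base (S-like) variant
```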

2. Results

2.1. MobileViT Comparisons

  • MobileViTv3-S, XS and XXS models trained with basic data augmentation not only outperform MobileViTv1-S, XS and XXS, but also surpass MobileViTv2-1.0, 0.75 and 0.5, which are trained with advanced data augmentation.

2.2. ViT Comparisons

  • Models under 2 million parameters: to the best of the authors’ knowledge, only MobileViT variants exist in this range. MobileViTv3-XXS and MobileViTv3-0.5 outperform the other MobileViT variants, and MobileViTv3-0.5 achieves the best accuracy of 72.33% by a clear margin.
  • Models between 2 and 4 million parameters: MobileViTv3-XS and MobileViTv3-0.75 outperform all models in this range.
  • Models between 4 and 8 million parameters: MobileViTv3-S attains the highest accuracy of 79.3% in this range.

2.3. CNN Comparisons

  • Models in the 1–2 million parameter range: MobileViTv3-0.5 and MobileViTv3-XXS, with 72.33% and 70.98% respectively, achieve the best accuracies in this range.
  • Models with 2–4 million parameters: MobileViTv3-XS achieves an improvement of over 4% compared to MobileNetv3-Large(0.75), ShuffleNetv2(1.5), ESPNetv2-284M and MobileNetv2(0.75).
  • Models with 4–8 million parameters: MobileViTv3-S shows a gain of more than 2% in accuracy over EfficientNet-B0, ESPNetv2-602M and MobileNetv3-Large(1.25).

2.4. Semantic Segmentation


On PASCAL VOC, MobileViTv3 models trained with a lower batch size of 48 outperform their corresponding MobileViTv1 and MobileViTv2 counterparts, which are trained with a higher batch size of 128.

On ADE20K, the MobileViTv3-1.0, 0.75 and 0.5 models outperform the MobileViTv2-1.0, 0.75 and 0.5 models by 2.07%, 1.73% and 1.64%, respectively.

2.5. Object Detection

Compared with light-weight CNNs on MS COCO (Table 4a), MobileViTv3-XS outperforms MobileViTv1-XS by 0.8% mAP and MNASNet by 2.6% mAP.

Compared with heavy-weight CNNs (Table 4b), MobileViTv3-XS and MobileViTv3-1.0 surpass MobileViTv1-XS and MobileViTv2-1.0 by 0.8% and 0.5% mAP, respectively.
