Review — MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

MobileViT, Combines Convolutions and Transformers

Sik-Ho Tsang
6 min read · Feb 15


MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer,
MobileViT, by Apple
2022 ICLR, Over 180 Citations (Sik-Ho Tsang @ Medium)
Image Classification, ViT, Transformer

  • Unlike CNNs, ViTs are heavyweight. In this paper, the authors ask the following question:
  • Is it possible to combine the strengths of CNNs and ViTs to build a light-weight, low-latency network for mobile vision tasks?
  • To this end, MobileViT, a light-weight and general-purpose ViT for mobile devices, is introduced. MobileViT presents a different perspective for the global processing of information with Transformers.
  • To the best of the authors’ knowledge, this is the first work to show that light-weight ViTs can achieve light-weight CNN-level performance with simple training recipes across different mobile vision tasks.


  1. MobileViT Block
  2. MobileViT Model Architecture
  3. Results

1. MobileViT Block

1.1. Standard ViT

  • A standard ViT model reshapes the input X of size H×W×C into a sequence of flattened patches Xf of size N×PC, projects it into a fixed d-dimensional space Xp of size N×d, and then learns inter-patch representations using a stack of L Transformer blocks.
  • The computational cost of self-attention in ViTs is O(N²d).
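To make these shapes concrete, here is a minimal sketch of the bookkeeping for a standard ViT patch embedding. The function name and the example sizes (a 224×224×3 image, 16×16 patches, d=192) are illustrative assumptions, and P is taken to be the number of pixels per patch, as in Section 1.2. The cost value is only the N²d term with constants dropped, not a FLOP count.

```python
def vit_token_shapes(H, W, C, patch_h, patch_w, d):
    """Return shape bookkeeping for a standard ViT patch embedding."""
    if H % patch_h or W % patch_w:
        raise ValueError("image size must be divisible by patch size")
    P = patch_h * patch_w          # pixels per patch
    N = (H * W) // P               # number of patches (tokens)
    return {
        "Xf": (N, P * C),          # flattened patches
        "Xp": (N, d),              # after the linear projection
        "attn_cost": N * N * d,    # O(N^2 d) term, constants dropped
    }


# Example sizes are assumptions chosen for illustration only.
shapes = vit_token_shapes(H=224, W=224, C=3, patch_h=16, patch_w=16, d=192)
print(shapes)
```

Because the cost grows with N², halving the patch side (which quadruples N) would multiply this term by 16.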

1.2. MobileViT Block

MobileViT Block
  • For a given input X of size H×W×C, the MobileViT block applies an n×n standard convolutional layer followed by a point-wise (or 1×1) convolutional layer to produce XL of size H×W×d. The n×n convolutional layer encodes local spatial information, while the point-wise convolution projects the tensor to a high-dimensional space (d-dimensional, where d>C).
  • XL is unfolded into N non-overlapping flattened patches XU of size P×N×d, where P=wh, N=HW/P.
  • For each p ∈ {1, …, P}, inter-patch relationships are encoded by applying Transformers to obtain XG of size P×N×d as: XG(p) = Transformer(XU(p)), 1 ≤ p ≤ P.
  • Because MobileViT loses neither the patch order nor the spatial order of pixels within each patch, XG can be folded to obtain XF of size H×W×d. XF is then projected to a low C-dimensional space using a point-wise convolution and combined with X via a concatenation operation.
Every pixel sees every other pixel in the MobileViT block. In this example, the red pixel attends to blue pixels. Because blue pixels have already encoded information about the neighboring pixels using convolutions, this allows the red pixel to encode information from all pixels in an image. Here, each cell in black and gray grids represents a patch and a pixel, respectively.

Because XU(p) encodes local information from an n×n region using convolutions and XG(p) encodes global information across P patches for the p-th location, each pixel in XG can encode information from all pixels in X. Thus, the overall effective receptive field of MobileViT is H×W.
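The unfold and fold steps can be sketched in plain Python on a toy grid. This is only an illustration of the index bookkeeping, not the authors' implementation: each grid entry stands in for a d-dimensional pixel vector, the Transformer between the two steps is omitted, and the function names are hypothetical.

```python
def unfold(X, h, w):
    """Split an H x W grid into non-overlapping patches of h x w pixels.

    Returns XU with XU[p][n] = pixel at in-patch position p of patch n,
    so the result has P = h*w rows and N = H*W/P columns.
    """
    H, W = len(X), len(X[0])
    if H % h or W % w:
        raise ValueError("H and W must be divisible by h and w")
    P, cols = h * w, W // w
    N = (H // h) * cols
    XU = [[None] * N for _ in range(P)]
    for y in range(H):
        for x in range(W):
            p = (y % h) * w + (x % w)        # position inside the patch
            n = (y // h) * cols + (x // w)   # index of the patch
            XU[p][n] = X[y][x]
    return XU


def fold(XG, H, W, h, w):
    """Inverse of unfold: rebuild the H x W grid from a P x N layout."""
    cols = W // w
    return [[XG[(y % h) * w + (x % w)][(y // h) * cols + (x // w)]
             for x in range(W)] for y in range(H)]


# 4x4 toy feature map with 2x2 patches, so P = 4 and N = 4.
X = [[y * 4 + x for x in range(4)] for y in range(4)]
XU = unfold(X, h=2, w=2)
print(XU[0])  # top-left pixel of every patch: one sequence a Transformer would see
assert fold(XU, 4, 4, 2, 2) == X  # patch order and pixel order both survive
```

Each row XU[p] is a sequence of N pixels, one per patch, all at the same in-patch position. Attention over that row is what lets a pixel reach every other patch, while the preceding convolution has already mixed in its n×n neighbourhood.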

  • The MobileViT block can be viewed as "Transformers as convolutions" and can be used out-of-the-box, which allows MobileViT to run on different devices without any extra effort.

2. MobileViT Model Architecture

2.1. Model Architecture

MobileViT Model Architecture
  • Unlike earlier ViT-based models, which tend to be deep and wide, MobileViT uses convolutions and Transformers in such a way that the resulting MobileViT block has convolution-like properties while simultaneously allowing for global processing. This modeling capability makes it possible to design shallow and narrow MobileViT models, which in turn are light-weight.

Compared to the ViT-based model DeiT, which uses L=12 and d=192, the MobileViT model uses L={2, 4, 3} and d={96, 120, 144} at spatial levels 32×32, 16×16, and 8×8, respectively. The resulting MobileViT network is faster (1.85×), smaller (2×), and better (+1.8%) than the DeiT network.

  • The self-attention cost of MobileViT is O(N²Pd), which in theory is higher than the O(N²d) of ViT. Yet MobileViT is shallower and narrower, which makes it more efficient than ViT in practice.
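A rough back-of-the-envelope calculation shows how a shallower, narrower network can offset the extra factor of P. The L and d values are the ones quoted above, with h=w=2 so that P=4. The DeiT token count of 196 (224×224 input, 16×16 patches) and the choice to simply multiply the per-layer term by depth are simplifying assumptions. Only the attention term is counted, so this is a proxy and not a FLOP estimate.

```python
def attention_cost(N, d, P=1, layers=1):
    """Attention cost proxy: layers * N^2 * P * d (constants dropped)."""
    return layers * N * N * P * d


# MobileViT stages as (feature-map side, L, d); h = w = 2 gives P = 4.
stages = [(32, 2, 96), (16, 4, 120), (8, 3, 144)]
P = 2 * 2
mobilevit = 0
for side, L, d in stages:
    N = (side * side) // P            # patches per feature map
    mobilevit += attention_cost(N, d, P=P, layers=L)

# DeiT-style ViT: 196 tokens is an assumed value (224x224 input, 16x16 patches).
deit = attention_cost(N=196, d=192, layers=12)

print("MobileViT proxy:", mobilevit)
print("DeiT proxy:     ", deit)
```

Under these assumptions the MobileViT total comes out below the DeiT total, and most of it sits in the 32×32 stage, where N is largest.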
MobileViT shows similar generalization capabilities as CNNs.
  • Three different network sizes (S: small, XS: extra small, and XXS: extra extra small) are designed.
  • Swish activation is used. Following CNN models, n=3.
  • The spatial dimensions of feature maps are usually multiples of 2 and h, w ≤ n. Therefore, h=w=2 is used at all spatial levels.
  • The MV2 blocks in MobileViT network are mainly responsible for down-sampling.
  • MobileViT does not require any positional embeddings.

2.2. Multi-Scale Sampler for Training Efficiency

Multi-scale vs. standard sampler.
  • Variably-sized batches are used on different GPUs. A multi-scale sampler with S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)} is used. (Please refer to CVNets for more details.)
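One way to picture the variably-sized batches is the sketch below. It assumes the batch size is scaled inversely with image area relative to the largest resolution, so that each iteration processes a similar number of pixels. The base batch size of 32, the floor rounding, and the function names are assumptions for illustration; the actual sampler lives in CVNets.

```python
import random

RESOLUTIONS = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]


def batch_size_for(resolution, base_batch, resolutions=RESOLUTIONS):
    """Scale the batch size so each iteration sees a similar pixel count.

    base_batch is the batch size at the largest resolution (assumed value).
    """
    H_max, W_max = max(resolutions, key=lambda r: r[0] * r[1])
    H, W = resolution
    return max(1, (H_max * W_max * base_batch) // (H * W))


def sample_batch_spec(rng, base_batch):
    """Pick a resolution at random and return (H, W, batch_size)."""
    H, W = rng.choice(RESOLUTIONS)
    return H, W, batch_size_for((H, W), base_batch)


rng = random.Random(0)  # seeded only to make the demo repeatable
for _ in range(3):
    print(sample_batch_spec(rng, base_batch=32))
```

Smaller images get larger batches, which is where the training-efficiency gain would come from under this assumption.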

3. Results

3.1. Image Classification

MobileViT vs. CNNs on ImageNet-1k validation set.

(a) & (b): MobileViT outperforms light-weight CNNs across different network sizes.

(c): MobileViT delivers better performance than heavyweight CNNs.

MobileViT vs. ViTs on ImageNet-1k validation set.

Unlike ViT variants that benefit significantly from advanced augmentation, MobileViT achieves better performance with fewer parameters and basic augmentation.

3.2. Object Detection

Object detection results of SSDLite-MobileViT-S on the MS-COCO validation set.
Detection w/ SSDLite on MS COCO.

(a): For the same input resolution of 320×320, SSDLite with MobileViT outperforms SSDLite with other light-weight CNN models.

(b): Further, SSDLite with MobileViT outperforms standard SSD-300 with heavy-weight backbones while learning significantly fewer parameters.

3.3. Semantic Segmentation

Semantic segmentation results of DeepLabv3-MobileViT-S model on the unseen MS COCO validation set
Segmentation w/ DeepLabv3 on PASCAL VOC 2012.

DeepLabv3 with MobileViT is smaller and better. MobileViT delivers performance competitive with the ResNet-101 model while requiring 9× fewer parameters.

3.4. Inference Time

Inference time of MobileViT models on different tasks. Here, dots in the green region indicate models that run in real time (inference time < 33 ms).
  • Pre-trained full-precision MobileViT models are converted to CoreML using publicly available CoreMLTools (2021). Their inference time is then measured (average over 100 iterations) on a mobile device, i.e., iPhone 12.
  • The inference time of MobileViT networks is measured with two patch-size settings (Config-A: 2, 2, 2 and Config-B: 8, 4, 2) on three different tasks. Here, p1, p2, p3 in Config-X denote the height h (width w=h) of a patch at an output stride of 8, 16, and 32, respectively.
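The measurement protocol (mean latency over 100 iterations, compared against the 33 ms real-time threshold) can be sketched with a generic timing harness. This is not the CoreML pipeline used in the paper: `dummy_model` is a hypothetical stand-in for a real forward pass, and the warm-up runs are an added assumption.

```python
import time

REAL_TIME_BUDGET_MS = 33.0  # threshold quoted in the figure caption


def mean_latency_ms(run_once, iterations=100, warmup=5):
    """Average wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):          # warm-up runs (an added assumption)
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / iterations


def dummy_model():
    """Hypothetical stand-in for a real model's forward pass."""
    return sum(i * i for i in range(1000))


latency = mean_latency_ms(dummy_model)
print(f"{latency:.3f} ms, real-time: {latency < REAL_TIME_BUDGET_MS}")
```

The 33 ms budget corresponds to roughly 30 frames per second.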

The models with smaller patch sizes (Config-A) are more accurate than those with larger patches (Config-B), while Config-B models are faster than Config-A models.
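To see why larger patches could be faster, it helps to count the tokens each Transformer pass attends over. The sketch below assumes a 256×256 input, which gives the 32×32, 16×16, and 8×8 feature maps at output strides 8, 16, and 32. The function name is hypothetical.

```python
def tokens_per_stage(input_size, patch_sizes, strides=(8, 16, 32)):
    """Return (feature_side, P, N) for each output stride."""
    out = []
    for stride, p in zip(strides, patch_sizes):
        side = input_size // stride
        if side % p:
            raise ValueError(f"feature map {side} not divisible by patch {p}")
        P = p * p                    # pixels per patch (h = w = p)
        N = (side // p) ** 2         # patches per feature map
        out.append((side, P, N))
    return out


# The 256x256 input size is an assumption consistent with the spatial levels above.
config_a = tokens_per_stage(256, (2, 2, 2))
config_b = tokens_per_stage(256, (8, 4, 2))
print("Config-A:", config_a)
print("Config-B:", config_b)
```

With Config-B, every stage attends over only 16 patches, versus 256 at the first stage of Config-A. Since the attention cost scales as N²P, the much smaller N plausibly accounts for the speed-up. The trade-off is that each pixel attends to fewer positions, which may relate to the drop in accuracy.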

ViTs are slower than CNNs. +: Results with multi-scale sampler
  • MobileViT and other ViT-based networks (e.g., DeiT and PiT) are slower than MobileNetV2 on mobile devices. As happened with CNNs, the inference speed of MobileViT and ViTs is expected to improve further once dedicated device-level operations become available.


