Review — MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
MobileViT: Combining Convolutions and Transformers
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
MobileViT, by Apple
2022 ICLR, Over 180 Citations (Sik-Ho Tsang @ Medium)
Image Classification, ViT, Transformer
- Unlike CNNs, ViTs are heavyweight. In this paper, authors ask the following question:
- Is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks?
- Towards this end, MobileViT, a light-weight and general-purpose ViT for mobile devices, is introduced. MobileViT presents a different perspective for the global processing of information with Transformers.
- To the best of the authors' knowledge, this is the first work showing that light-weight ViTs can achieve light-weight-CNN-level performance with simple training recipes across different mobile vision tasks.
Outline
- MobileViT Block
- MobileViT Model Architecture
- Results
1. MobileViT Block
1.1. Standard ViT
- A standard ViT model reshapes the input X of size H×W×C into a sequence of flattened patches Xf of size N×PC, projects it into a fixed d-dimensional space Xp of size N×d, and then inter-patch representations are learnt using a stack of L Transformer blocks.
- The computational cost of self-attention in ViTs is O(N²d).
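As a concrete illustration, below is a minimal PyTorch sketch of this patchify-and-project step (the 224×224 input, 16×16 patch size, and d=192 are illustrative assumptions, not values fixed by the paper):

```python
import torch

# Assumed sizes: 224x224 RGB input, 16x16 patches, d = 192 (DeiT-Ti-like width).
H, W, C, ph, d = 224, 224, 3, 16, 192
P = ph * ph                          # pixels per patch, P = wh
N = H * W // P                       # number of patches (196 here)

x = torch.randn(1, H, W, C)
# Flatten non-overlapping patches: (1, H, W, C) -> (1, N, P*C)
x_f = (x.reshape(1, H // ph, ph, W // ph, ph, C)
        .permute(0, 1, 3, 2, 4, 5)
        .reshape(1, N, P * C))
# Project each patch into a fixed d-dimensional space: (1, N, P*C) -> (1, N, d)
x_p = torch.nn.Linear(P * C, d)(x_f)
# Self-attention forms an N x N affinity matrix over d-dim tokens -> O(N^2 * d) cost
attn = torch.softmax(x_p @ x_p.transpose(1, 2) / d ** 0.5, dim=-1)   # (1, N, N)
```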
1.2. MobileViT Block
- For a given input X of size H×W×C, MobileViT applies an n×n standard convolutional layer followed by a point-wise (or 1×1) convolutional layer to produce XL of size H×W×d. The n×n convolutional layer encodes local spatial information, while the point-wise convolution projects the tensor to a high-dimensional space (or d-dimensional, where d>C).
- XL is unfolded into N non-overlapping flattened patches XU of size P×N×d, where P=wh, N=HW/P.
- For each p ∈ {1, …, P}, inter-patch relationships are encoded by applying Transformers to obtain XG of size P×N×d as:
XG(p) = Transformer(XU(p)), 1 ≤ p ≤ P
- XG is then folded to obtain XF of size H×W×d. XF is projected back to the low C-dimensional space using a point-wise convolution and combined with X via a concatenation operation. Another n×n convolutional layer is then used to fuse these concatenated features.
- Because XU(p) encodes local information from an n×n region using convolutions and XG(p) encodes global information across P patches for the p-th location, each pixel in XG can encode information from all pixels in X. Thus, the overall effective receptive field of MobileViT is H×W.
- The MobileViT block can be viewed as Transformers as convolutions and can be used out of the box, allowing MobileViT to run on different devices without any extra effort.
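To make the unfold → Transformer → fold pipeline above concrete, here is a minimal PyTorch sketch of a MobileViT-style block (illustrative, not the official implementation; the channel sizes, nhead, and feed-forward width are assumptions):

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Minimal sketch of a MobileViT block: local conv -> unfold -> Transformers -> fold -> fuse."""
    def __init__(self, C=64, d=96, n=3, h=2, w=2, L=2):
        super().__init__()
        self.h, self.w = h, w
        self.local = nn.Sequential(                          # local representation
            nn.Conv2d(C, C, n, padding=n // 2),              # n x n conv: local spatial info
            nn.Conv2d(C, d, 1),                              # point-wise conv: C -> d
        )
        layer = nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=2 * d, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, num_layers=L)  # L Transformer blocks
        self.proj = nn.Conv2d(d, C, 1)                       # point-wise conv: d -> C
        self.fuse = nn.Conv2d(2 * C, C, n, padding=n // 2)   # fuse concat(X, XF)

    def forward(self, x):                                    # x: (B, C, H, W)
        B, _, H, W = x.shape
        y = self.local(x)                                    # (B, d, H, W)
        d = y.shape[1]
        # Unfold: P = h*w pixel positions, each a sequence of N = HW/P tokens
        P, N = self.h * self.w, (H // self.h) * (W // self.w)
        y = y.reshape(B, d, H // self.h, self.h, W // self.w, self.w)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * P, N, d)
        y = self.global_rep(y)                               # Transformers per position p
        # Fold back to (B, d, H, W)
        y = y.reshape(B, self.h, self.w, H // self.h, W // self.w, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        y = self.proj(y)                                     # (B, C, H, W)
        return self.fuse(torch.cat([x, y], dim=1))           # concat with input, n x n conv

block = MobileViTBlockSketch()
out = block(torch.randn(1, 64, 32, 32))                      # (1, 64, 32, 32), same as input
```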
2. MobileViT Model Architecture
2.1. Model Architecture
- Unlike prior hybrid designs, MobileViT uses convolutions and Transformers in a way that the resultant MobileViT block has convolution-like properties while simultaneously allowing for global processing. This modeling capability allows the design of shallow and narrow MobileViT models, which in turn are light-weight.
- Compared to the ViT-based model DeiT that uses L=12 and d=192, the MobileViT model uses L={2, 4, 3} and d={96, 120, 144} at spatial levels 32×32, 16×16, and 8×8, respectively. The resulting MobileViT network is faster (1.85×), smaller (2×), and better (+1.8%) than DeiT.
- The self-attention cost of MobileViT is O(N²Pd), which seems larger than the O(N²d) of ViT. In practice, however, MobileViT is shallower and narrower, which makes it more efficient than ViT-based models such as DeiT (see the rough cost comparison after this list).
- Three different network sizes (S: small, XS: extra small, and XXS: extra extra small) are designed.
- Swish activation is used. Following CNN models, n=3.
- The spatial dimensions of feature maps are usually multiples of 2 and h,w≤n. Therefore, h=w=2 is used at all spatial levels.
- The MV2 (MobileNetV2 inverted residual) blocks in the MobileViT network are mainly responsible for down-sampling.
- MobileViT does not require any positional embeddings.
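As a rough illustration of the cost claim above, the attention multiply-adds can be compared back-of-the-envelope (the per-level L and d values and the 32/16/8 feature-map sizes are quoted from the text; P = h·w = 4 follows from h = w = 2, and FFN/convolution costs are ignored):

```python
# Attention cost per spatial level: ~ L * N^2 * P * d, with N = (s*s) / P.
P = 4
levels = [(32, 96, 2), (16, 120, 4), (8, 144, 3)]   # (feature size s, d, L)
mvit = sum(L * ((s * s) // P) ** 2 * P * d for s, d, L in levels)

# DeiT: 224x224 input, 16x16 patches -> N = 196 tokens, d = 192, L = 12: ~ L * N^2 * d.
deit = 12 * 196 ** 2 * 192

print(f"MobileViT ~{mvit / 1e6:.0f}M vs DeiT ~{deit / 1e6:.0f}M attention mult-adds")
```

Despite the extra P factor, the small N, d, and L at each spatial level keep the total modest, which is why the shallow-and-narrow design wins in practice.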
2.2. Multi-Scale Sampler for Training Efficiency
- Rather than a fixed batch size, variably-sized batches are used: a multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}) samples a resolution per iteration and scales the batch size accordingly. (Please refer to CVNets for more details; a minimal sketch follows.)
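A minimal sketch of how such a sampler can pair each resolution with a rescaled batch size is shown below (the base batch size is an assumed value; the actual distributed-aware sampler is in CVNets):

```python
import random

# Batch size scales inversely with resolution so per-batch compute stays roughly constant.
RESOLUTIONS = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]
BASE_BATCH, MAX_H, MAX_W = 128, 320, 320      # base batch at the largest resolution (assumed)

def sample_batch_config():
    h, w = random.choice(RESOLUTIONS)
    batch_size = BASE_BATCH * (MAX_H * MAX_W) // (h * w)  # larger batches at lower resolutions
    return (h, w), batch_size

for _ in range(3):
    print(sample_batch_config())
```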
3. Results
3.1. Image Classification
- (a) & (b): MobileViT outperforms light-weight CNNs across different network sizes.
- (c): MobileViT delivers better performance than heavyweight CNNs.
- MobileViT is also compared with ViT variants trained without any distillation.
- Unlike ViT variants that benefit significantly from advanced augmentation, MobileViT achieves better performance with fewer parameters and basic augmentation.
3.2. Object Detection
- (a): For the same input resolution of 320×320, SSDLite with MobileViT outperforms SSDLite with other light-weight CNN backbones.
- (b): Further, SSDLite with MobileViT outperforms the standard SSD-300 with heavy-weight backbones while learning significantly fewer parameters.
3.3. Semantic Segmentation
- DeepLabv3 with MobileViT is smaller and better: it gives performance competitive with the ResNet-101 backbone while requiring 9× fewer parameters.
3.4. Inference Time
- Pre-trained full-precision MobileViT models are converted to CoreML using the publicly available CoreMLTools (2021). Their inference time is then measured (averaged over 100 iterations) on a mobile device, i.e., an iPhone 12.
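A conversion along these lines might look as follows (a hedged sketch: the placeholder model and file name are assumptions; only the generic coremltools TorchScript conversion API is real):

```python
import torch
import coremltools as ct

# Placeholder standing in for a pre-trained full-precision MobileViT in eval mode.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, stride=2, padding=1)).eval()
example = torch.rand(1, 3, 256, 256)                       # example input for tracing
traced = torch.jit.trace(model, example)                   # TorchScript graph
mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=example.shape)])
mlmodel.save("mobilevit_sketch.mlmodel")                   # deployable on-device, e.g., iPhone 12
```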
- The inference times of MobileViT networks are measured with two patch-size settings (Config-A: 2, 2, 2 and Config-B: 8, 4, 2) on three different tasks. The three numbers in each config denote the patch height h (with width w = h) at output strides of 8, 16, and 32, respectively.
- Models with smaller patches (Config-A) are more accurate than those with larger patches (Config-B), while Config-B models are faster than Config-A models.
- MobileViT and other ViT-based networks (e.g., DeiT and PiT) are slower as compared to MobileNetV2 on mobile devices. Similar to CNNs, the inference speed of MobileViT and ViTs will further improve with dedicated device-level operations in the future.
Reference
[2022 ICLR] [MobileViT]
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer