Review — MobileOne: An Improved One millisecond Mobile Backbone
MobileOne, Low-Latency Design for Image Classification
An Improved One millisecond Mobile Backbone, MobileOne, by Apple, 2022 arXiv v1, Over 5 Citations (Sik-Ho Tsang @ Medium)
- Extensive analysis of different metrics is performed by deploying several mobile-friendly networks on a mobile device.
- An efficient backbone, MobileOne, is proposed, with variants achieving an inference time under 1 ms on an iPhone 12 and 75.9% top-1 accuracy on ImageNet, a significantly lower latency than competing efficient networks.
- Analysis on Latency
- MobileOne Architecture
- Experimental Results
1. Analysis on Latency
1.1. Correlations Between Latency, Parameters & FLOPs
- Many models with higher parameter count can have lower latency.
- The convolutional models such as MobileNets have lower latency for similar FLOPs and parameter count than their Transformer counterparts.
Measuring the Spearman rank correlation shows that latency is moderately correlated with FLOPs and weakly correlated with parameter count for efficient architectures on a mobile device; the correlation is even lower on a desktop CPU.
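As a concrete illustration of the metric used here, Spearman's rank correlation can be computed directly from rank differences. The per-model numbers below are hypothetical, for demonstration only, not measurements from the paper:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes no ties, so the closed form 1 - 6*sum(d^2)/(n*(n^2-1)) applies.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rx = np.argsort(np.argsort(x))  # 0-based rank of each element
    ry = np.argsort(np.argsort(y))
    d = rx - ry                     # per-item rank difference
    n = len(x)
    return 1.0 - 6.0 * np.dot(d, d) / (n * (n**2 - 1))

# Hypothetical (illustrative) per-model numbers:
flops = [0.3, 0.6, 1.2, 2.0, 4.1]     # GFLOPs
latency = [0.9, 1.1, 1.0, 2.5, 3.8]   # ms
print(spearman_rho(flops, latency))   # prints 0.9
```

A value near 1 indicates the metric preserves the latency ordering; the paper's point is that neither FLOPs nor parameter count does so reliably on mobile.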
1.2. Activation Function Design Choice
- For models with the same architecture but different activation functions, latencies are drastically different.
Only ReLU activations are used in MobileOne.
1.3. Architectural Block Design Choice
- Memory access cost increases significantly in multi-branch architectures as activations from each branch have to be stored to compute the next tensor in the graph.
- Architectural blocks that force synchronization like global pooling operations used in Squeeze-Excite block as in SENet also affect overall run-time due to synchronization costs.
An architecture with no branches at inference is adopted, which results in smaller memory access cost. In addition, the use of Squeeze-Excite blocks (SENet) is limited to the biggest proposed variant in order to improve accuracy.
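For reference, a Squeeze-Excite block is just a global average pool followed by two small fully connected layers that rescale channels. A minimal NumPy sketch (layout and weight shapes are my assumptions, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """Minimal Squeeze-Excite recalibration (SENet-style), NCHW layout.

    x:  (N, C, H, W) feature map
    w1: (C//r, C) channel-reduction weights, w2: (C, C//r) expansion weights
    The global average pool is the synchronization point the review
    mentions as a source of extra run-time cost on mobile.
    """
    s = x.mean(axis=(2, 3))           # squeeze: (N, C)
    z = np.maximum(s @ w1.T, 0.0)     # reduce + ReLU: (N, C//r)
    e = sigmoid(z @ w2.T)             # expand + sigmoid: (N, C)
    return x * e[:, :, None, None]    # excite: per-channel rescale
```

Because every output pixel depends on the pooled statistic of the whole feature map, this block cannot overlap with neighboring work, which is why MobileOne restricts it to the largest variant.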
2. MobileOne Architecture
2.1. Architecture During Training and Inference
- Left: The basic block builds on the MobileNetV1 block of a 3×3 depthwise convolution followed by a 1×1 pointwise convolution. Re-parameterizable skip connections with batchnorm, along with over-parameterized branches, are then introduced.
The overparameterization factor k is a hyperparameter which is varied from 1 to 5.
Right: At inference, the MobileOne model does not have any branches; they are removed using the re-parameterization process.
2.2. Re-parameterization Process
- For a convolutional layer of kernel size K, input channel dimension C_in and output channel dimension C_out, the weight matrix is denoted as W' and the bias as b'.
- A batchnorm layer contains accumulated mean μ, accumulated standard deviation σ, scale γ and bias β.
Since convolution and batchnorm at inference are linear operations, they can be folded into a single convolution layer with weights Ŵ' = W' · (γ/σ) and bias b̂' = (b' − μ) · (γ/σ) + β.
- The batchnorm is folded into the preceding convolutional layer in all the branches.
- For skip connection, the batchnorm is folded to a convolutional layer with identity 1×1 kernel, which is then padded by K-1 zeros as described in RepVGG.
- After obtaining the batchnorm-folded weights Ŵ'_i and biases b̂'_i in each branch, the weights W = Σ_{i=1..M} Ŵ'_i and bias b = Σ_{i=1..M} b̂'_i for the single convolution layer at inference are obtained, where M is the number of branches.
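The folding and merging steps above can be sketched in NumPy (shapes and function names are illustrative, not Apple's implementation):

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, sigma):
    """Fold a batchnorm (gamma, beta, mu, sigma) into the preceding conv.

    W: (C_out, C_in, K, K) conv weights, b: (C_out,) conv bias.
    Returns the equivalent single conv's parameters:
        W_hat = W * gamma / sigma
        b_hat = (b - mu) * gamma / sigma + beta
    """
    scale = gamma / sigma
    return W * scale[:, None, None, None], (b - mu) * scale + beta

def identity_kernel(C, K):
    """Skip connection expressed as a conv (as in RepVGG): a 1x1
    identity kernel zero-padded to KxK, identity at the center tap."""
    W = np.zeros((C, C, K, K))
    W[np.arange(C), np.arange(C), K // 2, K // 2] = 1.0
    return W

def merge_branches(folded):
    """Sum the BN-folded (W_i, b_i) over all M branches into one conv."""
    Ws, bs = zip(*folded)
    return sum(Ws), sum(bs)
```

Because convolution is linear in its weights, summing the folded branch weights gives a single convolution whose output equals the sum of the branch outputs, which is exactly the branchless inference block.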
Using re-parameterizable branches significantly improves performance.
2.3. Model Scaling
- MobileOne uses depth scaling similar to MobileNetV2.
- 5 different width scales are shown above. Authors do not explore scaling up of input resolution as both FLOPs and memory consumption increase.
As MobileOne does not have a multi-branched architecture at inference, it does not incur the associated data movement costs. This enables aggressive scaling of model parameters compared to competing multi-branched architectures such as MobileNetV2, EfficientNets, etc., without incurring significant latency cost. The increased parameter count enables MobileOne to generalize well to other computer vision tasks like object detection and semantic segmentation.
- Several training strategies, shown above, are used to improve accuracy.
3. Experimental Results
3.1. Image Classification
The MobileOne-S1 variant outperforms RepVGG-B0, which is ~3× bigger.
The current state-of-the-art MobileFormer attains a top-1 accuracy of 79.3% with a latency of 70.76 ms, while MobileOne-S4 attains 79.4% with a latency of only 1.86 ms, which is about 38× faster on mobile.
MobileOne-S3 has 1% better top-1 accuracy than EfficientNet-B0 and is faster by 11% on mobile.
- MobileOne models have a lower latency even on CPU compared to competing methods. MobileOne-S4 has 2.3% better top-1 accuracy than EfficientNet-B0 while being faster by 7.3% on CPU.
3.2. Object Detection & Semantic Segmentation
- (a) Object Detection: The SSDLite framework is used. The proposed best model outperforms MnasNet by 27.8% and the best version of MobileViT by 6.1%.
- (b) Semantic Segmentation: The DeepLabv3 framework is used. On VOC, the proposed best model outperforms MobileViT by 1.3% and MobileNetV2 by 5.8%. Using the MobileOne-S1 backbone, which has lower latency than the MobileNetV2-1.0 backbone, MobileOne still outperforms it by 2.1%.
- For ADE20K, the proposed best variant outperforms MobileNetV2 by 12.0%. Using the smaller MobileOne-S1 backbone, MobileOne still outperforms it by 2.9%.
- Some qualitative results show that MobileOne has better detection and segmentation results.