Review — MobileOne: An Improved One millisecond Mobile Backbone

MobileOne, Low Latency Design for Image Classification

Jan 25, 2023

(a) Top-1 accuracy on image classification vs latency on an iPhone 12, (b) zoomed out area to include recent transformer architectures, and (c) mAP on object detection vs Top-1 accuracy on image classification

An Improved One millisecond Mobile Backbone,
MobileOne, by Apple,
2022 arXiv v1, Over 5 Citations (Sik-Ho Tsang @ Medium)
Image Classification

Extensive analysis of different metrics is performed by deploying several mobile-friendly networks on a mobile device.
An efficient backbone, MobileOne, is proposed, with variants achieving an inference time under 1 ms on an iPhone 12 with 75.9% top-1 accuracy on ImageNet. Its latency is significantly lower than that of competing models, as shown above.

Outline

Analysis on Latency
MobileOne Architecture
Results

1. Analysis on Latency

1.1. Correlations Between Latency, Parameters & FLOPs

**Left: FLOPs vs Latency on iPhone12. Right: Parameter Count vs Latency on iPhone 12.**

Many models with higher parameter count can have lower latency.
The convolutional models such as MobileNets have lower latency for similar FLOPs and parameter count than their Transformer counterparts.

**Spearman rank correlation coefficient between latency-flops.**

By measuring the Spearman rank correlation, latency is moderately correlated with FLOPs and weakly correlated with parameter counts for efficient architectures on a mobile device. This correlation is even lower on a desktop CPU.
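As a rough illustration of the metric used here, the Spearman rank correlation is the Pearson correlation computed on ranks rather than raw values. The sketch below uses only the Python standard library; the helper names and the FLOPs/latency numbers are made-up placeholders, not measurements from the paper.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values (ties).
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical placeholder values for five models (NOT from the paper).
flops_m = [100, 200, 300, 400, 500]
latency_ms = [1.0, 0.8, 1.5, 1.2, 2.0]

rho = spearman(flops_m, latency_ms)
print(round(rho, 3))  # 0.8
```

A value near 1 would mean that ranking models by FLOPs reproduces their latency ranking almost exactly. The moderate and weak correlations reported above are why the authors measure latency directly on the device.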

1.2. Activation Function Design Choice

**Comparison of latency on mobile device of different activation functions in a 30-layer convolutional neural network.**

Networks with the same architecture but different activation functions have drastically different latencies.

Only ReLU activations are used in MobileOne.

1.3. Architectural Block Design Choice

**Ablation on latency of different architectural blocks in a 30-layer convolutional neural network.**

Memory access cost increases significantly in multi-branch architectures as activations from each branch have to be stored to compute the next tensor in the graph.
Architectural blocks that force synchronization, such as the global pooling operations used in the Squeeze-Excite block of SENet, also affect overall run-time due to synchronization costs.

An architecture with no branches at inference is adopted, which results in a smaller memory access cost. In addition, the use of Squeeze-Excite blocks (SENet) is limited to the largest proposed variant, where they are used to improve accuracy.

2. MobileOne Architecture

MobileOne block has two different structures at train time and test time. Left: Train time MobileOne block with reparameterizable branches. Right: MobileOne block at inference where the branches are reparameterized.

2.1. Architecture During Training and Inference

Left: The basic block builds on the MobileNetV1 block of a 3×3 depthwise convolution followed by a 1×1 pointwise convolution. A re-parameterizable skip connection with batchnorm is then introduced, along with additional re-parameterizable branches that replicate the structure.
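To make the depthwise-plus-pointwise factorization concrete, here is a minimal sketch of the standard weight-count formulas for a regular convolution versus a depthwise separable one. The function names and channel sizes are illustrative assumptions, not values taken from the paper, and biases and batchnorm parameters are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Illustrative sizes only (hypothetical, not from the paper).
k, c_in, c_out = 3, 64, 128
print(standard_conv_params(k, c_in, c_out))        # 73728
print(depthwise_separable_params(k, c_in, c_out))  # 8768
```

For these sizes the separable block needs roughly 8× fewer weights, which is the starting point that the train-time branches then over-parameterize.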

**Comparison of Top-1 on ImageNet for various values of trivial over-parameterization factor k**

The overparameterization factor k is a hyperparameter which is varied from 1 to 5.

Right: At inference, the MobileOne model does not have any branches, as they are removed by the re-parameterization process.

2.2. Re-parameterization Process

For a convolutional layer of kernel size K, input channel dimension Cin and output channel dimension Cout, the weight matrix is denoted as W’ and bias is denoted as b’.
A batchnorm layer contains accumulated mean μ, accumulated standard deviation σ, scale γ and bias β.

Since convolution and batchnorm at inference are linear operations, they can be folded into a single convolution layer with weights Ŵ = W’ · γ/σ and bias b̂ = (b’ − μ) · γ/σ + β.

Batchnorm is folded into the preceding convolutional layer in all the branches.
For the skip connection, the batchnorm is folded into a convolutional layer with an identity 1×1 kernel, which is then padded by K-1 zeros as described in RepVGG.
After obtaining the batchnorm-folded weights in each branch, the weights W = Σ_i Ŵ_i and bias b = Σ_i b̂_i (with i running from 1 to M) of the single convolution layer used at inference are obtained, where M is the number of branches.
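The folding and summing steps can be checked numerically. The following is a minimal sketch under simplifying assumptions: a single channel, a 1-D signal instead of a 2-D feature map, σ used directly as the accumulated standard deviation, and made-up weights and statistics. All helper names are hypothetical. It folds each branch's batchnorm into its convolution, zero-pads the 1×1 and skip kernels to width K, sums the branches, and compares the result with the multi-branch output.

```python
def conv1d(x, w, b):
    """'Same'-padded 1-D cross-correlation with an odd-length kernel w."""
    pad = len(w) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(wj * xp[i + j] for j, wj in enumerate(w)) + b
            for i in range(len(x))]

def batchnorm(y, mu, sigma, gamma, beta):
    """Inference-time batchnorm using accumulated statistics."""
    return [gamma * (v - mu) / sigma + beta for v in y]

def fold_bn(w, b, mu, sigma, gamma, beta):
    """Fold batchnorm into the preceding conv: returns (W_hat, b_hat)."""
    scale = gamma / sigma
    return [wj * scale for wj in w], (b - mu) * scale + beta

def pad_to(w, k):
    """Zero-pad an odd-length kernel symmetrically to length k."""
    side = (k - len(w)) // 2
    return [0.0] * side + list(w) + [0.0] * side

K = 3
# Each branch: (kernel, bias, (mu, sigma, gamma, beta)). Values are made up.
branches = [
    ([0.2, -0.5, 0.3], 0.1, (0.05, 1.2, 0.9, -0.1)),  # K-wide conv + BN
    ([0.7], 0.0, (-0.02, 0.8, 1.1, 0.2)),             # 1x1 conv + BN
    ([1.0], 0.0, (0.10, 1.5, 1.0, 0.0)),              # skip: identity + BN
]
x = [1.0, -2.0, 0.5, 3.0, -1.5]

# Train-time structure: run every branch separately and add the outputs.
train_out = [0.0] * len(x)
for w, b, bn in branches:
    y = batchnorm(conv1d(x, w, b), *bn)
    train_out = [t + v for t, v in zip(train_out, y)]

# Inference structure: fold BN, pad to width K, sum into one kernel and bias.
W, B = [0.0] * K, 0.0
for w, b, bn in branches:
    w_hat, b_hat = fold_bn(w, b, *bn)
    W = [a + c for a, c in zip(W, pad_to(w_hat, K))]
    B += b_hat
infer_out = conv1d(x, W, B)

max_err = max(abs(a - c) for a, c in zip(train_out, infer_out))
print(max_err < 1e-9)  # True
```

The single merged kernel reproduces the multi-branch output up to floating-point error, which is why the branches cost nothing at inference. In a real multi-channel layer the identity kernel is a per-channel identity, but the algebra is the same.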

**Plot of train and validation losses of MobileOne-S0 model.**

**Effect of re-parameterizable branches on Top-1 ImageNet accuracy.**

Using re-parameterizable branches significantly improves performance.

2.3. Model Scaling

MobileOne has similar depth scaling to MobileNetV2.
Five different width scales are shown above. The authors do not explore scaling up the input resolution, as both FLOPs and memory consumption would increase.

As MobileOne does not have a multi-branched architecture at inference, it does not incur the associated data movement costs. This makes it possible to scale model parameters aggressively compared to competing multi-branched architectures such as MobileNetV2 and EfficientNets, without incurring a significant latency cost. The increased parameter count enables MobileOne to generalize well to other computer vision tasks like object detection and semantic segmentation.

2.4. Training

**Ablation on various train settings for MobileOne-S2 showing Top-1 accuracy on Imagenet.**

The training strategies ablated above (progressive learning, annealed weight decay, and an exponential moving average of the weights) are used to improve accuracy.

3. Results

3.1. Image Classification

**Comparison of Top-1 Accuracy on ImageNet against recent train time overparameterization works.**

The MobileOne-S1 variant outperforms RepVGG-B0, which is ~3× bigger.

**Performance of various models on ImageNet-1k validation set.**

The current state-of-the-art MobileFormer [5] attains a top-1 accuracy of 79.3% with a latency of 70.76 ms, while MobileOne-S4 attains 79.4% with a latency of only 1.86 ms, which is 38× faster on mobile.
MobileOne-S3 has 1% better top-1 accuracy than EfficientNet-B0 and is faster by 11% on mobile.
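The 38× figure quoted for MobileOne-S4 follows directly from the two latencies given above. A quick check (variable names are just for illustration):

```python
mobileformer_ms = 70.76  # MobileFormer latency quoted above
mobileone_s4_ms = 1.86   # MobileOne-S4 latency quoted above

speedup = mobileformer_ms / mobileone_s4_ms
print(f"{speedup:.1f}x")  # 38.0x
```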

MobileOne models have a lower latency even on CPU compared to competing methods. MobileOne-S4 has 2.3% better top-1 accuracy than EfficientNet-B0 while being faster by 7.3% on CPU.

3.2. Object Detection & Semantic Segmentation

**(a) Quantitative performance of object detection on MS-COCO. (b) Quantitative performance of semantic segmentation on Pascal-VOC and** **ADE20K** **datasets.**

(a) Object Detection: The SSDLite framework is used. The best proposed model outperforms MnasNet by 27.8% and the best version of MobileViT by 6.1%.
(b) Semantic Segmentation: The DeepLabv3 framework is used. For VOC, the proposed model outperforms MobileViT by 1.3% and MobileNetV2 by 5.8%. Using the MobileOne-S1 backbone, which has a lower latency than the MobileNetV2-1.0 backbone, MobileOne still outperforms it by 2.1%.
For ADE20K, the best proposed variant outperforms MobileNetV2 by 12.0%. Using the smaller MobileOne-S1 backbone, MobileOne still outperforms it by 2.9%.