Review — MobileOne: An Improved One millisecond Mobile Backbone
MobileOne, Low-Latency Design for Image Classification
An Improved One millisecond Mobile Backbone, MobileOne, by Apple, 2022 arXiv v1, Over 5 Citations (Sik-Ho Tsang @ Medium)
- Extensive analysis of different metrics is performed by deploying several mobile-friendly networks on a mobile device.
- An efficient backbone, MobileOne, is proposed, with variants achieving an inference time under 1 ms on an iPhone 12 and 75.9% top-1 accuracy on ImageNet, a significantly lower latency than competing efficient networks.
- Analysis on Latency
- MobileOne Architecture
- Experimental Results
1. Analysis on Latency
1.1. Correlations Between Latency, Parameters & FLOPs
- Many models with higher parameter count can have lower latency.
- The convolutional models such as MobileNets have lower latency for similar FLOPs and parameter count than their Transformer counterparts.
Measuring the Spearman rank correlation shows that latency is moderately correlated with FLOPs and weakly correlated with parameter count for efficient architectures on a mobile device; the correlation is even lower on a desktop CPU.
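As a concrete illustration of the metric used here, Spearman's rank correlation can be computed directly from rank differences. The per-model numbers below are hypothetical, for demonstration only, not measurements from the paper:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes no ties, so the closed form 1 - 6*sum(d^2)/(n*(n^2-1)) applies.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rx = np.argsort(np.argsort(x))  # 0-based rank of each element
    ry = np.argsort(np.argsort(y))
    d = rx - ry                     # per-item rank difference
    n = len(x)
    return 1.0 - 6.0 * np.dot(d, d) / (n * (n**2 - 1))

# Hypothetical (illustrative) per-model numbers:
flops = [0.3, 0.6, 1.2, 2.0, 4.1]     # GFLOPs
latency = [0.9, 1.1, 1.0, 2.5, 3.8]   # ms
print(spearman_rho(flops, latency))   # prints 0.9
```

A value near 1 indicates the metric preserves the latency ordering; the paper's point is that neither FLOPs nor parameter count does so reliably on mobile.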
1.2. Activation Function Design Choice
- For models with the same architecture but different activation functions, latencies are drastically different.
Only ReLU activations are used in MobileOne.
1.3. Architectural Block Design Choice
- Memory access cost increases significantly in multi-branch architectures as activations from each branch have to be stored to compute the next tensor in the graph.
- Architectural blocks that force synchronization like global pooling operations used in Squeeze-Excite block as in SENet also affect overall run-time due to synchronization costs.
An architecture with no branches at inference is adopted, which results in smaller memory access cost. In addition, the use of Squeeze-Excite blocks (SENet) is limited to the biggest proposed variant in order to improve accuracy.
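For reference, a Squeeze-Excite block is just a global average pool followed by two small fully connected layers that rescale channels. A minimal NumPy sketch (layout and weight shapes are my assumptions, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """Minimal Squeeze-Excite recalibration (SENet-style), NCHW layout.

    x:  (N, C, H, W) feature map
    w1: (C//r, C) channel-reduction weights, w2: (C, C//r) expansion weights
    The global average pool is the synchronization point the review
    mentions as a source of extra run-time cost on mobile.
    """
    s = x.mean(axis=(2, 3))           # squeeze: (N, C)
    z = np.maximum(s @ w1.T, 0.0)     # reduce + ReLU: (N, C//r)
    e = sigmoid(z @ w2.T)             # expand + sigmoid: (N, C)
    return x * e[:, :, None, None]    # excite: per-channel rescale
```

Because every output pixel depends on the pooled statistic of the whole feature map, this block cannot overlap with neighboring work, which is why MobileOne restricts it to the largest variant.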
2. MobileOne Architecture
2.1. Architecture During Training and Inference
- Left: The basic block builds on the MobileNetV1 block of a 3×3 depthwise convolution followed by a 1×1 pointwise convolution. Re-parameterizable skip connections with batchnorm, along with over-parameterized branches, are then introduced.
The overparameterization factor k is a hyperparameter which is varied from 1 to 5.
Right: At inference, the MobileOne model does not have any branches; they are removed using the re-parameterization process.
2.2. Re-parameterization Process
- For a convolutional layer of kernel size K, input channel dimension C_in and output channel dimension C_out, the weight matrix is denoted as W' and the bias as b'.
- A batchnorm layer contains accumulated mean μ, accumulated standard deviation σ, scale γ and bias β.
Since convolution and batchnorm at inference are linear operations, they can be folded into a single convolution layer with weights Ŵ' = W' · (γ/σ) and bias b̂' = (b' − μ) · (γ/σ) + β.
- The batchnorm is folded into the preceding convolutional layer in all the branches.
- For skip connection, the batchnorm is folded to a convolutional layer with identity 1×1 kernel, which is then padded by K-1 zeros as described in RepVGG.
- After obtaining the batchnorm-folded weights Ŵ'_i and biases b̂'_i in each branch, the weights W = Σ_{i=1..M} Ŵ'_i and bias b = Σ_{i=1..M} b̂'_i for the single convolution layer at inference are obtained, where M is the number of branches.
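The folding and merging steps above can be sketched in NumPy (shapes and function names are illustrative, not Apple's implementation):

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, sigma):
    """Fold a batchnorm (gamma, beta, mu, sigma) into the preceding conv.

    W: (C_out, C_in, K, K) conv weights, b: (C_out,) conv bias.
    Returns the equivalent single conv's parameters:
        W_hat = W * gamma / sigma
        b_hat = (b - mu) * gamma / sigma + beta
    """
    scale = gamma / sigma
    return W * scale[:, None, None, None], (b - mu) * scale + beta

def identity_kernel(C, K):
    """Skip connection expressed as a conv (as in RepVGG): a 1x1
    identity kernel zero-padded to KxK, identity at the center tap."""
    W = np.zeros((C, C, K, K))
    W[np.arange(C), np.arange(C), K // 2, K // 2] = 1.0
    return W

def merge_branches(folded):
    """Sum the BN-folded (W_i, b_i) over all M branches into one conv."""
    Ws, bs = zip(*folded)
    return sum(Ws), sum(bs)
```

Because convolution is linear in its weights, summing the folded branch weights gives a single convolution whose output equals the sum of the branch outputs, which is exactly the branchless inference block.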
Using re-parameterizable branches significantly improves performance.
2.3. Model Scaling
- MobileOne uses depth scaling similar to MobileNetV2.
- 5 different width scales are shown above. Authors do not explore scaling up of input resolution as both FLOPs and memory consumption increase.
As MobileOne does not have a multi-branched architecture at inference, it does not incur the associated data movement costs. This enables aggressive scaling of model parameters compared to competing multi-branched architectures such as MobileNetV2, EfficientNets, etc., without incurring significant latency cost. The increased parameter count enables MobileOne to generalize well to other computer vision tasks like object detection and semantic segmentation.
- Several training strategies, shown above, are used to improve accuracy.
3. Experimental Results
3.1. Image Classification
The MobileOne-S1 variant outperforms RepVGG-B0, which is ~3× bigger.
The current state-of-the-art MobileFormer attains a top-1 accuracy of 79.3% with a latency of 70.76 ms, while MobileOne-S4 attains 79.4% with a latency of only 1.86 ms, which is about 38× faster on mobile.
MobileOne-S3 has 1% better top-1 accuracy than EfficientNet-B0 and is faster by 11% on mobile.
- MobileOne models have a lower latency even on CPU compared to competing methods. MobileOne-S4 has 2.3% better top-1 accuracy than EfficientNet-B0 while being faster by 7.3% on CPU.
3.2. Object Detection & Semantic Segmentation
- (a) Object Detection: The SSDLite framework is used. The proposed best model outperforms MnasNet by 27.8% and the best version of MobileViT by 6.1%.
- (b) Semantic Segmentation: The DeepLabv3 framework is used. On VOC, the proposed best model outperforms MobileViT by 1.3% and MobileNetV2 by 5.8%. Using the MobileOne-S1 backbone, which has lower latency than the MobileNetV2-1.0 backbone, MobileOne still outperforms it by 2.1%.
- For ADE20K, the proposed best variant outperforms MobileNetV2 by 12.0%. Using the smaller MobileOne-S1 backbone, MobileOne still outperforms it by 2.9%.
- Some qualitative results show that MobileOne has better detection and segmentation results.