Review — MobileNeXt: Rethinking Bottleneck Structure for Efficient Mobile Network Design

MobileNeXt, a Better Lightweight Model, Outperforms MobileNetV2

Sik-Ho Tsang
5 min read · Apr 14, 2023
MobileNeXt Outperforms MobileNetV2

Rethinking Bottleneck Structure for Efficient Mobile Network Design,
MobileNeXt, by National University of Singapore, Yitu Technology, and Institute of Data Science, NUS
2020 ECCV, Over 110 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====

  • The inverted residual block, which learns inverted residuals and uses linear bottlenecks, brings risks of information loss and gradient confusion.
  • In this paper, the sandglass block is proposed, which flips the inverted bottleneck structure so that identity mapping and spatial transformation are performed at higher dimensions.

Outline

  1. The Proposed Sandglass Block
  2. MobileNeXt Model Architecture
  3. Results

1. The Proposed Sandglass Block

1.1. Conceptual Idea

Conceptual diagram of different residual bottleneck blocks. (a) Classic residual block with bottleneck structure in ResNet. (b) Inverted residual block in MobileNetV2. (c) The proposed sandglass block.
  • (a): Spatial convolution is performed on low-dimensional features, which may risk information loss.
  • (b): The skip connection is placed on low-dimensional features, which may risk gradient confusion.
  • (c): The proposed sandglass block performs both the spatial convolution and the skip connection on high-dimensional features.

1.2. Sandglass Block

Different types of residual blocks. (a) Classic bottleneck structure with depthwise spatial convolutions. (b) The proposed sandglass block with bottleneck structure.
Basic operator description of the proposed sandglass block. Here, ‘t’ and ‘s’ denote the channel reduction ratio and the stride, respectively.

In detail, two pointwise convolutions for channel reduction and expansion are kept in the middle of the residual path, saving parameters and computation cost.

Two depthwise convolutions are placed at the ends of the residual path. Thereby, both depthwise convolutions are conducted in high-dimensional spaces, so richer feature representations can be extracted.

  • There is no activation layer after the reduction layer.
  • It is empirically found that adding an activation layer after the last convolution can negatively influence the classification performance.

Therefore, activation layers are only added after the first depthwise convolutional layer and the last pointwise convolutional layer.
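Below is a minimal PyTorch sketch of the sandglass block as described above: depthwise 3×3 → pointwise reduction → pointwise expansion → depthwise 3×3, with ReLU6 only after the first depthwise and the last pointwise layer. The layer ordering and activation placement follow the text; the class name, BatchNorm settings, and other details are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn

class SandglassBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, reduction=6):
        super().__init__()
        mid_ch = in_ch // reduction  # reduced (bottleneck) channels
        self.use_identity = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # depthwise conv at high dimension, followed by activation
            nn.Conv2d(in_ch, in_ch, 3, 1, 1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
            # pointwise reduction, no activation afterwards
            nn.Conv2d(in_ch, mid_ch, 1, 1, 0, bias=False),
            nn.BatchNorm2d(mid_ch),
            # pointwise expansion, followed by activation
            nn.Conv2d(mid_ch, out_ch, 1, 1, 0, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU6(inplace=True),
            # depthwise conv at high dimension, carries the stride, no activation
            nn.Conv2d(out_ch, out_ch, 3, stride, 1, groups=out_ch, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        if self.use_identity:
            out = out + x  # skip connection kept at high dimension
        return out
```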

  • To the best of the authors’ knowledge, this is the first work that attempts to investigate the advantages of the classic bottleneck structure over the inverted residual block for efficient network design.

2. MobileNeXt Model Architecture

2.1. Overall Architecture

Architecture details of the proposed MobileNeXt.
  • At the beginning of the network, there is a convolutional layer with 32 output channels. After that, the proposed sandglass blocks are stacked together.
  • The expansion ratio used in the network is set to 6 by default.
  • The output of the last building block is followed by a global average pooling layer to transform 2D feature maps to 1D feature vectors. A fully-connected layer is finally added to predict the final score for each category.
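The overall pipeline can be sketched as follows, reusing the SandglassBlock sketch from Section 1.2: a 32-channel stem convolution, a stack of sandglass blocks, global average pooling, and a fully-connected classifier. The per-stage channel/stride configuration below is a placeholder; the actual stage table is given in the architecture figure of the paper.

```python
import torch
import torch.nn as nn

class MobileNeXtSketch(nn.Module):
    def __init__(self, num_classes=1000, stage_cfg=None):
        super().__init__()
        # (out_channels, stride, repeats) per stage -- illustrative values only
        stage_cfg = stage_cfg or [(96, 2, 1), (144, 1, 1), (192, 2, 3)]
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1, bias=False),  # first conv with 32 output channels
            nn.BatchNorm2d(32),
            nn.ReLU6(inplace=True),
        )
        blocks, in_ch = [], 32
        for out_ch, stride, repeats in stage_cfg:
            for i in range(repeats):
                blocks.append(SandglassBlock(in_ch, out_ch, stride if i == 0 else 1))
                in_ch = out_ch
        self.blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)   # 2D feature maps -> 1D feature vector
        self.fc = nn.Linear(in_ch, num_classes)  # per-category scores

    def forward(self, x):
        x = self.blocks(self.stem(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)
```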

2.2. Identity Tensor Multiplier

  • There is no need to keep the whole identity tensor to combine with the residual path.
  • α ∈ [0, 1] is introduced as the identity tensor multiplier, which controls what portion of the channels in the identity tensor is preserved.
  • Formally, for an input F and output G with m channels, G_{1:αm} = Φ(F)_{1:αm} + F_{1:αm} and G_{αm:m} = Φ(F)_{αm:m}, where Φ is the transformation function of the residual path in the block.
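A minimal sketch of this idea in PyTorch: only the first α·m identity channels are added back to the residual output, while the remaining channels take the residual path only. The slicing-based implementation and function name are assumptions for illustration; the post only defines the behaviour.

```python
import torch

def partial_identity_add(x, residual, alpha=0.5):
    """x, residual: tensors of shape (N, C, H, W); alpha in [0, 1]."""
    c = x.shape[1]
    k = int(alpha * c)                    # number of identity channels preserved
    out = residual.clone()
    out[:, :k] = out[:, :k] + x[:, :k]    # skip connection on the first k channels only
    return out
```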

2.3. Model Variants

  • Five width multipliers (1.4, 1.0, 0.75, 0.5, and 0.35) are used to create five model variants.
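As a small illustration, a width multiplier simply scales the channel counts of each layer to produce these variants. Rounding the scaled count to a multiple of 8 is a common MobileNet-style convention assumed here, not a detail stated in this post.

```python
def scale_channels(channels, multiplier, divisor=8):
    # Scale a channel count by the width multiplier and round to a multiple of `divisor`.
    return max(divisor, int(channels * multiplier + divisor / 2) // divisor * divisor)

for m in (1.4, 1.0, 0.75, 0.5, 0.35):
    print(m, scale_channels(96, m))  # e.g. a 96-channel layer under each variant
```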

3. Results

3.1. Comparison with MobileNetV2

Comparisons with MobileNetV2 using different width multipliers with input resolution 224×224.

The proposed MobileNeXt models with different width multipliers all outperform MobileNetV2 with comparable numbers of parameters and computation.

Performance of the proposed MobileNeXt and MobileNetV2 after post-training quantization.

When the parameters and activations are quantized to 8 bits, MobileNeXt outperforms MobileNetV2 by 3.55% under the same quantization settings.

Performance of the proposed network and MobileNetV2 when adding the number of spatial convolutions (Dwise convs) in each building block.

After adding one more depthwise convolution, the performance of MobileNetV2 increases to 73%, which is still worse than MobileNeXt (74%), despite MobileNetV2 now having more learnable parameters and higher complexity.

3.2. SOTA Comparisons

Comparisons with other state-of-the-art models.
  • EfficientNet-b0 architecture is used and the inverted residual block is replaced with sandglass block.

With a comparable amount of computation and 20% parameter reduction, replacing the inverted residual block with sandglass block results in 0.4% top-1 classification accuracy improvement on ImageNet-1k dataset.

Model performance and latency comparisons with different identity tensor multipliers.
  • When half of the identity representations are removed, the performance does not drop while the latency improves.
  • When the multiplier is set to 1/6, the performance decreases by 0.34%, but the latency improves further.

3.3. Object Detection

Detection results on the Pascal VOC 2007 test set.

SSDLite with the proposed MobileNeXt backbone improves over the one with a MobileNetV2 backbone by nearly 1% mAP.

3.4. NAS Using Sandglass Block for DARTS

Cell structures searched on CIFAR-10 with DARTS. (a) Searched normal cell structure. (b) Searched reduction cell structure. ‘SGBlock’ denotes the proposed sandglass block.
  • With the sandglass block added as a new operator for NAS, the above normal cell and reduction cell are searched.
Results produced by different network architectures searched by DARTS on CIFAR-10

The resulting model achieves higher accuracy than the model with the original DARTS search space with about 25% parameter reduction.

  • However, when the inverted residual block is added to the search space instead, the searched model performs worse than the original one.

This demonstrates that the proposed sandglass block can generate more expressive representations than the inverted residual block.

--

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.