Review: MobileNetV2 — Light Weight Model (Image Classification)
Outperforms MobileNetV1, NASNet, and ShuffleNet V1
In this story, MobileNetV2, by Google, is briefly reviewed. In the previous version, MobileNetV1, depthwise separable convolution was introduced, which dramatically reduces the computational cost and model size of the network, making it suitable for mobile devices or any device with low computational power. In MobileNetV2, a better module is introduced with an inverted residual structure, and the non-linearities in narrow layers are removed. With MobileNetV2 as the backbone for feature extraction, state-of-the-art performance is also achieved for object detection and semantic segmentation. This is a 2018 CVPR paper with more than 200 citations. (Sik-Ho Tsang @ Medium)
Outline
- MobileNetV2 Convolutional Blocks
- Overall Architecture
- Ablation Study
- Experimental Results
1. MobileNetV2 Convolutional Blocks
1.1. MobileNetV1
- In MobileNetV1, each block has 2 layers.
- The first layer is called a depthwise convolution; it performs lightweight filtering by applying a single convolutional filter per input channel.
- The second layer is a 1×1 convolution, called a pointwise convolution, which is responsible for building new features through computing linear combinations of the input channels.
- ReLU6, i.e. min(max(x, 0), 6), is used here for comparison. (Actually, in the MobileNetV1 tech report, I cannot find any hint that they use ReLU6… Maybe we need to check the code on GitHub…)
- ReLU6 is used due to its robustness when used with low-precision computation, based on [27] MobileNetV1.
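To make the two-layer block concrete, here is a minimal sketch in plain Python of ReLU6 and of the multiply-add counts for a regular convolution versus a depthwise separable one. The function names and the example sizes (a 56×56 feature map, 64 input channels, 128 output channels) are illustrative assumptions, not values taken from the paper. The cost formulas are the usual ones: h·w·d_in·d_out·k² for a regular convolution, and h·w·d_in·(k² + d_out) for depthwise plus pointwise.

```python
def relu6(x: float) -> float:
    """ReLU capped at 6, i.e. min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)


def standard_conv_madds(h: int, w: int, d_in: int, d_out: int, k: int = 3) -> int:
    """Multiply-adds of a regular k x k convolution on an h x w feature map."""
    return h * w * d_in * d_out * k * k


def depthwise_separable_madds(h: int, w: int, d_in: int, d_out: int, k: int = 3) -> int:
    """Multiply-adds of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * d_in * k * k   # one k x k filter per input channel
    pointwise = h * w * d_in * d_out   # linear combinations of the input channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    full = standard_conv_madds(56, 56, 64, 128)
    separable = depthwise_separable_madds(56, 56, 64, 128)
    print([relu6(v) for v in (-2.0, 3.5, 9.0)])
    print(f"regular: {full}, separable: {separable}, ratio: {full / separable:.2f}")
```

With 3×3 kernels, the saving approaches a factor of 9 as the number of output channels grows, which is why the depthwise separable form is attractive on low-power devices.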
1.2. MobileNetV2
- In MobileNetV2, there are two types of blocks. One is a residual block with a stride of 1. The other is a block with a stride of 2 for downsizing.
- There are 3 layers for both types of blocks.
- This time, the first layer is a 1×1 convolution with ReLU6.
- The second layer is the depthwise convolution.
- The third layer is another 1×1 convolution but without any non-linearity. It is claimed that if ReLU is used again, the deep networks only have the power of a linear classifier on the non-zero volume part of the output domain.
- There is also an expansion factor t, with t=6 for all main experiments.
- If the input has 64 channels, the internal output has 64×t=64×6=384 channels.
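The three-layer block above can be sketched as a small channel "plan" in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the ReLU6 after the depthwise layer follows the paper's block table rather than the bullets above, and the rule "use the shortcut only when the stride is 1 and input and output channels match" is the convention common in reference implementations. The multiply-add count h·w·d_in·t·(d_in + k² + d_out) is the sum of the three layers for a stride-1 block.

```python
def inverted_residual_plan(d_in: int, d_out: int, stride: int, t: int = 6, k: int = 3):
    """Describe one bottleneck block as (layer, in_channels, out_channels, activation)."""
    hidden = d_in * t  # expanded width inside the block
    layers = [
        ("1x1 expansion", d_in, hidden, "ReLU6"),
        (f"{k}x{k} depthwise, stride {stride}", hidden, hidden, "ReLU6"),
        ("1x1 projection", hidden, d_out, "linear"),  # no non-linearity here
    ]
    # Assumption: the residual shortcut needs matching shapes on both ends.
    use_shortcut = stride == 1 and d_in == d_out
    return layers, use_shortcut


def bottleneck_madds(h: int, w: int, d_in: int, d_out: int, t: int = 6, k: int = 3) -> int:
    """Multiply-adds of a stride-1 block on an h x w input."""
    return h * w * d_in * t * (d_in + k * k + d_out)


if __name__ == "__main__":
    layers, shortcut = inverted_residual_plan(64, 64, stride=1)
    for name, c_in, c_out, act in layers:
        print(f"{name:28s} {c_in:4d} -> {c_out:4d}  {act}")
    print("shortcut:", shortcut)
```

The block is wide in the middle and narrow at both ends, which is the opposite of a classical residual bottleneck and is where the name "inverted residual" comes from.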
2. Overall Architecture
- In the architecture table of the paper, t is the expansion factor, c the number of output channels, n the number of times the block is repeated, and s the stride. 3×3 kernels are used for spatial convolution.
- Typically, the primary network (width multiplier 1, 224×224) has a computational cost of 300 million multiply-adds and uses 3.4 million parameters. (The width multiplier is introduced in MobileNetV1.)
- The performance trade-offs are further explored for input resolutions from 96 to 224 and width multipliers from 0.35 to 1.4.
- The network's computational cost goes up to 585M MAdds, while the model size varies between 1.7M and 6.9M parameters.
- To train the network, 16 GPUs are used with a batch size of 96.
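As a rough illustration of how the (t, c, n, s) notation unrolls into individual blocks, here is a minimal Python sketch. The rows are reproduced from my recollection of the paper's architecture table and should be checked against it. The starting point (32 channels at 112×112, after an initial stride-2 convolution on a 224×224 input) is likewise an assumption taken from the paper, and the function name is hypothetical.

```python
# (t, c, n, s) rows; recalled from the paper's table, verify before relying on them.
BOTTLENECK_ROWS = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def expand_rows(rows, in_channels: int = 32, resolution: int = 112):
    """Unroll (t, c, n, s) rows into one dict per block.

    Only the first block of each row uses stride s; the other n - 1 use stride 1.
    "resolution" is the spatial size at the output of the block.
    """
    blocks = []
    for t, c, n, s in rows:
        for i in range(n):
            stride = s if i == 0 else 1
            resolution //= stride
            blocks.append({"t": t, "in": in_channels, "out": c,
                           "stride": stride, "resolution": resolution})
            in_channels = c
    return blocks


if __name__ == "__main__":
    for block in expand_rows(BOTTLENECK_ROWS):
        print(block)
```

Under these assumptions the sketch yields 17 bottleneck blocks, ending with 320 channels at 7×7 resolution.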
3. Ablation Study
3.1. Impact of Linear Bottleneck
- With the removal of ReLU6 at the output of each bottleneck module, accuracy is improved.
3.2. Impact of Shortcut
- Placing the shortcut between bottlenecks outperforms placing it between expansions, and also outperforms using no residual connection at all.
4. Experimental Results
4.1. ImageNet Classification
- MobileNetV2 outperforms MobileNetV1 and ShuffleNet (1.5) with comparable model size and computational cost.
- With a width multiplier of 1.4, MobileNetV2 (1.4) outperforms ShuffleNet (×2) and NASNet, with faster inference time.
- Different input resolutions and width multipliers are also compared. MobileNetV2 consistently outperforms MobileNetV1.
4.2. MS COCO Object Detection
- First, SSDLite is introduced by replacing the regular convolutions in SSD with depthwise separable convolutions (the kind used in MobileNetV1).
- SSDLite dramatically reduces both parameter count and computational cost.
- MobileNetV2 + SSDLite achieves competitive accuracy with significantly fewer parameters and smaller computational complexity.
- The inference time is also shorter than that of the MobileNetV1 counterpart.
- Notably, MobileNetV2 + SSDLite is 20× more efficient and 10× smaller while still outperforming YOLOv2 on the COCO dataset.
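To give a feel for why SSDLite is smaller, the sketch below counts the weights of a single prediction layer when a regular 3×3 convolution is swapped for a depthwise separable one. The layer sizes (512 input channels, 24 output channels) are hypothetical and are not taken from the paper. Biases and batch normalization parameters are ignored, and the function names are illustrative only.

```python
def regular_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a k x k depthwise conv plus a 1 x 1 pointwise conv (bias ignored)."""
    return k * k * c_in + c_in * c_out


if __name__ == "__main__":
    # Hypothetical prediction layer: 512 input channels, 24 outputs.
    regular = regular_conv_params(512, 24)
    separable = separable_conv_params(512, 24)
    print(f"regular: {regular}, separable: {separable}, ratio: {regular / separable:.2f}")
```

The overall savings reported for SSDLite depend on every layer in the detector, so this single-layer count is only indicative.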
4.3. PASCAL VOC 2012 Semantic Segmentation
- Here, MobileNetV2 is used as the feature extractor for DeepLabv3.
- With Atrous Spatial Pyramid Pooling (ASPP) and Multi-Scale and Flipping (MF) disabled, and with the output stride changed from 8 to 16, an mIOU of 75.32% is obtained at a much lower model size and computational cost.
Reference
[2018 CVPR] [MobileNetV2]
MobileNetV2: Inverted Residuals and Linear Bottlenecks