Reading: IGCV3 — Interleaved Low-Rank Group Convolutions (Image Classification)
Outperforms MobileNetV2, MobileNetV1, ShuffleNet V1, NASNet-A, IGCV2, & IGCNet / IGCV1
In this story, IGCV3, by the University of Science and Technology of China and Microsoft Research Asia (MSRA), is briefly presented. In this paper:
- Inspired by the composition of structured sparse kernels, e.g., interleaved group convolutions (IGC), and the composition of low-rank kernels, e.g., bottleneck modules,
- IGCV3 combines these two design patterns, forming a convolutional kernel as a composition of structured sparse low-rank kernels.
This is a paper in 2018 BMVC with over 40 citations. (Sik-Ho Tsang @ Medium)
Outline
- Related Prior Arts
- Interleaved Low-Rank Group Convolutions: IGCV3
- Ablation Study
- SOTA Comparisons
1. Related Prior Arts
1.1. Interleaved Group Convolution (IGCV1)
- The IGCV1 block consists of a primary and a secondary group convolution; a reconstructed formulation is given after this list.
- P1 and P2 are permutation matrices, and the kernel matrices W1 and W2 are block-wise sparse (block-diagonal).
- Each block-wise sparse matrix is exactly a group convolution with Gi groups: each diagonal block holds the kernel of one group.
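A minimal reconstruction of the IGCV1 formulation in the notation above (my transcription; the exact symbols may differ slightly from the paper):

```latex
% IGCV1 block: two interleaved group convolutions (reconstructed)
x_{\text{out}} = P_2\,W^2\,P_1\,W^1\,x,
\qquad
W^1 = \operatorname{diag}\!\big(W^1_1,\dots,W^1_{G_1}\big),
\quad
W^2 = \operatorname{diag}\!\big(W^2_1,\dots,W^2_{G_2}\big)
```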
1.2. Interleaved Structured Sparse Convolution (IGCV2)
- Here W1 corresponds to a channel-wise spatial convolution, and W2 to WL correspond to group point-wise convolutions; the block composes these L sparse kernels, interleaved with permutations, as sketched below.
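Following the same notation, a reconstruction of the IGCV2 block, which composes L structured sparse (block-diagonal) kernels interleaved with permutations (again my transcription, not a verbatim copy):

```latex
% IGCV2 block: L interleaved structured sparse group convolutions (reconstructed)
x_{\text{out}} = P_L W^L \cdots P_2 W^2 P_1 W^1 x,
\qquad
W^l = \operatorname{diag}\!\big(W^l_1,\dots,W^l_{G_l}\big),\quad l = 1,\dots,L
```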
1.3. MobileNetV1
- A MobileNetV1 block consists of a channel-wise spatial convolution and a point-wise convolution (i.e., a depthwise separable convolution); a reconstructed formulation is given after this list.
- W1 and W2 correspond to the channel-wise and point-wise convolutions, respectively.
- It is an extreme case of IGCV1: the channel-wise convolution is a group convolution with one channel per group, and the point-wise convolution is a group convolution with a single group.
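In the same notation, a reconstructed MobileNetV1 block is simply the product of the two kernels, with no permutation needed:

```latex
% MobileNetV1 block (reconstructed)
x_{\text{out}} = W^2 W^1 x
% W^1: channel-wise 3x3 convolution (one spatial kernel per channel)
% W^2: dense point-wise 1x1 convolution
```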
1.4. MobileNetV2
- A MobileNetV2 block consists of a dense point-wise convolution, a channel-wise spatial convolution, and another dense point-wise convolution.
- It uses an inverted bottleneck: the first point-wise convolution increases the width and the second one reduces it.
- W1 corresponds to the channel-wise 3×3 convolution, whose kernel covers K = 9 spatial positions, and W0 and W2 are two low-rank matrices; a reconstructed formulation is given after this list.
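And a reconstructed MobileNetV2 inverted bottleneck in the same notation:

```latex
% MobileNetV2 inverted-bottleneck block (reconstructed)
x_{\text{out}} = W^2 W^1 W^0 x
% W^0: dense 1x1 that expands the width (a tall low-rank matrix)
% W^1: channel-wise 3x3 over K = 9 spatial positions
% W^2: dense 1x1 that reduces the width back (a wide low-rank matrix)
```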
2. Interleaved Low-Rank Group Convolutions: IGCV3
- The first group convolution is a group 1×1 convolution with G1 = 2 groups.
- The second is a channel-wise spatial convolution.
- The third is a group 1×1 convolution with G2 = 2 groups.
- It consists of a channel-wise spatial convolution, a low-rank group point-wise convolution with G1 groups that reduces the width, and a low-rank group point-wise convolution with G2 groups that expands the width back.
- P1 and P2 are permutation matrices similar to those in IGCV1.
- W1 corresponds to the channel-wise 3×3 convolution.
- Ŵ0 and W2 are low-rank structured sparse matrices: each is block-diagonal (one block per group), and each diagonal block is itself a low-rank matrix that changes the width. A code sketch of the whole block is given after this list.
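Putting the pieces together, here is a minimal PyTorch sketch of an IGCV3-style block based on the description above. The widths, the expansion ratio of 6 (first point-wise widening, second narrowing, as in MobileNetV2), the exact permutation and ReLU placement, and the residual connection are my assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


def channel_permute(x, groups):
    """Interleave channels across groups (the permutation P in the formulation)."""
    n, c, h, w = x.size()
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)


class IGCV3Block(nn.Module):
    """Group 1x1 (G1) -> permute -> depthwise 3x3 -> group 1x1 (G2) -> permute."""

    def __init__(self, in_ch, out_ch, expansion=6, g1=2, g2=2, stride=1):
        super().__init__()
        mid_ch = in_ch * expansion  # expanded inner width (illustrative assumption)
        self.g1, self.g2 = g1, g2
        # low-rank group point-wise convolution with G1 groups (changes the width)
        self.pw1 = nn.Conv2d(in_ch, mid_ch, 1, groups=g1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        # channel-wise (depthwise) 3x3 spatial convolution
        self.dw = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1,
                            groups=mid_ch, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        # low-rank group point-wise convolution with G2 groups (changes the width back)
        self.pw2 = nn.Conv2d(mid_ch, out_ch, 1, groups=g2, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.bn1(self.pw1(x))
        out = channel_permute(out, self.g1)       # permutation P1
        out = self.relu(self.bn2(self.dw(out)))   # ReLU placement is ablated in Sec. 3.2
        out = self.bn3(self.pw2(out))
        out = channel_permute(out, self.g2)       # permutation P2
        return x + out if self.use_residual else out


# quick shape check
block = IGCV3Block(in_ch=24, out_ch=24)
print(block(torch.randn(1, 24, 32, 32)).shape)   # torch.Size([1, 24, 32, 32])
```

The channel permutation is what lets the two sparse group convolutions complement each other: after interleaving, every output channel is connected to every input channel through some path, even though each individual convolution only mixes channels within its own groups.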
3. Ablation Study
3.1. Deeper and wider networks
- IGCV3 adopts two group convolutions with G1 = 2 and G2 = 2 for the deeper version (IGCV3-D), and with G1 = 4 and G2 = 4 for the wider version (IGCV3-W). The wider version follows the strict complementary condition from IGCV1.
- IGCV3-D performs the best since:
- (i) there are redundancies in the feature dimensions, so further enlarging the width does not bring gains;
- (ii) networks built by stacking bottleneck blocks improve as the depth increases.
3.2. ReLU Positions
- The second configuration (the one used in the IGCV3 block) clearly outperforms the other ReLU placements.
- (The paper does not explain the reasons in much detail.)
3.3. Number of branches in group convolutions
- It is found that the first group convolution prefers to be denser (i.e., to use fewer groups).
- The third group convolution projects the high-dimensional features back to the low-dimensional space, which already causes information loss, so making it sparser (using more groups, i.e., fewer kernel weights) has little effect on performance.
- In the experiments, G1 = 2 and G2 = 2 are adopted to reduce the memory cost while still achieving good performance (a rough parameter count is given after this list).
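As a rough illustration of why more groups means fewer parameters in a point-wise convolution (the widths below are chosen purely for illustration and are not taken from the paper):

```latex
% parameters of a 1x1 group convolution with C_in inputs, C_out outputs, G groups
\#\text{params} = \frac{C_{\text{in}}\,C_{\text{out}}}{G},
\qquad
\text{e.g. } C_{\text{in}} = 24,\; C_{\text{out}} = 144:\;
G = 1 \Rightarrow 3456,\quad
G = 2 \Rightarrow 1728,\quad
G = 4 \Rightarrow 864
```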
4. SOTA Comparisons
4.1. Comparisons with IGCV1 and IGCV2
- IGCV3 slightly outperforms the prior works on the CIFAR datasets, and achieves a significant improvement of about 1.5% on ImageNet.
4.2. Comparisons with Other Mobile Networks
- “Network s×” means the number of parameters of “Network 1.0×” is reduced by a factor of s.
- IGCV3 outperforms MobileNetV2 by a large margin with a similar number of parameters.
- Moreover, IGCV3 with 50% of the parameters, at the same depth as MobileNetV2, still achieves better performance. The reason may be that it uses half as many ReLUs as MobileNetV2.
- “Network (α)” means scaling the number of filters in “Network (1.0)” by α, so the overall complexity is roughly α² times that of “Network (1.0)” (e.g., α = 0.7 gives roughly half the complexity, since 0.7² ≈ 0.49).
- IGCV3 consistently outperforms the other networks: MobileNetV1, MobileNetV2, IGCV2, IGCNet / IGCV1, ShuffleNet V1, and NASNet-A.
4.3. COCO Detection
- IGCV3 is used as a backbone for detection networks.
- It follows the original SSDLite [31] framework, but replaces all the feature extraction blocks with IGCV3 blocks, denoted by “SSDLite2”.
- IGCV3 is slightly better than MobileNetV2 with fewer parameters, and outperforms YOLOv2 by 0.6% mAP with far fewer parameters.
The ShuffleNet V1 mentioned above has since been extended to ShuffleNet V2. Also, I haven’t covered SSDLite yet. I hope to review them in the future.
This is the 16th story in this month!
Reference
[2018 BMVC] [IGCV3]
IGCV3: Interleaved Low-Rank Group Convolutions for Efficient Deep Neural Networks
Image Classification
[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [Deep Roots] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3]