Reading: Deep Roots — Improving CNN Efficiency with Hierarchical Filter Groups (Image Classification)
In this story, Deep Roots, by the University of Cambridge and Microsoft Research, is briefly presented.
It is unlikely that every filter (or neuron) in a deep convolutional network needs to depend on the output of all the filters in the previous layer. In fact, reducing filter co-dependence in deep networks has been shown to benefit generalization.
In this paper:
- By using hierarchical filter groups, a much smaller model with less computation is obtained.
- Various architectures are validated by evaluating on the CIFAR10 and ILSVRC datasets.
This is a paper in 2017 CVPR with more than 100 citations. (Sik-Ho Tsang @ Medium)
1. Convolution with Filter Groups in AlexNet
- AlexNet used ‘filter groups’ in its convolutional layers, although their use was necessitated by the practical need to sub-divide the work of training a large network across multiple GPUs.
- A surprising side effect is that the grouped AlexNet network has approximately 57% fewer connection weights.
- Despite the large difference in the number of parameters between the models, both achieve comparable accuracy on ILSVRC; in fact, the smaller grouped network achieves 1% lower top-5 validation error.
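The weight savings from filter groups follow directly from the convolution shapes: with g groups, each filter connects to only 1/g of the input channels. A minimal sketch of this arithmetic, using AlexNet-conv2-like shapes (96 input channels, 256 filters, 5×5 kernels) for illustration:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution from c_in to c_out channels.

    With `groups` filter groups, each of the c_out filters sees only
    c_in / groups input channels, so the weight count shrinks by a
    factor of `groups` (biases ignored for simplicity).
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

# AlexNet-conv2-like shapes (illustrative):
full = conv_params(96, 256, 5, groups=1)   # 614,400 weights
split = conv_params(96, 256, 5, groups=2)  # 307,200 weights
print(full, split, split / full)           # grouped version has half the weights
```

Because only the spatial convolutions are grouped (and only some layers use groups), the whole-network saving (~57% for AlexNet) is a blend of such per-layer reductions.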
2. Root Module: Architecture
- The filter groups as shown in (b) and (c) are used to force the network to learn filters with only limited dependence on previous layers.
- This reduced connectivity also lowers computational complexity and model size, since the filters in each group operate on drastically fewer input channels.
- A root module has a given number of filter groups: the more filter groups, the fewer connections each filter has to the previous layer’s outputs. Each spatial convolutional layer is followed by a low-dimensional embedding (1×1 convolution).
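The cost of a root module is therefore a grouped spatial convolution plus a dense 1×1 embedding. A small sketch, using hypothetical NIN-like channel counts (192 → 192, 3×3 kernels, 8 groups) to show why the combination is still far cheaper than a standard convolution:

```python
def standard_conv_params(c_in, c_out, k):
    # Dense k x k convolution: every filter sees all input channels.
    return c_out * c_in * k * k

def root_module_params(c_in, c_mid, c_out, k, groups):
    # Grouped spatial convolution: each filter sees c_in / groups channels.
    spatial = c_mid * (c_in // groups) * k * k
    # Low-dimensional embedding: dense 1x1 convolution over all c_mid channels.
    embed = c_out * c_mid
    return spatial + embed

# Hypothetical NIN-like layer: 192 -> 192 channels with 3x3 filters.
baseline = standard_conv_params(192, 192, 3)             # 331,776 weights
root8 = root_module_params(192, 192, 192, 3, groups=8)   # 41,472 + 36,864 = 78,336
print(root8 / baseline)  # roughly a quarter of the baseline weights
```

Even though the 1×1 embedding is dense, it is cheap (no k×k factor), so the grouped spatial convolution dominates the savings.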
3. Root Module in NIN
3.1. NIN Variants
- NIN (Orig): composed of 3 spatial (5×5, 3×3) convolutional layers, each with a large number of filters (192).
- The original number of filters per layer is preserved, but they are subdivided into groups.
3.2. NIN on CIFAR10
- Compared to the baseline architecture, the root variants achieve a significant reduction in computation and model size without a significant reduction in accuracy.
- For example, the root-8 architecture gives equivalent accuracy with only 46% of the floating point operations (FLOPs) and 33% of the model parameters of the original network, and approximately 37% faster CPU and 23% faster GPU timings.
- The inter-layer correlation between the adjacent filter layers conv2c and conv3a in the network is shown above.
- The block-diagonalization enforced by the filter group structure is visible, and more so with a larger number of filter groups. This shows that the network learns an organization of filters in which strong filter relations are sparsely distributed.
3.3. Grouping Degree with Network Depth
- We might consider having the degree of grouping:
- (1) decrease with depth after the first convolutional layer, e.g. 1–8–4 (‘root’);
- (2) remain constant with depth after the first convolutional layer, e.g. 1–4–4 (‘column’);
- or (3) increase with depth, e.g. 1–4–8 (‘tree’).
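The three topologies can be compared on cost alone. A sketch, under the simplifying (hypothetical) assumption of equal channel counts in every layer, so each layer's spatial-conv weights scale as 1/g:

```python
def relative_cost(groups_per_layer):
    """Relative spatial-conv weight count vs. an ungrouped network,
    assuming (hypothetically) equal channel counts in every layer."""
    return sum(1 / g for g in groups_per_layer) / len(groups_per_layer)

topologies = {
    "root (1-8-4)":   [1, 8, 4],
    "column (1-4-4)": [1, 4, 4],
    "tree (1-4-8)":   [1, 4, 8],
}
for name, gs in topologies.items():
    print(name, round(relative_cost(gs), 3))
```

Under this assumption the root and tree schedules cost the same (the 1/g terms are just reordered), so the paper's finding that the root topology is more accurate is a statement about *where* grouping helps, not about total cost.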
- The results show that the so-called root topology gives the best performance, providing the smallest reduction in accuracy for a given reduction in model size and computational complexity.
4. Root Module in ResNet
4.1. ResNet Variant
- ResNet-50 has 50 convolutional layers, of which one-third are spatial convolutions (non-1×1).
- The spatial convolutional layers of the original network are replaced with root modules.
4.2. ResNet-50 on ILSVRC
- Results are similar to those for NIN.
- For example, the best result by accuracy (root-16) exceeds the baseline accuracy by 0.2% while reducing the model size by 27% and floating-point operations (multiply-add) by 37%. CPU timings were 23% faster, while GPU timings were 13% faster.
- With a drop in accuracy of only 0.1%, however, the root-64 model reduces the model size by 40% and the floating point operations by 45%. CPU timings were 31% faster, while GPU timings were 12% faster.
4.3. ResNet-200 on ILSVRC
- The models trained with roots have comparable or lower error, with fewer parameters and less computation.
- The root-64 model has 27% fewer FLOPS and 48% fewer parameters than ResNet-200.
5. Root Module in GoogLeNet
5.1. GoogLeNet Variant
- For all of the networks, grouped filters are applied within each of the ‘spatial’ convolutions (3×3, 5×5).
5.2. GoogLeNet on ILSVRC
- For many of the configurations the top-5 accuracy remains within 0.5% of the baseline model.
- The highest-accuracy result is 0.1% below the baseline model’s top-5 accuracy, but has 0.1% higher top-1 accuracy.
- While maintaining the same accuracy, this network has 9% faster CPU and GPU timings.
- However, a model with only 0.3% lower top-5 accuracy than the baseline has much larger gains in computational efficiency: 44% fewer floating point operations (multiply-add), 7% fewer model parameters, 21% faster CPU and 16% faster GPU timings.
It has been a long time since I last read a CVPR paper about image classification.
This is the 7th story this month!
[2017 CVPR] [Deep Roots]
Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups
[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [Deep Roots] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet]