Review: ResNeXt — 1st Runner Up in ILSVRC 2016 (Image Classification)

Network-in-Neuron, A New Dimensionality: Cardinality

Sik-Ho Tsang
Towards Data Science


In this story, ResNeXt, by UC San Diego and Facebook AI Research (FAIR), is reviewed. The model name, ResNeXt, contains "Next". It means the next dimension, on top of ResNet. This next dimension is called the "cardinality" dimension. And ResNeXt became the 1st Runner Up of the ILSVRC 2016 classification task.

ILSVRC 2016 Classification Ranking http://image-net.org/challenges/LSVRC/2016/results#loc
Residual Block in ResNet (Left), A Block of ResNeXt with Cardinality = 32 (Right)

Compared with ResNet (the winner in ILSVRC 2015, 3.57%) and PolyNet (2nd Runner Up, 3.04%, team name CU-DeepLink), ResNeXt obtained a 3.03% top-5 error rate, which is a large relative improvement of about 15% over ResNet!

It was published in 2017 CVPR and had already received over 500 citations when I was writing this story. (Sik-Ho Tsang @ Medium)

What Are Covered

  1. Aggregated Transformation
  2. Relationship with Inception-ResNet, and Grouped Convolution in AlexNet
  3. Full Architecture and Ablation Study
  4. Results

1. Aggregated Transformation

1.1. Revisiting Simple Neuron

A Simple Neuron (Left), and the corresponding equation (Right)

For a simple neuron as above, the output is the summation of wi times xi. This operation can be recast as a combination of splitting, transforming, and aggregating (written out after the list below).

  • Splitting: the vector x is sliced into low-dimensional embeddings; in the above, each is the single-dimensional subspace xi.
  • Transforming: the low-dimensional representation is transformed, and in the above, it is simply scaled: wixi.
  • Aggregating: the transformations in all embeddings are aggregated by summation.
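In equation form, the inner product computed by the simple neuron (the equation on the right of the figure) is:

```latex
% Inner product of a simple neuron with D input channels:
% split x into the D subspaces x_i, scale each by w_i, aggregate by summation.
\sum_{i=1}^{D} w_i x_i
```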

1.2. Aggregated Transformations

A Block of ResNeXt with Cardinality = 32 (Left), and Its Generic Equation (Right)

In contrast to "Network-in-Network", this is "Network-in-Neuron": the network expands along a new dimension. Instead of the elementary linear transformation wi·xi performed on each path of a simple neuron, a nonlinear function is performed for each path.

A new dimension C, called "cardinality", is introduced. Cardinality controls the number of these more complex transformations to be aggregated.
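In the paper's notation, where each Ti is a transformation such as the bottleneck path shown in the figure, the aggregated transformation and the full residual block are:

```latex
% Aggregated transformation over C paths (cardinality C):
\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})

% Combined with the identity shortcut, the output of a ResNeXt block is:
\mathbf{y} = \mathbf{x} + \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})
```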

2. Relationship with Inception-ResNet and Grouped Convolution in AlexNet

(a) ResNeXt Block, (b) Inception-ResNet Block, (c) Grouped Convolution

For comparison, the above three blocks have the same internal dimensions within each block.

(a) ResNeXt Block (Left)

  • In each convolution path, Conv1×1–Conv3×3–Conv1×1 are applied, following the bottleneck design of the ResNet block. The internal dimension of each path is denoted d (d=4), and the number of paths is the cardinality C (C=32). Summing the Conv3×3 dimensions over all paths (d×C = 4×32) gives 128, the same as the original bottleneck.
  • The last Conv1×1 in each path increases the dimension directly from 4 to 256; the outputs of all paths are then added together, along with the skip connection path.
  • Compared with Inception-ResNet, which increases the dimension from 4 to 128 and then to 256, ResNeXt requires minimal extra effort in designing each path.
  • Unlike ResNet, in ResNeXt, the neurons in one path are not connected to the neurons in other paths. (A minimal sketch of this multi-branch form follows this list.)
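As a minimal sketch of the multi-branch form (a), assuming a PyTorch-style implementation; the class names and the fixed 256-channel, stride-1 setting are illustrative, not the authors' code:

```python
import torch.nn as nn

class Branch(nn.Module):
    """One path: Conv1x1 (256 -> d), Conv3x3 (d -> d), Conv1x1 (d -> 256)."""
    def __init__(self, in_ch=256, d=4):
        super().__init__()
        self.path = nn.Sequential(
            nn.Conv2d(in_ch, d, 1, bias=False), nn.BatchNorm2d(d), nn.ReLU(inplace=True),
            nn.Conv2d(d, d, 3, padding=1, bias=False), nn.BatchNorm2d(d), nn.ReLU(inplace=True),
            nn.Conv2d(d, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch),
        )

    def forward(self, x):
        return self.path(x)

class ResNeXtBlockMultiBranch(nn.Module):
    """Form (a): C independent bottleneck paths, aggregated by summation, plus the shortcut."""
    def __init__(self, in_ch=256, d=4, cardinality=32):
        super().__init__()
        self.branches = nn.ModuleList(Branch(in_ch, d) for _ in range(cardinality))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = sum(branch(x) for branch in self.branches)  # aggregate the C transformations
        return self.relu(out + x)                         # add the identity shortcut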

(b) Inception-ResNet Block (Middle)

  • This design is suggested in Inception-v4 to combine the Inception module and the ResNet block. In each convolution path, Conv1×1–Conv3×3 are done first.
  • The outputs of the 32 paths are then concatenated, giving a dimension of 4×32 = 128, and a Conv1×1 restores the dimension from 128 to 256.
  • Finally, the output is added with the skip connection path.
  • The main difference from the ResNeXt block is this early concatenation.

(c) Grouped Convolution in AlexNet (Right)

  • Conv1×1–Conv3×3–Conv1×1 are done along the convolution path, which is the bottleneck design suggested in ResNet. The Conv3×3 has a dimension of 128.
  • However, grouped convolution, as suggested in AlexNet, is used for the Conv3×3. This makes the Conv3×3 a wider but sparsely connected module: the neurons in one group are not connected to the neurons in other groups, which matches the path structure above.
  • Thus, there are 32 groups of convolutions. (AlexNet used only 2 groups.)
  • A skip connection runs in parallel and is added to the convolution path, so the convolution path learns the residual representation.

Although the structures in (b) and (c) are not always equivalent to the general form of the equation in Section 1.2, the authors tried all three structures and found that they give the same results.

Finally, the authors chose to implement structure (c) because it is more succinct and faster than the other two forms.
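A minimal PyTorch sketch of this grouped-convolution form (c), under the same illustrative assumptions as the earlier sketch; the groups argument of nn.Conv2d realizes the 32 sparsely connected paths in a single layer:

```python
import torch
import torch.nn as nn

class ResNeXtBlockGrouped(nn.Module):
    """Form (c): bottleneck block whose 3x3 convolution is grouped (width = d * C = 128)."""
    def __init__(self, in_ch=256, d=4, cardinality=32):
        super().__init__()
        width = d * cardinality  # 4 * 32 = 128
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # groups=32: each group of 4 channels only sees its own 4 input channels,
            # i.e. the 32 sparsely connected paths of form (a).
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + x)  # residual addition, then ReLU

# Quick shape check:
x = torch.randn(1, 256, 56, 56)
print(ResNeXtBlockGrouped()(x).shape)  # torch.Size([1, 256, 56, 56])
```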

3. Full Architecture and Ablation Study

3.1. Ablation Study of C and d Under Similar Complexity

Detailed Architecture (Left), Number of Parameters for Each Block (Top Right), Different Settings to Maintain Similar Complexity (Middle Right), Ablation Study for Different Settings Under Similar Complexity (Bottom Right)
  • ResNet-50 is a special case of ResNeXt-50 with C=1, d=64.
  • To have a fair comparison, ResNeXt models with different C and d but similar complexity to ResNet are tried, as shown at the middle right of the figure above. (A quick parameter-count check follows this list.)
  • ResNeXt-50 (32×4d) obtains 22.2% top-1 error on the ImageNet-1K dataset (1K means 1000 classes), while ResNet-50 obtains 23.9% top-1 error.
  • ResNeXt-101 (32×4d) obtains 21.2% top-1 error on ImageNet, while ResNet-101 obtains 22.0% top-1 error.
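The complexity-preserving settings in the middle-right table can be checked with the paper's parameter count for one template block, C·(256·d + 3·3·d·d + d·256) ≈ 70k. A quick check in Python (the helper function below is just for illustration):

```python
# Parameters of one template block with a 256-d input/output (ignoring BN and biases):
# C paths, each Conv1x1 (256 -> d) + Conv3x3 (d -> d) + Conv1x1 (d -> 256).
def block_params(C, d, width=256):
    return C * (width * d + 3 * 3 * d * d + d * width)

for C, d in [(1, 64), (2, 40), (4, 24), (8, 14), (32, 4)]:
    print(f"C={C:2d}, d={d:2d}: ~{block_params(C, d):,} parameters")
# C= 1, d=64: ~69,632 parameters  (the ResNet-50 bottleneck)
# C=32, d= 4: ~70,144 parameters  (ResNeXt-50, 32x4d)
```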

3.2. Importance of Cardinality

Ablation Study for Different Settings of 2× Complexity Models
  • ResNet-200: 21.7% top-1 and 5.8% top-5 error rates.
  • ResNet-101, wider: obtains only 21.3% top-1 and 5.7% top-5 error rates, which means that simply making the network wider does not help much.
  • ResNeXt-101 (2×64d): By just making C=2 (i.e. two convolution paths within the ResNeXt block), a clear improvement is already obtained, with 20.7% top-1 and 5.5% top-5 error rates.
  • ResNeXt-101 (64×4d): By making C=64 (i.e. 64 convolution paths within the ResNeXt block), an even larger improvement is obtained, with 20.4% top-1 and 5.3% top-5 error rates. This shows that cardinality is essential for improving classification accuracy.

3.3. Importance of Residual Connections

Importance of Residual Connections

Without residual connections, error rates increase substantially for both ResNet-50 and ResNeXt-50. Residual connections are important.

4. Results

4.1. ImageNet-1K

Single-Crop Testing: ResNet/ResNeXt at 224×224 and 320×320, Inception models at 299×299

ImageNet-1K is a subset of the 22K-class ImageNet dataset that contains 1000 classes. It is also the dataset used for the ILSVRC classification task.

  • With the standard 224×224 image size for single-crop testing, ResNeXt-101 obtains 20.4% top-1 and 5.3% top-5 error rates.
  • With the larger 320×320 image size for single-crop testing, ResNeXt-101 obtains 19.1% top-1 and 4.4% top-5 error rates, better than all state-of-the-art approaches: ResNet, Pre-Activation ResNet, Inception-v3, Inception-v4, and Inception-ResNet-v2.

4.2. ImageNet-5K

ImageNet-5K Results (All trained from scratch)

ImageNet-1K has become somewhat saturated after so many years of development.

ImageNet-5K is a subset of the 22K-class ImageNet dataset that contains 5000 classes, including the ImageNet-1K classes.

  • 6.8 million images, about 5× the size of the ImageNet-1K dataset.
  • Since there is no official train/validation split, the original ImageNet-1K validation set is used for evaluation.
  • 5K-way classification is a softmax over all 5K classes. Thus, errors are automatically counted whenever the network predicts one of the other 4K classes for an ImageNet-1K validation image.
  • 1K-way classification is a softmax over the 1K classes only.

As shown above, ResNeXt obtains better results than ResNet, as expected.

4.3. CIFAR-10 & CIFAR-100

CIFAR-10 and CIFAR-100 Results

CIFAR-10 and CIFAR-100 are two very famous 10-class and 100-class datasets.

  • Left: Compared with ResNet, ResNeXt consistently obtains better results on CIFAR-10.
  • Right: Compared with Wide ResNet (WRN), ResNeXt-29 (16×64d) obtains 3.58% and 17.31% errors on CIFAR-10 and CIFAR-100 respectively. These were the best results among all state-of-the-art approaches at that time.

4.4. MS COCO Object Detection

MS COCO Object Detection Results
  • By plugging ResNet/ResNeXt into Faster R-CNN, with similar model complexity, ResNeXt always outperforms ResNet on both AP@0.5 (IoU > 0.5) and AP (average precision) averaged over all IoU levels.

With the success of ResNeXt, it has also been adopted by Mask R-CNN for instance segmentation. I hope to cover Mask R-CNN later on as well.
