Review: ResNet-38 — Wider or Deeper ResNet? (Image Classification & Semantic Segmentation)

A Good Compromise Between the Depth and Width, Outperforms DeepLabv2, FCN, CRF-RNN, DeconvNet, DilatedNet, Comparable with DeepLabv3, PSPNet.

5 min readAug 17, 2019

In this story, ResNet-38, by University of Adelaide, is reviewed. By in-depth investigation of the width and depth of ResNet, a good trade-off between the depth and width of the ResNet model is found. It outperforms the original ResNet in image classification. Finally, it also has good performance in semantic segmentation. This is a 2019 JPR (Journal of Pattern Recognition) paper with over 200 citations. (Sik-Ho Tsang @ Medium)

Outline

Unravelled View of ResNets
Wider or Deeper?
Image Classification Approach
Semantic Segmentation Approach
Image Classification Results
Semantic Segmentation Results

1. Unravelled View of ResNets

**Unravelled View of a Simple** **ResNet** **with only Two Residual Units**

Above is the unravelled view of a simple ResNet with only two Residual Units.
Some prior arts claimed that ResNet actually behaved as exponential ensembles of relatively shallow networks. However, the unravelled view cannot be treated as 4 shallow subnetworks: Ma, Mb, Mc, Md (Right of the figure).
Instead, it can only be treated as Ma, Mb and Me1/Me2 only.
Me cannot be further unravelled into Mc and Md.
Therefore, it is hard to tell whether Me is well-trained, or “fully-trained”.

2. Wider or Deeper?

In practice, algorithms are often limited by their spatial costs (GPU memory usages). One way is to use more devices, which will however increase communication costs among them.
With similar memory costs, a shallower but wider network can have times more number of trainable parameters.
And paths longer than the effective depth in ResNets are not “fully-trained”. That means, too deep ResNet cannot bring too obvious improvement or even worse.

3. Image Classification Approach

Pre-Activation ResNet is used. That means for batch norm and ReLU are performed each convolution.
Blue rectangle: Convolution step, Green triangle: Down-sampling,
And there are B1-B7 residual units. For B1-B5, there are two 3×3 convolutions For B6-B7, bottleneck structure is used.
When using input 224×224, B1 is removed due to limited GPU memory.

4. Semantic Segmentation Approach

Resolution: To generate score maps at 1/8 resolution, down-sampling operations are removed and dilation rates are increased in some convolutions.
Max pooling is harmful due to too strong spatial invariance.
Classifier: One convolution is added to make the channel number equals to number of pixel categories, e.g. 21 for PASCAL VOC 2012, denoted as “1 conv”.
One more 512-channel convolution can be added at the middle as well, denoted as “2 conv”.

5. Image Classification Results

Model A, with input 224×224, B1 removed, only depth of 38, it got 19.2% and 4.7% top-1 and top-5 errors respectively, outperforms ResNet, Inception-v4 and Inception-ResNet-v2.