Review: ResNet-38 — Wider or Deeper ResNet? (Image Classification & Semantic Segmentation)

A Good Compromise Between Depth and Width: Outperforms DeepLabv2, FCN, CRF-RNN, DeconvNet, and DilatedNet; Comparable With DeepLabv3 and PSPNet.

Sik-Ho Tsang
5 min read · Aug 17, 2019

In this story, ResNet-38, by the University of Adelaide, is reviewed. Through an in-depth investigation of the width and depth of ResNet, a good trade-off between the depth and width of the ResNet model is found. The resulting network outperforms the original ResNet on image classification, and also performs well on semantic segmentation. This is a 2019 JPR (Journal of Pattern Recognition) paper with over 200 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Unravelled View of ResNets
  2. Wider or Deeper?
  3. Image Classification Approach
  4. Semantic Segmentation Approach
  5. Image Classification Results
  6. Semantic Segmentation Results

1. Unravelled View of ResNets

Unravelled View of a Simple ResNet with only Two Residual Units
  • Above is the unravelled view of a simple ResNet with only two residual units.
  • Some prior works claimed that a ResNet actually behaves as an exponential ensemble of relatively shallow networks. However, the unravelled view cannot be treated as 4 shallow subnetworks Ma, Mb, Mc, and Md (right of the figure).
  • Instead, it can only be treated as Ma, Mb, and Me (drawn as Me1/Me2).
  • Me cannot be further unravelled into Mc and Md, because its residual branch takes the sum of the earlier paths as input (see the sketch after this list).
  • Therefore, it is hard to tell whether Me is well-trained, i.e. “fully-trained”.
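To make this concrete, here is a tiny Python sketch (my own illustration, not from the paper) of the unravelled view for two residual units. The functions f1 and f2 are toy stand-ins for the two residual branches; the point is that the expansion contains the term f2(x + f1(x)), i.e. Me, whose input is a sum of earlier paths, so it cannot be split into independent shallow subnetworks f2(x) and f2(f1(x)) when f2 is nonlinear.

```python
def residual_net(x, f1, f2):
    """Two stacked residual units: h1 = x + f1(x), then h2 = h1 + f2(h1)."""
    h1 = x + f1(x)      # first residual unit
    h2 = h1 + f2(h1)    # second residual unit
    return h2

def unravelled(x, f1, f2):
    """Algebraically identical expansion of residual_net."""
    # The last term f2(x + f1(x)) is "Me": its input is the sum of two
    # earlier paths, so a nonlinear f2 cannot be distributed over it to
    # yield the separate shallow subnetworks "Mc" and "Md".
    return x + f1(x) + f2(x + f1(x))

# Tiny numeric check with toy nonlinear branches:
f1 = lambda v: 2.0 * max(v, 0.0)
f2 = lambda v: max(v, 0.0) + 1.0
assert residual_net(3.0, f1, f2) == unravelled(3.0, f1, f2)
```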

2. Wider or Deeper?

  • In practice, algorithms are often limited by their spatial costs (GPU memory usage). One remedy is to use more devices, which however increases the communication cost among them.
  • At a similar memory cost, a shallower but wider network can hold several times more trainable parameters, as the back-of-the-envelope sketch after this list illustrates.
  • Moreover, paths longer than the effective depth of a ResNet are not “fully-trained”. That means an overly deep ResNet brings little improvement, or can even perform worse.
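As a rough illustration of the memory argument (my own numbers, not from the paper), consider a single 3×3 convolution with equal input and output channel counts: doubling the width quadruples its parameter count, but the activation memory that dominates training cost only doubles.

```python
def conv3x3_params(channels):
    # Weights of a 3x3 convolution with `channels` inputs and outputs (bias ignored).
    return 3 * 3 * channels * channels

def activation_floats(channels, h, w):
    # Floats stored for one output feature map; activations are kept for
    # back-propagation, so they dominate GPU memory during training.
    return channels * h * w

c, h, w = 256, 56, 56
print(conv3x3_params(c), activation_floats(c, h, w))          # 589824 802816
print(conv3x3_params(2 * c), activation_floats(2 * c, h, w))  # 2359296 1605632
```

So, within a similar activation-memory budget, a shallower-but-wider network can pack several times more trainable parameters than a deeper-but-thinner one.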

3. Image Classification Approach

Proposed ResNets
  • The Pre-Activation ResNet design is used: batch normalization and ReLU are performed before each convolution (see the minimal sketch after this list).
  • Blue rectangle: convolution step; green triangle: down-sampling.
  • There are residual units B1-B7. B1-B5 each contain two 3×3 convolutions; B6-B7 use the bottleneck structure.
  • When using a 224×224 input, B1 is removed due to limited GPU memory.
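A minimal PyTorch sketch of a pre-activation residual unit in the two-convolution form of B1-B5 (channel counts here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Pre-activation residual unit: BN -> ReLU -> 3x3 conv, twice."""

    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))   # BN and ReLU come before the conv
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out                              # identity shortcut

x = torch.randn(1, 64, 56, 56)
y = PreActResidualUnit(64)(x)   # output shape equals input shape
```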

4. Semantic Segmentation Approach

  • Resolution: to generate score maps at 1/8 of the input resolution, down-sampling operations are removed and the dilation rates of the subsequent convolutions are increased.
  • Max pooling is harmful here due to its overly strong spatial invariance.
  • Classifier: one convolution is added to make the number of channels equal to the number of pixel categories, e.g. 21 for PASCAL VOC 2012; this head is denoted as “1 conv”.
  • One more 512-channel convolution can be added in the middle as well, denoted as “2 conv” (see the sketch after this list).
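A hedged PyTorch sketch of both adaptations; the 4096-channel trunk output, kernel sizes, and dilation rate are illustrative assumptions rather than the paper's exact configuration:

```python
import torch.nn as nn

num_classes = 21  # e.g. PASCAL VOC 2012

# "1 conv": a single convolution maps trunk features to per-class score maps.
head_1conv = nn.Conv2d(4096, num_classes, kernel_size=3, padding=1)

# "2 conv": an extra 512-channel convolution in the middle.
head_2conv = nn.Sequential(
    nn.Conv2d(4096, 512, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(512, num_classes, kernel_size=3, padding=1),
)

# Removing a down-sampling stride and dilating the convolution instead
# keeps the spatial resolution while preserving the receptive field:
conv_strided = nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1)
conv_dilated = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=2, dilation=2)
```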

5. Image Classification Results

ILSVRC 2012 val set

6. Semantic Segmentation Results

6.1. PASCAL VOC

PASCAL VOC val set
  • Model A with “1 conv”: 78.76% mIoU.
  • Model A with “2 conv”: 80.84% mIoU.
  • They both outperform ResNet-101 and ResNet-152 by a large margin.
PASCAL VOC test set

6.2. Cityscapes & ADE20K

Cityscapes val set and ADE20k val set
  • Model A2 is initialized with the weights of Model A and fine-tuned on the Places 365 dataset; with “2 conv”, it performs the best.
Cityscapes test set
ADE20K test set
  • Here, multi-scale testing, model averaging, and post-processing with CRFs are used. Again, Model A2 performs the best.

6.3. PASCAL-Context

PASCAL-Context val set
  • Model A2 with “2 conv” achieves 48.1% mIoU, outperforming DeepLabv2 by a large margin.

6.4. Visualizations

PASCAL VOC 2012 val set
Cityscapes val set
  • There are many more visualizations for the other datasets; please feel free to read the paper.

Reference

[2019 JPR] [ResNet-38]
Wider or Deeper: Revisiting the ResNet Model for Visual Recognition

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]
