Review — Group Norm (GN): Group Normalization (Image Classification)

Suitable for Memory-Constrained Applications With Small Batch Sizes Like Object Detection & Segmentation, Outperforms Batch Norm (BN), Layer Norm (LN) & Instance Norm (IN)

Error Rate of Batch Norm (BN) Increases When Batch Size Decreases, While Group Norm (GN) Can Still Maintain a Similar Error Rate
  • This limits BN’s usage for training larger models in memory-constrained computer vision tasks such as detection, segmentation, and video.
  • GN divides the channels into groups and computes within each group the mean and variance for normalization.
  • GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes.

Outline

  1. Normalization Methods: BN, LN, IN
  2. Group Normalization (GN)
  3. Image Classification Results in ImageNet
  4. Object Detection and Segmentation in COCO

1. Normalization Methods: BN, LN, IN

Normalization methods
  • Batch Norm (BN) computes μ and σ along the (N, H, W) axes, i.e., per channel over the whole mini-batch. Layer Norm (LN) computes μ and σ along the (C, H, W) axes for each sample. Instance Norm (IN) computes μ and σ along the (H, W) axes for each sample and each channel.
  • All methods of BN, LN, and IN learn a per-channel linear transform y = γ·x̂ + β to compensate for the possible loss of representational ability, as sketched below:
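For reference, a minimal NumPy sketch (not from the paper) of which axes each method reduces over for a [N, C, H, W] feature map, followed by the shared per-channel affine transform:

```python
import numpy as np

x = np.random.randn(8, 64, 32, 32)   # feature map of shape [N, C, H, W]
eps = 1e-5

def normalize(x, axes):
    # Normalize x using the mean and variance computed over the given axes.
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

bn = normalize(x, (0, 2, 3))      # BN: per channel, over the batch and spatial dims
ln = normalize(x, (1, 2, 3))      # LN: per sample, over all channels and spatial dims
inorm = normalize(x, (2, 3))      # IN: per sample and per channel, over spatial dims only

# Each method then applies the learned per-channel transform y = gamma * x_hat + beta.
gamma = np.ones((1, 64, 1, 1))
beta = np.zeros((1, 64, 1, 1))
y = gamma * bn + beta
```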

2. Group Normalization (GN)

Python code of Group Norm based on TensorFlow
  • GN computes μ and σ along the (H,W) axes and along a group of C/G channels.
  • The above figure (rightmost) shows a simple case of 2 groups (G = 2), each having 3 channels.
  • Specifically, the pixels in the same group are normalized together by the same μ and σ. GN also learns the per-channel γ and β.
  • GN can be easily implemented by a few lines of code in either PyTorch or TensorFlow, as shown in the sketch below.
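Below is a minimal TensorFlow sketch of GN, closely following the few-lines pseudocode given in the paper (TF2-style API; tensor shapes are assumed to be static):

```python
import tensorflow as tf

def group_norm(x, gamma, beta, G, eps=1e-5):
    # x: input features of shape [N, C, H, W]
    # gamma, beta: per-channel scale and offset of shape [1, C, 1, 1]
    # G: number of groups for GN (must divide C)
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    # mean and variance over the C/G channels of each group and the (H, W) axes
    mean, var = tf.nn.moments(x, axes=[2, 3, 4], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```

In PyTorch, the built-in layer torch.nn.GroupNorm(num_groups, num_channels) provides the same operation with learnable per-channel γ and β.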

2.1. Relation to Layer Normalization

  • GN becomes LN if we set the group number as G = 1. GN is less restricted than LN, because each group of channels (instead of all of them) is assumed to share the same mean and variance.
  • The model still has the flexibility of learning a different distribution for each group. This leads to improved representational power of GN over LN.

2.2. Relation to Instance Normalization

  • GN becomes IN if we set the group number as G = C (i.e., one channel per group). But IN can only rely on the spatial dimensions for computing the mean and variance, and it misses the opportunity of exploiting the channel dependence. Both limiting cases are illustrated in the sketch below.
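The two limiting cases can be checked numerically. A small NumPy sketch (not from the paper, affine transform omitted) comparing GN with G = 1 against LN and GN with G = C against IN:

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    # x: [N, C, H, W]; normalize each group of C // G channels (affine omitted)
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)
    x = (x - x.mean(axis=(2, 3, 4), keepdims=True)) / np.sqrt(x.var(axis=(2, 3, 4), keepdims=True) + eps)
    return x.reshape(N, C, H, W)

x = np.random.randn(2, 6, 4, 4)
eps = 1e-5

# G = 1: one group spanning all channels -> Layer Norm statistics (per sample over C, H, W)
ln = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(x.var(axis=(1, 2, 3), keepdims=True) + eps)
print(np.allclose(group_norm(x, G=1), ln))              # True

# G = C: one channel per group -> Instance Norm statistics (per sample and channel over H, W)
inorm = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
print(np.allclose(group_norm(x, G=x.shape[1]), inorm))  # True
```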

3. Image Classification Results in ImageNet

3.1. Regular Batch Size

Comparison of error curves with a batch size of 32 images/GPU
Comparison of error rates (%) of ResNet-50 in the ImageNet validation set, trained with a batch size of 32 images/GPU
  • The above figure (left) shows that GN has lower training error than BN, indicating that GN is effective for easing optimization.
  • The slightly higher validation error of GN implies that GN loses some of BN’s regularization ability: BN’s mean and variance computation introduces uncertainty caused by stochastic batch sampling, which helps regularization.

3.2. Small Batch Size

Sensitivity to batch sizes
  • GN has very similar curves (subject to random variations) across a wide range of batch sizes from 32 to 2.
  • In the case of a batch size of 2, GN has 10.6% lower error rate than its BN counterpart (24.1% vs. 34.7%).

3.3. Group Division

Group division.
  • In the extreme case of G = 1, GN is equivalent to LN.
  • In the extreme case of 1 channel per group, GN is equivalent to IN.

3.4. Results and Analysis of VGG Models

Evolution of feature distributions of conv5_3 (the last convolutional layer)
  • GN and BN behave qualitatively similarly. This comparison suggests that performing normalization is essential for controlling the distribution of features.

4. Object Detection and Segmentation in COCO

  • Evaluation is also performed by fine-tuning the ImageNet-pretrained models and transferring them to object detection and segmentation.

4.1. Results of C4 Backbone

  • On this baseline, GN improves over BN* by 1.1 box AP and 0.8 mask AP. (BN* is BN kept frozen during fine-tuning; fine-tuning BN instead gives a worse result.)
  • BN*, being frozen, creates an inconsistency between pre-training and fine-tuning.

4.2. Results of FPN Backbone

Detection and segmentation ablation results in COCO, using Mask R-CNN with ResNet-50 FPN.
  • (Applying BN to the box head was also tried, but the result is about 9 AP worse.)
  • Applying GN to the backbone alone contributes a 0.5 AP gain (from 39.5 to 40.0), suggesting that GN helps when transferring features.
Detection and segmentation results in COCO using Mask R-CNN and FPN
  • It is found that GN is not fully trained with the default schedule. Trained with more iterations (“long”), the final ResNet-50 GN model is 2.2 points of box AP and 1.6 points of mask AP better than its BN* variant.

4.3. Training Mask R-CNN from Scratch

Detection and segmentation results trained from scratch in COCO using Mask R-CNN and FPN
  • The from-scratch results can even compete with the ImageNet-pretrained results shown in the table in Section 4.2 above.

4.4. Video Classification in Kinetics

Video classification results in Kinetics
  • The models are pre-trained on ImageNet. For both BN and GN, the normalization is extended from over (H, W) to over (T, H, W), as sketched after this list. The model is fully convolutional in spacetime.
  • For the batch size of 8, GN is slightly worse than BN by 0.3% top-1 accuracy and 0.1% top-5. GN is competitive with BN when BN works well.
  • For the smaller batch size of 4, GN’s accuracy stays similar to its batch-size-8 result (72.8 / 90.6 vs. 73.0 / 90.6) and is better than BN’s 72.1 / 90.0.
  • BN’s accuracy is decreased by 1.2% when the batch size decreases from 8 to 4.
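A minimal NumPy sketch (not from the paper) of how GN extends to spacetime features, assuming a [N, C, T, H, W] layout:

```python
import numpy as np

def group_norm_3d(x, gamma, beta, G, eps=1e-5):
    # x: spacetime features of shape [N, C, T, H, W]
    # gamma, beta: per-channel scale and offset of shape [1, C, 1, 1, 1]
    # Statistics are computed over the C // G channels of each group and the (T, H, W)
    # axes, mirroring the (H, W) reduction used in the 2D image case.
    N, C, T, H, W = x.shape
    x = x.reshape(N, G, C // G, T, H, W)
    mean = x.mean(axis=(2, 3, 4, 5), keepdims=True)
    var = x.var(axis=(2, 3, 4, 5), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return gamma * x.reshape(N, C, T, H, W) + beta
```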
Error curves in Kinetics with an input length of 32 frames
