Review — Micro-Batch Training with Batch-Channel Normalization and Weight Standardization

Weight Standardization (WS) and Batch-Channel Normalization (BCN) are Proposed

Micro-Batch Training with Batch-Channel Normalization and Weight Standardization,
WS, BCN, by Johns Hopkins University
2020 arXiv v2, Over 40 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Group Normalization (GN), Weight Normalization (WN)

  • Weight Standardization (WS) standardizes the weights in convolutional layers.
  • Batch-Channel Normalization (BCN) combines batch and channel normalizations and leverages estimated statistics of the activations in convolutional layers.


  1. Weight Standardization (WS)
  2. Batch-Channel Normalization (BCN)
  3. Experimental Results

1. Weight Standardization (WS)

Comparing normalization methods on activations (blue) and Weight Standardization (orange)

1.1. WS

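  • Consider a standard convolutional layer with its bias term set to 0: y = ^W ∗ x, where ^W denotes the weights of the layer and ∗ denotes convolution.
  • In Weight Standardization (WS), instead of directly optimizing the loss L on the original weights ^W, the weights ^W are reparameterized as a function of W, i.e. ^W=WS(W), and the loss L is optimized on W by SGD.
  • WS(W) standardizes the weights of each output channel separately: the channel's mean is subtracted from each weight, and the result is divided by the channel's standard deviation.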
Computation graph for WS in feed-forwarding and backpropagation
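  • (.W, i.e. W with a dot above, is the intermediate symbol used in the paper for the mean-subtracted weights, before the division by the standard deviation.)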
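
A minimal sketch of the reparameterization above, in plain Python. It assumes the weights are stored as an O × I matrix (one row per output channel, with I covering the input channels times the kernel positions) and that a small eps is added inside the square root for numerical stability. The names weight_standardize and conv_at_one_position are hypothetical and are not taken from the paper's code.

```python
import math


def weight_standardize(weight, eps=1e-5):
    """Standardize each row (one output channel) of an O x I weight matrix."""
    standardized = []
    for row in weight:
        mean = sum(row) / len(row)
        var = sum((w - mean) ** 2 for w in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)  # eps placement is an assumption
        standardized.append([(w - mean) * inv_std for w in row])
    return standardized


def conv_at_one_position(weight, window):
    """Compute y = WS(W) * x for one flattened input window of length I."""
    w_hat = weight_standardize(weight)
    return [sum(w * v for w, v in zip(row, window)) for row in w_hat]
```

In training, the optimizer would update the raw weights W, and weight_standardize would be applied again at every forward pass, so the gradient flows through the standardization.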

1.2. Comparing WS with WN and CWN

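  • Weight Normalization (WN) reparameterizes each weight vector as a learnable length g multiplied by a unit-norm direction.
  • Later, Centered WN (CWN) adds a centering operation to WN: the mean of the weight vector is subtracted before it is normalized.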
  • (Please feel free to read WN and CWN for more details if interested.)
  • To compare with WN and CWN, WS can be written for the weights of a single output channel: the mean is subtracted, as in CWN, and the result is divided by the standard deviation instead of the norm.
  • The learnable length g is also removed.
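
The comparison can be written out side by side. This is a sketch in notation chosen here, which may differ from the paper's: v is the unnormalized weight vector of one output channel, I is its length, v̄ is its mean, and g is the learnable length. The small ε used for numerical stability is ignored.

```latex
% v: weights of one output channel (length I)
% \bar{v}: mean of the entries of v, subtracted from every entry
% g: learnable length (present in WN and CWN, absent in WS)
\begin{align*}
\text{WN:}  \quad & w = g \, \frac{v}{\lVert v \rVert} \\
\text{CWN:} \quad & w = g \, \frac{v - \bar{v}}{\lVert v - \bar{v} \rVert} \\
\text{WS:}  \quad & \hat{w} = \frac{v - \bar{v}}{\sigma_v}
  = \sqrt{I} \, \frac{v - \bar{v}}{\lVert v - \bar{v} \rVert},
  \qquad
  \sigma_v = \sqrt{\tfrac{1}{I} \textstyle\sum_{j=1}^{I} (v_j - \bar{v})^2}
\end{align*}
```

Under these assumptions, WS matches CWN with the length fixed at √I rather than learned.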

2. Batch-Channel Normalization (BCN)

  • Batch Normalization (BN) estimates its statistics across the batch. When the batch size is small, these estimates become noisy and BN harms training.
  • Batch-Channel Normalization (BCN) is proposed, which can be used for micro-batch training.
Micro-Batch BCN
  • ^μc and ^σc are not updated by the gradients computed from the loss function; instead, they are updated towards more accurate estimates of those statistics (Step 3 and Step 4).
  • BCN has a channel normalization following the estimate-based normalization. This makes the previously unstable estimate-based normalization stable.
  • (Some details need to be confirmed by reading the code.)
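
A rough sketch of the steps described above, in plain Python. Several choices here are assumptions that the source does not confirm: the channel normalization is taken to be Group-Normalization-style, the running estimates are moved toward the current batch statistics by simple interpolation with a hypothetical rate, and the learnable scale and shift parameters are omitted. Activations are nested lists indexed as [sample][channel][position], and all function names are hypothetical.

```python
import math

EPS = 1e-5


def estimate_based_norm(x, mu_hat, var_hat):
    """Normalize each channel with the running estimates, not batch statistics."""
    return [[[(v - mu_hat[c]) / math.sqrt(var_hat[c] + EPS) for v in channel]
             for c, channel in enumerate(sample)]
            for sample in x]


def channel_norm(x, groups):
    """Per sample, normalize over groups of channels (assumes C divisible by groups)."""
    out = []
    for sample in x:
        size = len(sample) // groups
        normalized = []
        for g in range(groups):
            chans = sample[g * size:(g + 1) * size]
            vals = [v for channel in chans for v in channel]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            inv_std = 1.0 / math.sqrt(var + EPS)
            normalized.extend([[(v - mean) * inv_std for v in channel]
                               for channel in chans])
        out.append(normalized)
    return out


def update_estimates(x, mu_hat, var_hat, rate):
    """Move the estimates toward this batch's per-channel mean and variance."""
    new_mu, new_var = [], []
    for c in range(len(mu_hat)):
        vals = [v for sample in x for v in sample[c]]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        new_mu.append(mu_hat[c] + rate * (mean - mu_hat[c]))
        new_var.append(var_hat[c] + rate * (var - var_hat[c]))
    return new_mu, new_var


def bcn_forward(x, mu_hat, var_hat, rate=0.1, groups=1):
    """Estimate-based normalization, then channel normalization, then update."""
    y = channel_norm(estimate_based_norm(x, mu_hat, var_hat), groups)
    # The estimates are updated from data, not by gradients of the loss.
    mu_hat, var_hat = update_estimates(x, mu_hat, var_hat, rate)
    return y, mu_hat, var_hat
```

Because the first step relies only on the stored estimates, it behaves the same for a batch of one sample as for a large batch, which is what makes the scheme usable for micro-batch training.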

3. Experimental Results

3.1. Image Classification

Top-1 Accuracy on ImageNet

GN and WS can be used together to improve the top-1 accuracy on ImageNet.

Error Rate on CIFAR-10 and CIFAR-100

While GN+WS has good performance, BCN+WS is even better.

3.2. Object Detection and Instance Segmentation

Object detection and instance segmentation results on COCO val2017 of Mask R-CNN and FPN with ResNet-50 and ResNet-101 as backbone

Similar trends are observed for object detection and instance segmentation on MS COCO val2017.

Later, another arXiv paper uses WS on BYOL. Please stay tuned.


