Review — Micro-Batch Training with Batch-Channel Normalization and Weight Standardization

Weight Standardization (WS) and Batch-Channel Normalization (BCN) are Proposed

Sik-Ho Tsang
4 min read · Jun 26, 2022

Micro-Batch Training with Batch-Channel Normalization and Weight Standardization,
WS, BCN, by Johns Hopkins University
2020 arXiv v2, Over 40 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Group Normalization (GN), Weight Normalization (WN)

  • Weight Standardization (WS) standardizes the weights in convolutional layers.
  • Batch-Channel Normalization (BCN) combines batch and channel normalizations and leverages estimated statistics of the activations in convolutional layers.


  1. Weight Standardization (WS)
  2. Batch-Channel Normalization (BCN)
  3. Experimental Results

1. Weight Standardization (WS)

Comparing normalization methods on activations (blue) and Weight Standardization (orange)

1.1. WS

  • Consider a standard convolutional layer with its bias term set to 0:
  • In Weight Standardization (WS), instead of directly optimizing the loss L on the original weights Ŵ, the weights Ŵ are reparameterized as a function of W, i.e. Ŵ = WS(W).
  • where:
  • The loss L is optimized on W by SGD:
Computation graph for WS in feed-forwarding and backpropagation
  • (Ẇ is the intermediate symbol used in the paper.)
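
As a rough illustration of the reparameterization Ŵ = WS(W), here is a minimal NumPy sketch. It assumes that each output channel's weights are shifted to zero mean and scaled to unit standard deviation over the input-channel and kernel dimensions. The function name weight_standardize and the eps value are my own choices for illustration, not taken from the paper's code.

```python
import numpy as np

def weight_standardize(weight, eps=1e-5):
    """Standardize conv weights per output channel (illustrative sketch).

    weight: array of shape (out_channels, in_channels, kh, kw).
    eps is an assumed small constant for numerical stability.
    """
    out_channels = weight.shape[0]
    # Flatten everything except the output-channel axis.
    flat = weight.reshape(out_channels, -1)
    mean = flat.mean(axis=1, keepdims=True)
    std = np.sqrt(flat.var(axis=1, keepdims=True) + eps)
    # Zero mean, unit standard deviation for each output channel.
    return ((flat - mean) / std).reshape(weight.shape)

rng = np.random.default_rng(0)
W = rng.normal(loc=0.5, scale=3.0, size=(8, 16, 3, 3))  # the weights that SGD updates
W_hat = weight_standardize(W)                           # the weights used in the convolution

per_channel_mean = W_hat.reshape(8, -1).mean(axis=1)
per_channel_std = W_hat.reshape(8, -1).std(axis=1)
```

In a deep learning framework the same operation would sit inside the forward pass of the convolution, so that gradients flow back through the standardization to W.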

1.2. Comparing WS with WN and CWN

  • Later, Centered WN (CWN) adds a centering operation for WN:
  • (Please feel free to read WN and CWN for more details if interested.)
  • To compare with WN and CWN, WS is considered for only one output channel, and the corresponding weights are reformulated as:
  • The learnable length g is also removed.
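
To make the comparison concrete, here is a small NumPy sketch of the three reparameterizations for a single output channel with weight vector v. WN is written in its usual form g · v / ‖v‖, and CWN centers v before normalizing. The WS variant is my reading of the description above (center, divide by the standard deviation, no g), with eps set to 0 here only to make the relation exact. The function names are hypothetical.

```python
import numpy as np

def wn(v, g):
    # Weight Normalization: learnable length g times a unit-norm direction.
    return g * v / np.linalg.norm(v)

def cwn(v, g):
    # Centered WN: subtract the mean before normalizing to unit norm.
    centered = v - v.mean()
    return g * centered / np.linalg.norm(centered)

def ws(v, eps=0.0):
    # Weight Standardization for one output channel: zero mean,
    # unit standard deviation, and no learnable length g.
    centered = v - v.mean()
    return centered / np.sqrt(centered.var() + eps)

rng = np.random.default_rng(1)
v = rng.normal(size=27)  # e.g. 3 input channels with a 3x3 kernel

wn_out = wn(v, g=2.0)
cwn_out = cwn(v, g=1.0)
ws_out = ws(v)
```

Under these assumptions, WS points in the same direction as CWN, but its length is fixed at the square root of the number of weights in the channel instead of being learned.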

2. Batch-Channel Normalization (BCN)

  • Batch Normalization (BN) computes its statistics across the batch. When the batch size is small, these statistics become noisy and BN harms training.
  • Batch-Channel Normalization (BCN) is proposed, which can be used for micro-batch training.
Micro-Batch BCN
  • μ̂_c and σ̂_c are not updated by the gradients computed from the loss function; instead, they are updated towards more accurate estimates of those statistics (Step 3 and Step 4).
  • BCN has a channel normalization following the estimate-based normalization. This makes the previously unstable estimate-based normalization stable.
  • (Some details need to be confirmed by reading the code.)
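
The following NumPy sketch shows one way the micro-batch idea above could look in a forward pass. It is only an approximation under my own assumptions: the running estimates are stored as a mean and a variance per channel, they are moved towards the current batch statistics with a momentum term, the channel normalization is done in a Group Normalization style, and the learnable affine parameters are left out. The names bcn_forward, momentum and num_groups are hypothetical and not taken from the official code.

```python
import numpy as np

def bcn_forward(x, mu_hat, var_hat, num_groups=2, momentum=0.1, eps=1e-5):
    """Illustrative micro-batch BCN forward pass (not the official code).

    x: activations of shape (N, C, H, W).
    mu_hat, var_hat: running per-channel estimates, each of shape (C,).
    """
    # Estimate-based batch normalization: use the running estimates,
    # not the noisy statistics of the current micro-batch.
    x_hat = (x - mu_hat[None, :, None, None]) / np.sqrt(
        var_hat[None, :, None, None] + eps)

    # Move the estimates towards the current batch statistics.
    # No gradient from the loss is involved in this update.
    batch_mu = x.mean(axis=(0, 2, 3))
    batch_var = x.var(axis=(0, 2, 3))
    new_mu = mu_hat + momentum * (batch_mu - mu_hat)
    new_var = var_hat + momentum * (batch_var - var_hat)

    # Channel normalization (GN style): per sample, per group of channels.
    n, c, h, w = x_hat.shape
    grouped = x_hat.reshape(n, num_groups, -1)
    grouped = (grouped - grouped.mean(axis=2, keepdims=True)) / np.sqrt(
        grouped.var(axis=2, keepdims=True) + eps)
    return grouped.reshape(n, c, h, w), new_mu, new_var

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 5, 5))  # micro-batch of 2 samples
y, mu_hat, var_hat = bcn_forward(x, np.zeros(4), np.ones(4))
```

Even when the running estimates are still far from the true statistics, as in this first step, the channel normalization brings every group back to zero mean and unit variance, which matches the stabilizing role described above.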

3. Experimental Results

3.1. Image Classification

Top-1 Accuracy on ImageNet

GN and WS can be used together to improve the top-1 accuracy on ImageNet.

Error Rate on CIFAR-10 and CIFAR-100

While GN+WS has good performance, BCN+WS is even better.

3.2. Object Detection and Instance Segmentation

Object detection and instance segmentation results on COCO val2017 of Mask R-CNN and FPN with ResNet-50 and ResNet-101 as backbone

Similar trends are observed in Object Detection and Instance Segmentation on MS COCO Val 2017.

Later, another arXiv paper uses WS on BYOL. Please stay tuned.


