Review — Augment Your Batch: Improving Generalization Through Instance Repetition

Batch Augmentation (BA), Augments Samples within the Batch

Sik-Ho Tsang
Mar 24, 2022

Augment Your Batch: Improving Generalization Through Instance Repetition, Batch Augmentation (BA), by ETH Zurich, and Technion
2020 CVPR, Over 50 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Data Augmentation, Machine Translation, Dropout

  • Batch Augmentation is proposed: replicating instances of samples within the same batch with different data augmentations.
  • Batch augmentation reduces the number of necessary SGD updates to achieve the same accuracy.


  1. Batch Augmentation (BA)
  2. Experimental Results

1. Batch Augmentation (BA)

1.1. Vanilla SGD Without BA

  • Consider a model with a loss function ℓ(w, x_n, y_n), where {x_n, y_n}, n = 1, …, N, is a dataset of N sample-target pairs with x_n ∈ X, and T: X → X is some data augmentation transformation applied to each example, e.g., a random crop of an image.
  • The common training procedure for each batch consists of the following update rule (here using vanilla SGD with learning rate η and batch size B, for simplicity):

      w_{t+1} = w_t − η · (1/B) · Σ_{n ∈ B(k(t))} ∇_w ℓ(w_t, T(x_n), y_n)

  • where k(t) is sampled from [N/B] = {1, …, N/B}, and B(t) is the set of samples in batch t.
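
As a rough illustration of the vanilla update described above, here is a minimal pure-Python sketch. The linear model with squared-error loss, the Gaussian-noise `augment` function standing in for T, and all names and constants are assumptions chosen for illustration; none of them come from the paper.

```python
import random

def augment(x, rng, sigma=0.1):
    """Hypothetical stand-in for T: add Gaussian noise to every feature."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w . x - y)**2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def sgd_step(w, batch, lr, rng, sigma=0.1):
    """One vanilla SGD update: each sample is augmented once, and the
    per-sample gradients are averaged over the B samples in the batch."""
    B = len(batch)
    g = [0.0] * len(w)
    for x, y in batch:
        gi = grad_squared_error(w, augment(x, rng, sigma), y)
        g = [a + b / B for a, b in zip(g, gi)]
    return [wi - lr * gi for wi, gi in zip(w, g)]

# Toy batch of B = 2 (input, target) pairs; values are illustrative only.
rng = random.Random(0)
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
w_next = sgd_step([0.0, 0.0], batch, lr=0.5, rng=rng)
```

Each sample contributes exactly one augmented view per step, which is the behaviour that BA changes.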

1.2. SGD with BA

  • BA introduces M instances of the same input sample, each produced by a different transformation T_i, where the subscript i ∈ [M] indexes the transformations.
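  • The learning rule is modified as follows:

      w_{t+1} = w_t − η · (1/(M·B)) · Σ_{i=1..M} Σ_{n ∈ B(k(t))} ∇_w ℓ(w_t, T_i(x_n), y_n)

  • where the effective batch of size M·B consists of B samples, each augmented by M different transformations.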
  • This updated rule can be computed either by evaluating on the whole M·B batch or by accumulating M instances of the original gradient computation.
  • Because each update still covers B distinct samples, using the larger M·B batch as part of batch augmentation does not change the number of SGD iterations performed per epoch.
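
The two ways of computing the BA update mentioned above (one pass over the whole M·B batch, or accumulating M gradient computations over the original B samples) can be sketched as follows. This is only a toy sketch: the squared-error linear model and the two deterministic transforms (`identity` and a hypothetical `flip` that reverses the feature vector) are assumptions made so that the two variants can be compared exactly.

```python
def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w . x - y)**2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def ba_step_full(w, batch, transforms, lr):
    """Build the whole M*B augmented batch, then average its gradients."""
    big_batch = [(T(x), y) for T in transforms for x, y in batch]
    g = [0.0] * len(w)
    for x, y in big_batch:
        gi = grad_squared_error(w, x, y)
        g = [a + b for a, b in zip(g, gi)]
    n = len(big_batch)  # n == M * B
    return [wi - lr * gi / n for wi, gi in zip(w, g)]

def ba_step_accumulated(w, batch, transforms, lr):
    """Accumulate M passes over the original B samples, then update once."""
    M, B = len(transforms), len(batch)
    g = [0.0] * len(w)
    for T in transforms:          # one size-B gradient computation per T_i
        for x, y in batch:
            gi = grad_squared_error(w, T(x), y)
            g = [a + b / (M * B) for a, b in zip(g, gi)]
    return [wi - lr * gi for wi, gi in zip(w, g)]

def identity(x):
    return list(x)

def flip(x):
    """Hypothetical augmentation: reverse the feature order."""
    return list(reversed(x))

batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]   # B = 2
transforms = [identity, flip]                     # M = 2
w_full = ba_step_full([0.0, 0.0], batch, transforms, lr=1.0)
w_acc = ba_step_accumulated([0.0, 0.0], batch, transforms, lr=1.0)
```

Both variants apply a single weight update per batch of B distinct samples. The accumulated form trades extra passes for a lower memory footprint, since only B augmented samples are processed at a time.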

1.3. BA Applied on Intermediate Layers

  • BA can also apply its transformations at intermediate layers, rather than only to the input.
  • For example, we can use the common Dropout to generate multiple instances of the same sample in a given layer.
  • BA with Dropout is applied to the language modeling and machine translation tasks.
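
One possible reading of Dropout-based BA is sketched below: each hidden vector in the batch is replicated M times, and every copy receives its own independently sampled dropout mask. The function names, the inverted-dropout scaling, and the toy values are assumptions for illustration, not the authors' implementation.

```python
import random

def dropout(h, p, rng):
    """Inverted dropout: zero each unit with probability p and rescale
    the surviving units by 1 / (1 - p)."""
    keep = 1.0 - p
    return [hi / keep if rng.random() < keep else 0.0 for hi in h]

def batch_augment_hidden(hidden_batch, targets, M, p, rng):
    """Replicate every hidden vector M times, each copy with a fresh
    dropout mask; targets are repeated in the same order."""
    aug_hidden = [dropout(h, p, rng) for _ in range(M) for h in hidden_batch]
    aug_targets = list(targets) * M
    return aug_hidden, aug_targets

rng = random.Random(0)
hidden = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # B = 2 hidden activations
labels = [0, 1]
aug_h, aug_y = batch_augment_hidden(hidden, labels, M=3, p=0.5, rng=rng)
```

The augmented batch holds M·B hidden vectors, and the layers after the dropout point would then process it like any other batch.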

2. Experimental Results

2.1. Study of M on CIFAR

Impact of batch augmentation (ResNet44, Cifar10)
  • The above figure shows an improved validation convergence speed (in terms of epochs), with a significant reduction in final validation classification error.

This trend largely continues to improve as M is increased, consistent with the expectation.

  • In the experiment, a ResNet44 with Cutout is trained on Cifar10. 94.15% accuracy is achieved in only 23 epochs for ResNet44, whereas the baseline achieved 93.07% with over four times the number of iterations (100 epochs).
  • For AmoebaNet with M=12, 94.46% validation accuracy is reached after 14 epochs without any modification to the LR schedule.

2.2. SOTA Comparison

Validation accuracy (Top-1) for CIFAR and ImageNet models, test perplexity on Penn Treebank (PTB), and BLEU score on WMT.
  • Two baselines are compared: (1) “Fixed #Steps”, the original regime with the same number of training steps as BA, and (2) “Fixed #Samples”, where in addition the same number of samples as BA is observed (using a batch size of M·B).
  • BA using Dropout is applied to the language modeling and machine translation tasks: PTB and WMT En-De.
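
To make the difference between the two baselines and BA concrete, the bookkeeping below tabulates what each regime processes per step, under my reading of the descriptions above. The function name, the dictionary keys, and the example values B = 64, M = 4, and 1000 steps are illustrative assumptions, not figures from the paper.

```python
def regime_summary(B, M, steps):
    """Per-regime bookkeeping for the same number of training steps."""
    def row(batch_size, distinct):
        return {
            "batch_size": batch_size,                # instances per update
            "distinct_samples_per_step": distinct,   # unique dataset samples
            "steps": steps,
            "instances_processed": batch_size * steps,
        }
    return {
        "fixed_steps": row(B, B),             # original regime
        "fixed_samples": row(M * B, M * B),   # larger batch of distinct samples
        "batch_augmentation": row(M * B, B),  # B samples, M augmentations each
    }

summary = regime_summary(B=64, M=4, steps=1000)
for name, stats in summary.items():
    print(name, stats)
```

In this reading, “Fixed #Samples” and BA process the same number of instances per step, so any gap between them reflects the repeated, differently augmented instances rather than the larger batch itself.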

Performance is improved with the use of BA on CIFAR, ImageNet, PTB and WMT En-De.

Comparing against both “Fixed #Steps” and “Fixed #Samples” shows that augmenting the samples within the batch, not merely using a larger batch, is essential to the performance improvement.


