Reading: Cutout — Improved Regularization of Convolutional Neural Networks (Image Classification)

Regularizing the Input to Improve the Accuracy of ResNet, WRN, and Shake-Shake

Sik-Ho Tsang
4 min read · Oct 2, 2020
Cutout applied to images from the CIFAR-10 dataset.

In this story, Improved Regularization of Convolutional Neural Networks with Cutout (Cutout), by the University of Guelph, the Canadian Institute for Advanced Research, and the Vector Institute, is briefly presented.

  • Convolutional neural networks (CNNs) can learn powerful representations, but due to the model capacity required to capture such representations, they are often susceptible to overfitting and therefore require proper regularization in order to generalize well.
  • In this paper, a simple regularization technique called cutout is proposed: square regions of the input are randomly masked out during training. It can be used to improve the robustness and overall performance of CNNs.

This is a 2017 arXiv paper with over 500 citations. (Sik-Ho Tsang @ Medium)


  1. Motivation
  2. Differences from Dropout
  3. Experimental Results

1. Motivation

  • The main motivation for cutout comes from the problem of object occlusion, which is commonly encountered in many computer vision tasks, such as object recognition, tracking, or human pose estimation.
  • By generating new images which simulate occluded examples, we not only better prepare the model for encounters with occlusions in the real world, but the model also learns to take more of the image context into consideration when making decisions.

This technique encourages the network to better utilize the full context of the image, rather than relying on the presence of a small set of specific visual features.

  • This method, cutout, can be interpreted as applying a spatial prior to dropout in input space, much in the same way that convolutional neural networks leverage information about spatial structure in order to improve performance over that of feed-forward networks.

2. Differences from Dropout

  • Cutout applies noise in a similar fashion to dropout, but with two important distinctions:
  1. The first difference is that units are dropped out only at the input layer of a CNN, rather than in the intermediate feature layers.
  2. The second difference is that we drop out contiguous sections of inputs rather than individual pixels.
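To make the two distinctions concrete, here is a minimal NumPy sketch that contrasts cutout with pixel-wise dropout applied to the input. The function names `cutout` and `input_dropout` are hypothetical, and several details are assumptions not stated in this story: the patch is filled with zeros, its centre is sampled uniformly over the image, and it is clipped at the border (so the visible masked area can be smaller than the nominal patch).

```python
import numpy as np

def cutout(image, size, rng=None):
    """Zero out one square patch of an H x W x C image and return a copy."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Assumption: the patch centre is uniform over the image, and the patch
    # is clipped wherever it hangs over the border.
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()
    out[y0:y1, x0:x1] = 0  # one contiguous block, across all channels
    return out

def input_dropout(image, p, rng=None):
    """Pixel-wise dropout on the input, for contrast: scattered pixels are zeroed."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(image.shape[:2]) >= p  # independent decision per pixel
    return image * keep[..., None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = np.ones((32, 32, 3), dtype=np.float32)
    print("cutout zeros:", int((cutout(img, 16, rng)[..., 0] == 0).sum()))
    print("dropout zeros:", int((input_dropout(img, 0.25, rng)[..., 0] == 0).sum()))
```

Both functions act only on the input image, which matches the first distinction. The second distinction is visible in the masks themselves: `cutout` removes a single contiguous rectangle, while `input_dropout` removes isolated pixels whose neighbours usually still carry the same information.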

3. Experimental Results

3.1. CIFAR-10, CIFAR-100, SVHN

Cutout patch length with respect to validation accuracy with 95% confidence intervals (average of five runs).
  • The above figures depict the grid searches conducted on CIFAR-10 and CIFAR-100 respectively.
  • Based on these validation results we select a cutout size of 16×16 pixels to use on CIFAR-10 and a cutout size of 8×8 pixels for CIFAR-100 when training on the full datasets.
  • Similarly, for SVHN, to find the optimal size for the cutout region, a grid search is conducted using 10% of the training set for validation, and a cutout size of 20×20 pixels is ultimately selected.
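To get a feel for how aggressive these patch sizes are, the short sketch below computes the largest share of the image that a single patch can hide. It assumes the standard 32×32 resolution of CIFAR-10, CIFAR-100, and SVHN images; the helper name `max_masked_fraction` is hypothetical. If the patch is allowed to fall partly outside the image, the share actually hidden in a given sample can be smaller.

```python
def max_masked_fraction(patch, image_side):
    """Upper bound on the fraction of a square image hidden by one square patch."""
    return (patch * patch) / (image_side * image_side)

# (dataset, selected cutout size, assumed image side length in pixels)
selected = [("CIFAR-10", 16, 32), ("CIFAR-100", 8, 32), ("SVHN", 20, 32)]

if __name__ == "__main__":
    for name, patch, side in selected:
        print(f"{name}: up to {max_masked_fraction(patch, side):.4f} of the image")
```

Under these assumptions, the selected patch covers at most a quarter of a CIFAR-10 image, 6.25% of a CIFAR-100 image, and roughly 39% of an SVHN image.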
Test error rates (%) on CIFAR (C10, C100) and SVHN datasets
  • Cutout yields these performance improvements even when applied to complex models that already utilize batch normalization, dropout, and data augmentation.
  • Cutout improves ResNet, WRN, and Shake-Shake.
  • Adding cutout to the current state-of-the-art Shake-Shake regularization models improves performance by 0.3 and 0.6 percentage points on CIFAR-10 and CIFAR-100 respectively, yielding new state-of-the-art results of 2.56% and 15.20% test error.
  • On SVHN, adding cutout to WRN-16-8 gives an average reduction in test error of 0.3 percentage points, resulting in a new state-of-the-art performance of 1.30% test error.

3.2. STL-10

Test error rates on STL-10 dataset. “+” indicates standard data augmentation (mirror + crop). Results averaged over five runs on full training set.
  • While the main purpose of the STL-10 dataset is to test semi-supervised learning algorithms, the authors use it to observe how cutout performs when applied to higher resolution images in a low data setting.
  • For this reason, the unlabeled portion of the dataset is discarded and only the labeled training set is used.
  • A grid search is performed over the cutout size parameter using 10% of the training images as a validation set. A square size of 24×24 pixels is selected for the no data augmentation case, and 32×32 pixels for training STL-10 with data augmentation.
  • Training the model using these values yields a reduction in test error of 2.7 percentage points in the no data augmentation case, and 1.5 percentage points when also using data augmentation.

3.3. Analysis of Cutout’s Effect on Activations

Magnitude of feature activations, sorted by descending value, and averaged over all test samples
  • The shallow layers of the network experience a general increase in activation strength, while in deeper layers, we see more activations in the tail end of the distribution.
  • The latter observation illustrates that cutout is indeed encouraging the network to take into account a wider variety of features when making predictions, rather than relying on the presence of a smaller number of features.
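The curves described above can be reproduced in spirit with a few lines of NumPy. This is only a sketch of the "sort by descending value, then average over test samples" step from the figure caption. The function name `sorted_activation_profile` is hypothetical, and it assumes each layer's feature maps have already been flattened into one row per test sample, which is my assumption rather than a detail given in this story.

```python
import numpy as np

def sorted_activation_profile(activations):
    """Mean sorted activation magnitudes.

    activations: array of shape (num_samples, num_units), one row per test sample.
    Returns an array of shape (num_units,), largest average magnitude first.
    """
    mags = np.abs(np.asarray(activations, dtype=np.float64))
    # Sort each sample's magnitudes from largest to smallest...
    sorted_desc = np.sort(mags, axis=1)[:, ::-1]
    # ...then average rank by rank across all samples.
    return sorted_desc.mean(axis=0)

if __name__ == "__main__":
    acts = np.array([[1.0, -3.0, 2.0],
                     [0.0, 5.0, -1.0]])
    print(sorted_activation_profile(acts))
```

Comparing the profile of a baseline model with that of a cutout-trained model, layer by layer, is one way to see the effect: a profile with a heavier tail means more units are active on a typical sample.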


