[Paper] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (Image Classification)

Outperforms DropBlock, ShakeDrop, Cutout and mixup

Sik-Ho Tsang
Nov 22, 2020
CutMix: Patches are cut and pasted among training images

In this story, CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (CutMix), by NAVER Corp., LINE Plus Corp., and Yonsei University, is briefly presented. In this paper:

  • Patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches.
  • By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms the state-of-the-art augmentation strategies.

This is a paper in 2019 ICCV with over 190 citations.


  1. CutMix
  2. Comparison with Cutout and mixup
  3. Experimental Results

1. CutMix

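  • The goal of CutMix is to generate a new training sample (x̃, ỹ) by combining two training samples (xA, yA) and (xB, yB). The generated training sample (x̃, ỹ) is used to train the model with its original loss function. The combining operation is:

      \tilde{x} = M \odot x_A + (\mathbf{1} - M) \odot x_B
      \tilde{y} = \lambda y_A + (1 - \lambda) y_B

  • where M ∈ {0,1}^{W×H} denotes a binary mask indicating where to drop out and fill in from two images, 1 is a binary mask filled with ones, and ⊙ is element-wise multiplication.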
  • The combination ratio λ between two data points is sampled from the beta distribution Beta(α, α). In all experiments, α is set to 1, that is, λ is sampled from the uniform distribution on (0, 1).
  • CutMix is similar to mixup, which also mixes two samples and their labels. The major difference is that CutMix replaces an image region with a patch from another training image and so generates more locally natural images than mixup does.
  • To sample the binary mask M, we first sample the bounding box coordinates B = (rx, ry, rw, rh) indicating the cropping regions on xA and xB.
  • The region B in xA is removed and filled in with the patch cropped from B of xB.
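  • The box coordinates are uniformly sampled according to:

      r_x \sim \mathrm{Unif}(0, W), \quad r_w = W \sqrt{1 - \lambda}
      r_y \sim \mathrm{Unif}(0, H), \quad r_h = H \sqrt{1 - \lambda}

  • making the cropped area ratio:

      \frac{r_w r_h}{W H} = 1 - \lambda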
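The steps in this section can be put together in a few lines of NumPy. This is only a minimal sketch under my own assumptions, not the authors' implementation: the function names `rand_bbox` and `cutmix` are hypothetical, (rx, ry) is treated as the box center, the box is clipped to the image, and λ is recomputed from the area that was actually pasted so that the label weights match the pixels.

```python
import numpy as np


def rand_bbox(width, height, lam, rng):
    """Sample a box covering roughly (1 - lam) of the image area."""
    cut_w = int(width * np.sqrt(1.0 - lam))
    cut_h = int(height * np.sqrt(1.0 - lam))
    # Assumption of this sketch: (cx, cy) is the box center and the box
    # is clipped at the image border, so it can end up smaller.
    cx = int(rng.integers(0, width))
    cy = int(rng.integers(0, height))
    x1 = int(np.clip(cx - cut_w // 2, 0, width))
    x2 = int(np.clip(cx + cut_w // 2, 0, width))
    y1 = int(np.clip(cy - cut_h // 2, 0, height))
    y2 = int(np.clip(cy + cut_h // 2, 0, height))
    return x1, y1, x2, y2


def cutmix(x_a, y_a, x_b, y_b, alpha=1.0, rng=None):
    """Mix two (H, W, C) images and their one-hot labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # combination ratio from Beta(alpha, alpha)
    height, width = x_a.shape[:2]
    x1, y1, x2, y2 = rand_bbox(width, height, lam, rng)

    # Equivalent to M * x_a + (1 - M) * x_b with M = 0 inside the box.
    mixed = x_a.copy()
    mixed[y1:y2, x1:x2] = x_b[y1:y2, x1:x2]

    # Recompute lam from the pasted area (clipping may have shrunk the box).
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (width * height)
    label = lam * y_a + (1.0 - lam) * y_b
    return mixed, label, lam
```

Since cross-entropy is linear in the target, training on the mixed label is the same as weighting the two per-class losses by λ and 1 − λ, so the original loss function can be kept unchanged.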

2. Comparison with Cutout and mixup

Comparison among mixup, Cutout, and CutMix
Class activation mapping (CAM)
  • CutMix is indeed learning to recognize two objects from their respective partial views.
  • Cutout successfully lets a model focus on less discriminative parts of the object, while being inefficient due to unused pixels.
  • mixup, on the other hand, makes full use of pixels, but introduces unnatural artifacts.
  • CutMix efficiently improves upon Cutout by being able to localize the two object classes accurately.
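To make the comparison concrete, the three augmentations can be written side by side. This is a rough sketch under my own assumptions rather than any reference code: the function names are hypothetical, the box is passed in as fixed pixel coordinates instead of being sampled, and Cutout fills the region with zeros.

```python
import numpy as np


def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend every pixel and both labels with the same weight lam."""
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b


def cutout(x_a, y_a, box):
    """Zero out a box; the label is left unchanged."""
    x1, y1, x2, y2 = box
    out = x_a.copy()
    out[y1:y2, x1:x2] = 0.0  # these pixels carry no information
    return out, y_a


def cutmix_with_box(x_a, y_a, x_b, y_b, box):
    """Fill the box with the same region of x_b and mix labels by area."""
    x1, y1, x2, y2 = box
    out = x_a.copy()
    out[y1:y2, x1:x2] = x_b[y1:y2, x1:x2]
    height, width = x_a.shape[:2]
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (width * height)
    return out, lam * y_a + (1.0 - lam) * y_b
```

With the same box, Cutout and CutMix differ only in what fills the region, while mixup changes every pixel of the image, which is where its unnatural blended look comes from.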

3. Experimental Results

3.1. Ablation Study

  • Left: CutMix with α ∈ {0.1, 0.25, 0.5, 1.0, 2.0, 4.0} is evaluated. The best performance is achieved when α = 1.0.
  • Right: Feature-level CutMix is tested. 0=image level, 1=after first conv-bn, 2=after layer1, 3=after layer2, 4=after layer3. CutMix achieves the best performance when it is applied on the input images.

3.2. ImageNet

ImageNet classification results based on ResNet-50
  • CutMix achieves the best result, 21.40% top-1 error, among the considered augmentation strategies.
  • CutMix outperforms Cutout and mixup, the two closest approaches to CutMix, by 1.53 and 1.18 percentage points in top-1 error, respectively.
  • On the feature level as well, CutMix is found to be preferable to mixup, with top-1 errors of 21.78% and 22.50%, respectively.
ImageNet classification results based on ResNet-101 and ResNeXt-101
  • For deeper models, CutMix improves the top-1 error by 1.60 and 1.71 percentage points on ResNet-101 and ResNeXt-101, respectively.

3.3. CIFAR

Error Rates on CIFAR-100
  • PyramidNet-200 is used as baseline.
  • Neither Cutout nor label smoothing (from Inception-v3) improves the accuracy when adopted independently, but they are effective when used together.
  • DropBlock, the feature-level generalization of Cutout, is also more effective when label smoothing is used.
  • mixup and Manifold Mixup achieve higher accuracies when Cutout is applied on input images.
  • CutMix achieves 14.47% top-1 classification error on CIFAR-100, 1.98 percentage points lower than the baseline error of 16.45%. A new state-of-the-art error of 13.81% is achieved by combining CutMix and ShakeDrop.
Impact of CutMix on lighter architectures on CIFAR-100.
  • CutMix also significantly improves the performance of the weaker baseline architectures, such as PyramidNet-110 and ResNet-110.
Impact of CutMix on CIFAR-10.
  • On CIFAR-10, CutMix also improves the classification performance by 0.97 percentage points, outperforming mixup and Cutout.

There are also experiments for weakly supervised object localization, Pascal VOC object detection, MS-COCO image captioning, robustness and uncertainty, as well as occlusions. If interested, please feel free to read the paper.


