[Paper] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (Image Classification)

Outperforms DropBlock, ShakeDrop, Cutout and mixup

Nov 22, 2020

**CutMix: Patches are cut and pasted among training images**

In this story, CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (CutMix), by NAVER Corp., LINE Plus Corp., and Yonsei University, is briefly presented. In this paper:

Patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches.
By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms the state-of-the-art augmentation strategies.

This is a 2019 ICCV paper with over 190 citations. (Sik-Ho Tsang @ Medium)

Outline

1. CutMix
2. Comparison with Cutout and mixup
3. Experimental Results

1. CutMix

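The goal of CutMix is to generate a new training sample $(\tilde{x}, \tilde{y})$ by combining two training samples $(x_A, y_A)$ and $(x_B, y_B)$. The generated training sample $(\tilde{x}, \tilde{y})$ is used to train the model with its original loss function. The combining operation is:

$$\tilde{x} = M \odot x_A + (\mathbf{1} - M) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B$$

where $M \in \{0,1\}^{W \times H}$ denotes a binary mask indicating where to drop out and fill in from two images, $\mathbf{1}$ is a mask filled with ones, and $\odot$ is element-wise multiplication.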
The combination ratio λ between two data points is sampled from the beta distribution Beta(α, α). In all experiments, α is set to 1, so λ is sampled from the uniform distribution on (0, 1).
The major difference from mixup is that CutMix replaces an image region with a patch from another training image, which generates more locally natural images than mixup does.
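To sample the binary mask $M$, the bounding box coordinates $B = (r_x, r_y, r_w, r_h)$, indicating the cropping regions on $x_A$ and $x_B$, are sampled first.
The region $B$ in $x_A$ is removed and filled in with the patch cropped from $B$ of $x_B$.
The box coordinates are uniformly sampled according to:

$$r_x \sim \text{Unif}(0, W), \quad r_w = W\sqrt{1 - \lambda}, \qquad r_y \sim \text{Unif}(0, H), \quad r_h = H\sqrt{1 - \lambda}$$

making the cropped area ratio:

$$\frac{r_w r_h}{W H} = 1 - \lambda$$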
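The sampling steps above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' code: the names `sample_box` and `cutmix_batch` are hypothetical, and several choices are assumptions of this sketch, namely treating (r_x, r_y) as the box center, clipping the box to the image, recomputing λ from the clipped area, using one box per batch, and pairing samples with a random permutation.

```python
import numpy as np


def sample_box(height, width, lam, rng):
    """Sample a box (x1, y1, x2, y2) covering at most (1 - lam) of the image."""
    cut_ratio = np.sqrt(1.0 - lam)  # r_w / W = r_h / H = sqrt(1 - lam)
    box_w = int(width * cut_ratio)
    box_h = int(height * cut_ratio)
    # Assumption: (r_x, r_y) is the box center and the box is clipped to the
    # image, so the realised area can be smaller than (1 - lam).
    cx = int(rng.integers(0, width))
    cy = int(rng.integers(0, height))
    x1 = max(cx - box_w // 2, 0)
    x2 = min(cx + box_w // 2, width)
    y1 = max(cy - box_h // 2, 0)
    y2 = min(cy + box_h // 2, height)
    return x1, y1, x2, y2


def cutmix_batch(images, labels, num_classes, alpha=1.0, rng=None):
    """Apply CutMix to a batch of shape (N, H, W, C) with integer labels.

    Returns the mixed images, soft labels of shape (N, num_classes),
    and the area-adjusted mixing ratio lam.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, height, width = images.shape[:3]
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(n)  # partner sample x_B for each x_A
    x1, y1, x2, y2 = sample_box(height, width, lam, rng)

    # Paste the box region of each partner image into a copy of the batch.
    mixed = images.copy()
    mixed[:, y1:y2, x1:x2] = images[perm, y1:y2, x1:x2]

    # Recompute lam from the pasted area so labels match the pixels exactly.
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (height * width)
    one_hot = np.eye(num_classes)[labels]
    soft_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed, soft_labels, lam
```

Because cross-entropy is linear in the target, training on the soft labels is equivalent to weighting the two per-class losses by λ and 1 − λ, so the original loss function can be kept unchanged.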

2. Comparison with Cutout and mixup

**Comparison among** **mixup**, **Cutout, and CutMix**

CutMix is indeed learning to recognize two objects from their respective partial views.
Cutout successfully lets a model focus on less discriminative parts of the object, while being inefficient due to unused pixels.
mixup, on the other hand, makes full use of pixels, but introduces unnatural artifacts.
CutMix efficiently improves upon Cutout by being able to localize the two object classes accurately.
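The contrast between the three strategies can be made concrete with a toy sketch on a single pair of images. The function names and the fixed box are hypothetical, and filling the Cutout region with zeros is an assumption of this sketch; the point is only to show which pixels and labels each method keeps.

```python
import numpy as np


def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend every pixel of both images; labels are mixed with the same weight."""
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b


def cutout(x_a, y_a, box):
    """Zero out a box in x_a; the label is left unchanged."""
    x1, y1, x2, y2 = box
    out = x_a.copy()
    out[y1:y2, x1:x2] = 0.0
    return out, y_a


def cutmix(x_a, y_a, x_b, y_b, box):
    """Fill the box in x_a with the same region of x_b; mix labels by area."""
    x1, y1, x2, y2 = box
    out = x_a.copy()
    out[y1:y2, x1:x2] = x_b[y1:y2, x1:x2]
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (x_a.shape[0] * x_a.shape[1])
    return out, lam * y_a + (1.0 - lam) * y_b


# Toy pair: image A is all ones (class 0), image B is all twos (class 1).
x_a, y_a = np.ones((8, 8, 3)), np.array([1.0, 0.0])
x_b, y_b = np.full((8, 8, 3), 2.0), np.array([0.0, 1.0])
box = (0, 0, 4, 4)  # a 4x4 box covers 25% of an 8x8 image

mixup_x, mixup_y = mixup(x_a, y_a, x_b, y_b, lam=0.75)
cutout_x, cutout_y = cutout(x_a, y_a, box)
cutmix_x, cutmix_y = cutmix(x_a, y_a, x_b, y_b, box)

print("mixup label: ", mixup_y)
print("Cutout label:", cutout_y)
print("CutMix label:", cutmix_y)
```

In this toy case mixup and CutMix end up with the same mixed label, but mixup's pixels are all blended values while every CutMix pixel is an unaltered pixel from one of the two images. Cutout keeps the original label and leaves 25% of the pixels carrying no information.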

3. Experimental Results

3.1. Ablation Study

Left: CutMix is evaluated with α ∈ {0.1, 0.25, 0.5, 1.0, 2.0, 4.0}. The best performance is achieved when α = 1.0.
Right: Feature-level CutMix is tested, where 0 = image level, 1 = after the first conv-bn, 2 = after layer1, 3 = after layer2, and 4 = after layer3. CutMix achieves the best performance when it is applied to the input images.
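To get a feel for what the α values in the ablation mean, the following sketch samples λ from Beta(α, α) for each tested α and summarizes the draws. It does not reproduce any result from the paper; the sample size, the 0.1 and 0.9 thresholds, and the `summary` dictionary are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = [0.1, 0.25, 0.5, 1.0, 2.0, 4.0]

summary = {}
for alpha in alphas:
    lam = rng.beta(alpha, alpha, size=100_000)
    # Fraction of draws where one image keeps more than 90% of the area.
    near_extreme = np.mean((lam < 0.1) | (lam > 0.9))
    summary[alpha] = (lam.mean(), lam.var(), near_extreme)
    print(f"alpha={alpha:<4}  mean={lam.mean():.3f}  var={lam.var():.3f}  "
          f"near 0 or 1: {near_extreme:.1%}")
```

The mean stays at 0.5 for every α, while the variance follows 1 / (4(2α + 1)). Small α concentrates λ near 0 or 1, so most samples are barely mixed; large α concentrates λ near 0.5; α = 1 gives the uniform distribution used in the paper's experiments.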

3.2. ImageNet

**ImageNet classification results based on** **ResNet-50**

CutMix achieves the best result, 21.40% top-1 error, among the considered augmentation strategies.
CutMix outperforms Cutout and mixup, the two closest approaches, by 1.53 and 1.18 percentage points in top-1 error, respectively.
At the feature level as well, CutMix is found to be preferable to mixup, with top-1 errors of 21.78% and 22.50%, respectively.

**ImageNet classification results based on** **ResNet-101 and** **ResNeXt-101**

For deeper models, CutMix improves the top-1 error by 1.60 and 1.71 percentage points on ResNet-101 and ResNeXt-101, respectively.

3.3. CIFAR

PyramidNet-200 is used as the baseline.
Neither Cutout nor label smoothing (from Inception-v3) improves the accuracy when adopted independently, but they are effective when used together.
DropBlock, the feature-level generalization of Cutout, is also more effective when label smoothing is used.
mixup and Manifold Mixup achieve higher accuracies when Cutout is applied on input images.
CutMix achieves 14.47% top-1 classification error on CIFAR-100, 1.98 percentage points lower than the baseline error of 16.45%. Combining CutMix and ShakeDrop achieves a new state-of-the-art top-1 error of 13.81%.

**Impact of CutMix on lighter architectures on CIFAR-100.**

CutMix also significantly improves the performance of the weaker baseline architectures, such as PyramidNet-110 and ResNet-110.

On CIFAR-10, CutMix also improves the classification performance by +0.97%, outperforming mixup and Cutout.

There are also experiments for weakly supervised object localization, Pascal VOC object detection, MS-COCO image captioning, robustness and uncertainty, as well as occlusions. If interested, please feel free to read the paper.

Reference

[2019 ICCV] [CutMix]
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
