[Paper] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (Image Classification)
In this story, CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (CutMix), by NAVER Corp., LINE Plus Corp., and Yonsei University, is briefly presented. In this paper:
- Patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches.
- By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms the state-of-the-art augmentation strategies.
This is a 2019 ICCV paper with over 190 citations. (Sik-Ho Tsang @ Medium)
1. CutMix
- The goal of CutMix is to generate a new training sample (x̃, ỹ) by combining two training samples (xA, yA) and (xB, yB). The generated training sample (x̃, ỹ) is used to train the model with its original loss function:
x̃ = M ⊙ xA + (1 − M) ⊙ xB
ỹ = λyA + (1 − λ)yB
- where M ∈ {0,1}^(W×H) denotes a binary mask indicating where to drop out and fill in from the two images, 1 is a binary mask filled with ones, and ⊙ is element-wise multiplication.
- The combination ratio λ between the two data points is sampled from the beta distribution Beta(α, α). In all experiments, α is set to 1, so λ is sampled from the uniform distribution (0, 1).
- The major difference from mixup is that CutMix replaces an image region with a patch from another training image, generating a more locally natural image than mixup does.
- To sample the binary mask M, we first sample the bounding box coordinates B = (rx, ry, rw, rh) indicating the cropping regions on xA and xB.
- The region B in xA is removed and filled in with the patch cropped from B of xB.
- The box coordinates are uniformly sampled according to:
rx ~ Unif(0, W), rw = W√(1 − λ)
ry ~ Unif(0, H), rh = H√(1 − λ)
- making the cropped area ratio rw·rh / (W·H) = 1 − λ.
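Below is a minimal PyTorch sketch of this sampling scheme, assuming an input batch x of shape (N, C, H, W) and integer class labels y; the helper names rand_bbox and cutmix are illustrative, not the official implementation:

```python
import numpy as np
import torch

def rand_bbox(W, H, lam):
    # Patch side lengths are W*sqrt(1 - lam) and H*sqrt(1 - lam),
    # so the cropped area ratio (r_w * r_h) / (W * H) equals 1 - lam.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(W * cut_ratio), int(H * cut_ratio)
    # The box center (r_x, r_y) is sampled uniformly over the image.
    cx, cy = np.random.randint(W), np.random.randint(H)
    bbx1, bbx2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    bby1, bby2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    return bbx1, bby1, bbx2, bby2

def cutmix(x, y, alpha=1.0):
    # lambda ~ Beta(alpha, alpha); with alpha = 1 this is Uniform(0, 1).
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0))  # shuffled partners play the role of (x_B, y_B)
    bbx1, bby1, bbx2, bby2 = rand_bbox(x.size(3), x.size(2), lam)
    # Remove region B from x_A and fill it with the patch cropped from x_B.
    x[:, :, bby1:bby2, bbx1:bbx2] = x[index, :, bby1:bby2, bbx1:bbx2]
    # Adjust lambda to the exact area ratio after clipping at image borders.
    lam = 1.0 - (bbx2 - bbx1) * (bby2 - bby1) / (x.size(2) * x.size(3))
    return x, y, y[index], lam
```

Note that the patch is clipped at the image borders, so λ is recomputed from the actual pasted area before it is used to mix the labels.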
2. Comparison with Cutout and mixup
- CutMix is indeed learning to recognize two objects from their respective partial views.
- Cutout successfully lets a model focus on less discriminative parts of the object, while being inefficient due to unused pixels.
- mixup, on the other hand, makes full use of pixels, but introduces unnatural artifacts.
- CutMix efficiently improves upon Cutout by being able to localize the two object classes accurately.
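Since ỹ is a convex combination of the two one-hot labels, training with CutMix reduces to mixing two cross-entropy terms. A hedged sketch of the training loop, assuming the cutmix helper above plus a standard model, optimizer, and data loader:

```python
import torch.nn.functional as F

for images, targets in loader:
    images, y_a, y_b, lam = cutmix(images, targets, alpha=1.0)
    logits = model(images)
    # Cross-entropy against the mixed label y~ = lam * y_A + (1 - lam) * y_B
    # decomposes into a weighted sum of the two per-label losses.
    loss = lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```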
3. Experimental Results
3.1. Ablation Study
- α ablation: CutMix with α ∈ {0.1, 0.25, 0.5, 1.0, 2.0, 4.0} is evaluated. The best performance is achieved when α = 1.0.
- Feature-level CutMix is also tested (0 = image level, 1 = after the first conv-bn, 2 = after layer1, 3 = after layer2, 4 = after layer3). CutMix achieves the best performance when it is applied to the input images.
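To see what this α ablation varies: α < 1 pushes λ toward 0 or 1 (one image dominates the mix), α = 1 gives the uniform distribution, and α > 1 concentrates λ around 0.5. A quick NumPy illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 0.25, 0.5, 1.0, 2.0, 4.0):
    lam = rng.beta(alpha, alpha, size=100_000)
    # Fraction of draws with a near-even mix (0.25 < lambda < 0.75).
    even = np.mean((lam > 0.25) & (lam < 0.75))
    print(f"alpha={alpha}: mean={lam.mean():.2f}, P(even mix)={even:.2f}")
```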
3.2. ImageNet
- CutMix achieves the best result, 21.40% top-1 error, among the considered augmentation strategies.
- CutMix outperforms Cutout and mixup, the two closest approaches, by +1.53% and +1.18%, respectively.
- At the feature level as well, CutMix is found to be preferable to mixup, with top-1 errors of 21.78% and 22.50%, respectively.
- For deeper models, CutMix brings respective improvements of +1.60% and +1.71% in top-1 error on ResNet-101 and ResNeXt-101.
3.3. CIFAR
- PyramidNet-200 is used as baseline.
- Neither Cutout nor label smoothing (from Inception-v3) improves the accuracy when adopted independently, but they are effective when used together.
- DropBlock, the feature-level generalization of Cutout, is also more effective when label smoothing is used.
- mixup and Manifold Mixup achieve higher accuracies when Cutout is applied to the input images.
- CutMix achieves 14.47% top-1 classification error on CIFAR-100, a +1.98% improvement over the baseline performance of 16.45%. Combining CutMix with ShakeDrop yields a new state-of-the-art performance of 13.81%.
- CutMix also significantly improves the performance of the weaker baseline architectures, such as PyramidNet-110 and ResNet-110.
There are also experiments for weakly supervised object localization, Pascal VOC object detection, MS-COCO image captioning, robustness and uncertainty, as well as occlusions. If interested, please feel free to read the paper.
Reference
[2019 ICCV] [CutMix]
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Image Classification
1989–1998: [LeNet]
2012–2014: [AlexNet & CaffeNet] [Maxout] [Dropout] [NIN] [ZFNet] [SPPNet]
2015: [VGGNet] [Highway] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2]
2016: [SqueezeNet] [Inception-v3] [ResNet] [Pre-Activation ResNet] [RiR] [Stochastic Depth] [WRN] [Trimps-Soushen]
2017: [Inception-v4] [Xception] [MobileNetV1] [Shake-Shake] [Cutout] [FractalNet] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [IGCNet / IGCV1] [Deep Roots]
2018: [RoR] [DMRNet / DFN-MR] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [ENAS] [PNASNet] [ShuffleNet V2] [BAM] [CBAM] [MorphNet] [NetAdapt] [mixup] [DropBlock]
2019: [ResNet-38] [AmoebaNet] [ESPNetv2] [MnasNet] [Single-Path NAS] [DARTS] [ProxylessNAS] [MobileNetV3] [FBNet] [ShakeDrop] [CutMix]