Brief Review — When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
ViT-SAM & Mixer-SAM, by Google Research and UCLA
2022 ICLR, Over 330 Citations (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2023 [Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] 2024 [FasterViT]
==== My Other Paper Readings Are Also Over Here ====
- By promoting smoothness with a recently proposed sharpness-aware optimizer (SAM), the accuracy and robustness of ViTs and MLP-Mixers are substantially improved on various tasks spanning supervised, adversarial, contrastive, and transfer learning.
1. ViT-SAM & Mixer-SAM
- (Please read Loss Landscape and SAM first.)
1.1. Sharp Local Minima
- As discovered in Loss Landscape, a network that converges to a smooth (flat) local minimum is more robust than one that converges to a sharp local minimum.
- However, it is found that ViTs and MLP-Mixers converge at sharp local minima, as shown in Figure 1.
1.2. SAM
- Intuitively, SAM seeks parameters w whose entire neighbourhood has low training loss Ltrain, by formulating a minimax objective:
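- (The equation image is not reproduced here; for reference, SAM's minimax objective, with ρ denoting the radius of the neighbourhood around w, can be written as:)

\[
\min_{w} \;\; \max_{\|\epsilon\|_2 \le \rho} \; L_{\text{train}}(w + \epsilon)
\]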
- An efficient first-order approximation is used to solve the inner maximization:
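- (Again for reference, the first-order approximation gives the perturbation:)

\[
\hat{\epsilon}(w) \;=\; \rho \, \frac{\nabla_w L_{\text{train}}(w)}{\|\nabla_w L_{\text{train}}(w)\|_2},
\]

after which w is updated using the gradient evaluated at the perturbed point w + \hat{\epsilon}(w).

- Below is a minimal sketch of this two-step update, assuming a flat parameter vector w and a scalar-valued loss_fn; the function name sam_step and the default values of lr and rho are illustrative, not the authors' implementation:

```python
import jax
import jax.numpy as jnp

def sam_step(w, loss_fn, lr=0.1, rho=0.05):
    """One SAM update: perturb along the gradient direction, then descend."""
    # Step 1: gradient at the current weights, used to build the
    # worst-case perturbation eps_hat = rho * grad / ||grad||_2.
    grad = jax.grad(loss_fn)(w)
    eps_hat = rho * grad / (jnp.linalg.norm(grad) + 1e-12)
    # Step 2: gradient at the perturbed point w + eps_hat, applied to
    # update the original (unperturbed) weights.
    grad_at_perturbed = jax.grad(loss_fn)(w + eps_hat)
    return w - lr * grad_at_perturbed
```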
2. Results
- As shown in Tables 1 and 2, the accuracies of ViT-B/16 and Mixer-B/16 increase by 9.9% and 15.0% (21.2% and 44.4% relative improvements, respectively) after SAM smooths their converged local regions.
- In comparison, SAM improves the accuracy of ResNet-152 by only 2.2% (a 4.4% relative improvement).
- The gaps between ViTs and ResNets are even wider for small architectures. ViT-S/16 outperforms a similarly sized ResNet-50 by 1.4% on ImageNet, and 6.5% on ImageNet-C. SAM also significantly improves MLP-Mixers’ results.
- (Please feel free to read the paper directly for more detailed experimental results.)