Brief Review — When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

ViT-SAM & Mixer-SAM: promoting smoothness in ViTs and MLP-Mixers using SAM

Sik-Ho Tsang
3 min read · Sep 10, 2024
Loss Landscape: ViT vs ViT-SAM (ViT Using SAM)

When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
ViT-SAM & Mixer-SAM, by Google Research and UCLA
2022 ICLR, Over 330 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] 2024 [FasterViT]
==== My Other Paper Readings Are Also Over Here ====

  • By promoting smoothness with a recently proposed sharpness-aware optimizer, the accuracy and robustness of ViTs and MLP-Mixers are substantially improved on various tasks spanning supervised, adversarial, contrastive, and transfer learning.

Outline

  1. ViT-SAM & Mixer-SAM
  2. Results

1. ViT-SAM & Mixer-SAM

1.1. Sharp Local Minima

  • As found in the Loss Landscape work, a network that converges to a smooth (flat) local minimum is more robust than one that converges to a sharp local minimum.

However, it is found that ViTs and MLP-Mixers converge to sharp local minima, as shown in Figure 1 (a simple probe of this sharpness is sketched after this list).

  • There also exists a large gap between ViTs and ResNets in robustness tests.
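To make the notion of sharpness concrete, one simple probe (a simplified, single-direction variant of the loss-landscape visualizations; the function below is a hypothetical helper, not the paper's method) is to evaluate the training loss along a random direction in weight space and observe how quickly it rises:

```python
import numpy as np

def loss_along_direction(loss_fn, w, num_points=21, radius=1.0, seed=0):
    """Evaluate loss_fn(w + alpha * d) along one random direction d.

    A sharp minimum shows the loss rising quickly as |alpha| grows,
    while a flat (smooth) minimum stays low over a wider range.
    """
    rng = np.random.default_rng(seed)
    d = rng.normal(size=w.shape)
    d *= np.linalg.norm(w) / (np.linalg.norm(d) + 1e-12)  # match the weight scale
    alphas = np.linspace(-radius, radius, num_points)
    return alphas, np.array([loss_fn(w + a * d) for a in alphas])
```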

1.2. SAM

Intuitively, SAM seeks parameters w whose entire neighborhood has low training loss L_train, by formulating a minimax objective:
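In the paper's notation, with ρ the radius of the neighborhood (a hyper-parameter) and ε the weight perturbation, the objective can be written as:

```latex
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L_{\text{train}}(w + \epsilon)
```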

  • An efficient first-order approximation is used to solve the inner maximization (see the formula and the training-step sketch after this list).
  • After applying SAM, as shown in Figure 1 (d) and (e) above, ViT-SAM and Mixer-SAM converge to visibly smoother local minima.
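Concretely, the inner maximization is approximated by a single scaled gradient step, and the weights are then updated with the gradient evaluated at the perturbed point (η is the learning rate):

```latex
\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L_{\text{train}}(w)}{\lVert \nabla_w L_{\text{train}}(w) \rVert_2},
\qquad
w \leftarrow w - \eta \, \nabla_w L_{\text{train}}(w) \big|_{w + \hat{\epsilon}(w)}
```

The following is a minimal PyTorch-style sketch of one such update, illustrating the two forward-backward passes per step; it is not the authors' implementation, and sam_step, loss_fn, and base_optimizer are placeholder names:

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One SAM update (illustrative sketch; rho is the neighborhood radius)."""
    # 1) First forward/backward pass: gradient g at the current weights w.
    loss = loss_fn(model(x), y)
    loss.backward()

    # 2) Perturb the weights to the approximate worst-case point
    #    w + eps_hat, where eps_hat = rho * g / ||g||_2.
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(2) for p in params]), 2)
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 3) Second forward/backward pass: gradient at w + eps_hat.
    loss_fn(model(x), y).backward()

    # 4) Restore the original weights, then let the base optimizer
    #    update them using the gradient taken at the perturbed point.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

Note that each SAM step costs roughly two forward-backward passes, which is the main overhead compared with standard training.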

2. Results

Performance When Using SAM

As shown in Tables 1 and 2, the accuracies of ViT-B/16 and Mixer-B/16 increase by 9.9% and 15.0%, respectively (21.2% and 44.4% relative improvements), after SAM smooths the regions around their converged minima.

  • In comparison, SAM improves the accuracy of ResNet-152 by only 2.2% (a 4.4% relative improvement).
  • The gaps between ViTs and ResNets are even wider for small architectures. ViT-S/16 outperforms a similarly sized ResNet-50 by 1.4% on ImageNet and 6.5% on ImageNet-C. SAM also significantly improves MLP-Mixers' results.
  • (Please feel free to read the paper directly for more detailed experimental results.)

--

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.