Brief Review — When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

ViT-SAM & Mixer-SAM, by promoting smoothness using SAM on ViTs and MLP-Mixers

Sik-Ho Tsang
3 min read · Sep 10, 2024
Figure 1: Loss landscape of ViT vs. ViT-SAM (ViT trained using SAM)

When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
ViT-SAM & Mixer-SAM, by Google Research and UCLA
2022 ICLR, Over 330 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] 2024 [FasterViT]
==== My Other Paper Readings Are Also Over Here ====

  • By promoting smoothness with a recently proposed sharpness-aware optimizer (SAM), the accuracy and robustness of ViTs and MLP-Mixers are substantially improved on various tasks spanning supervised, adversarial, contrastive, and transfer learning.

Outline

  1. ViT-SAM & Mixer-SAM
  2. Results

1. ViT-SAM & Mixer-SAM

1.1. Sharp Local Minima

  • As discovered in Loss Landscape, a network that converges to a smooth (flat) local minimum is more robust than one that converges to a sharp local minimum.

However, it is found that ViTs and MLP-Mixers converge to sharp local minima, as shown in Figure 1.

  • There also exists a large gap between ViTs and ResNets in robustness tests.

1.2. SAM

Intuitively, SAM seeks parameters w whose entire neighbourhood has low training loss Ltrain, by formulating a minimax objective:
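In symbols, with ρ denoting the radius of the neighbourhood (a hyperparameter), the SAM objective from the original SAM paper (Foret et al., 2021) can be written as:

```latex
\min_{w} \; L^{\mathrm{SAM}}_{\mathrm{train}}(w),
\qquad
L^{\mathrm{SAM}}_{\mathrm{train}}(w) \;=\; \max_{\lVert \epsilon \rVert_2 \,\le\, \rho} \; L_{\mathrm{train}}(w + \epsilon)
```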

  • An efficient first-order approximation is used to solve the inner maximization (see the sketch after this list).
  • After using SAM, as shown in Figure 1 (d) and (e) above, ViT-SAM and Mixer-SAM converge to visibly smoother local minima.
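Concretely, the inner maximization is approximated by a single gradient ascent step, giving the two-step SAM update (η is the learning rate):

```latex
\hat{\epsilon}(w) \;=\; \rho \, \frac{\nabla_{w} L_{\mathrm{train}}(w)}{\lVert \nabla_{w} L_{\mathrm{train}}(w) \rVert_2},
\qquad
w \;\leftarrow\; w \;-\; \eta \, \nabla_{w} L_{\mathrm{train}}\bigl(w + \hat{\epsilon}(w)\bigr)
```

Below is a minimal PyTorch-style sketch of one SAM update step, just to illustrate this two-step procedure; it is not the authors' implementation, and `model`, `loss_fn`, `inputs`, `targets`, `base_optimizer`, and `rho` are assumed placeholder names, not names from the paper.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    """One SAM update: perturb the weights toward higher loss,
    then descend using the gradient taken at the perturbed point."""
    # 1) Gradient at the current weights w.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # 2) epsilon = rho * g / ||g||_2, applied in place: w -> w + epsilon.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            eps = p.grad * scale
            p.add_(eps)
            perturbations.append((p, eps))
    model.zero_grad()

    # 3) Gradient at the perturbed weights w + epsilon.
    loss_fn(model(inputs), targets).backward()

    # 4) Restore w, then step the base optimizer (e.g. SGD) with the
    #    gradient computed at w + epsilon.
    with torch.no_grad():
        for p, eps in perturbations:
            p.sub_(eps)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

Each SAM step therefore costs roughly two forward-backward passes, which is the main overhead compared with standard training.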

2. Results

Performance When Using SAM

As shown in Tables 1 and 2, the accuracies of ViT-B/16 and Mixer-B/16 increase by 9.9% and 15.0%, respectively (21.2% and 44.4% relative improvements), after SAM smooths their converged local regions.

  • In comparison, SAM improves the accuracy of ResNet-152 by only 2.2% (a 4.4% relative improvement).
  • The gaps between ViTs and ResNets are even wider for small architectures: ViT-S/16 outperforms the similarly sized ResNet-50 by 1.4% on ImageNet and by 6.5% on ImageNet-C. SAM also significantly improves MLP-Mixers’ results.
  • (Please feel free to read the paper directly for more detailed experimental results.)


Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.