Brief Review — Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

A Soup of Models: Improved Accuracy Without Increased Inference Time

Sik-Ho Tsang
3 min read · Aug 15, 2024
Model Soups, SOTA in 2022, e.g., outperforming ViT-G (figure from paperswithcode)

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
Model Soups, by University of Washington, Columbia University, Google Research, Brain Team, Meta AI Research, and Tel Aviv University
2022 ICML, Over 650 Citations (Sik-Ho Tsang @ Medium)

  • The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder.
  • In this paper, the second step is revisited in the context of fine-tuning large pre-trained models, where the fine-tuned models often appear to lie in a single low-error basin. It is shown that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness, and, because the result is a single model, inference cost does not increase.

Outline

  1. Model Soup
  2. Results

1. Model Soup

Left: Model Soup Variants, Right: GreedySoup
  • There are three recipes for model souping: the uniform, greedy, and learned soups.
  • Greedy soup is the central method.
  • Consider a neural network f(x, θ) with input data x and parameters θ.
  • For hyperparameter configurations h1, …, hk, let θi = FineTune(θ0, hi) denote the parameters obtained by fine-tuning the pre-trained initialization θ0 with configuration hi. Conventionally, the parameters θj that attain the highest accuracy on a held-out validation set are selected, and the remaining parameters are discarded.
Model Soups Improve Accuracy
  • Instead, model soups f(x, θS) use an average of the θi: θS = (1/|S|) Σ_{i∈S} θi, where S ⊆ {1, …, k} is the set of models included in the soup.
  • The uniform soup is constructed by averaging all fine-tuned models θi, so S = {1, …, k}. However, including a model with low accuracy can drag down the accuracy of the uniform soup.
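
As a minimal sketch of the uniform soup, assuming each checkpoint is represented as a plain dict from parameter name to a flat list of floats (in practice these would be framework tensors), the averaging is done parameter by parameter. The function name `uniform_soup` and the toy checkpoints are illustrative assumptions, not taken from the paper's code.

```python
def uniform_soup(checkpoints):
    """Average checkpoints parameter by parameter with equal weights.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats; all checkpoints must share the same names and lengths.
    """
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


# Three toy "fine-tuned models" (hypothetical values), each with a
# two-element weight vector and a one-element bias.
theta_1 = {"w": [1.0, 2.0], "b": [0.0]}
theta_2 = {"w": [3.0, 4.0], "b": [3.0]}
theta_3 = {"w": [5.0, 6.0], "b": [6.0]}

soup = uniform_soup([theta_1, theta_2, theta_3])
print(soup)  # {'w': [3.0, 4.0], 'b': [3.0]}
```

The soup has the same shape as any single checkpoint, which is why inference cost is the same as for one model.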

This issue can be circumvented with a greedy soup (Recipe 1). The candidate models are first sorted in decreasing order of held-out validation accuracy. The greedy soup is then constructed by sequentially adding each model as a potential ingredient in the soup, and only keeping the model in the soup if performance on a held-out validation set (disjoint from the training and test sets) improves.
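
The greedy procedure can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: checkpoints are plain dicts of float lists, the "model" is a hypothetical one-parameter threshold classifier, `val_accuracy` is a made-up held-out evaluation, and accepting ties (`>=`) is a choice made in this sketch.

```python
def average_weights(checkpoints):
    """Equal-weight average of checkpoints, parameter by parameter."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


def greedy_soup(checkpoints, val_accuracy):
    """Return (soup, kept), where kept lists the indices of the ingredients."""
    # Visit candidates from best to worst held-out accuracy.
    order = sorted(range(len(checkpoints)),
                   key=lambda i: val_accuracy(checkpoints[i]),
                   reverse=True)
    kept = [order[0]]
    best = val_accuracy(checkpoints[order[0]])
    for i in order[1:]:
        candidate = average_weights([checkpoints[j] for j in kept + [i]])
        score = val_accuracy(candidate)
        if score >= best:  # ties are accepted: a choice made in this sketch
            kept.append(i)
            best = score
    return average_weights([checkpoints[j] for j in kept]), kept


# Hypothetical toy task: predict 1 when x > t; the true boundary is 4.5.
VAL_X = list(range(10))
VAL_Y = [int(x >= 5) for x in VAL_X]


def val_accuracy(theta):
    t = theta["t"][0]
    correct = sum(int(x > t) == y for x, y in zip(VAL_X, VAL_Y))
    return correct / len(VAL_X)


checkpoints = [{"t": [3.2]}, {"t": [5.8]}, {"t": [9.5]}]
soup, kept = greedy_soup(checkpoints, val_accuracy)
print(kept, val_accuracy(soup))                    # [0, 1] 1.0
print(val_accuracy(average_weights(checkpoints)))  # 0.8
```

In this toy run the best single checkpoint scores 0.9, the greedy soup of the first two scores 1.0, and averaging all three (including the poor third checkpoint) drops to 0.8, which mirrors the motivation for preferring the greedy soup over the uniform one.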

  • A more advanced learned soup recipe is also explored, which optimizes the model interpolation weights by gradient-based minibatch optimization (see Appendix I for details). This procedure requires loading all models in memory simultaneously, which currently hinders its use with large networks.
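
As a rough sketch of what learning the interpolation weights means, one can write the soup as a weighted combination with coefficients α and fit α on held-out data. The notation below (α, the loss ℓ, and the held-out examples (x_j, y_j)) is an assumption for illustration; the paper's exact formulation is in its Appendix I and may include additional terms.

```latex
\theta_S(\alpha) = \sum_{i=1}^{k} \alpha_i \, \theta_i ,
\qquad
\alpha^\star = \arg\min_{\alpha \in \mathbb{R}^k}
  \sum_{j} \ell\big( f(x_j, \theta_S(\alpha)),\; y_j \big)
```

Since the gradient with respect to each α_i involves the corresponding θ_i, every fine-tuned model has to be available during optimization, which is consistent with the memory limitation noted above.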

2. Results

2.1. Models

The primary models that are fine-tuned are the CLIP, ALIGN, and BASIC models pretrained with contrastive supervision from image-text pairs, a ViT-G/14 model pre-trained on JFT-3B, and Transformer models for text classification.

For essentially any number of models, the greedy soup outperforms the best single model on both the ImageNet and the out-of-distribution test sets.

Greedy soup can provide additional gains on top of standard hyperparameter tuning even in the extremely high accuracy regime.

While the improvements in text classification are not as pronounced as in image classification, the greedy soup can improve performance over the best individual model in many cases.
