Review — SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

SwAV, Using Clustering Codes Via Sinkhorn Distance, No Contrastive Learning, No Large Memory Bank, No Momentum Network

Sik-Ho Tsang
7 min read · Aug 25


Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (SwAV), by Inria and Facebook AI Research
2020 NeurIPS, Over 2500 Citations (Sik-Ho Tsang @ Medium)


  • Swapping Assignments between multiple Views of the same image (SwAV) is proposed. It clusters the data while simultaneously enforcing consistency between the cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning.
  • A “swapped” prediction mechanism is used, where the code of one view is predicted from the representation of another view.
  • A new data augmentation strategy, multi-crop, is also proposed. It uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much.


  1. SwAV Overall
  2. SwAV Online Clustering
  3. SwAV Multi-Crop
  4. Experimental Results

1. SwAV Overall

1.1. Prior Art

  • Clustering-based methods, e.g. DeepCluster, alternate offline between a cluster assignment step and a training step, where the network is trained to predict the cluster assignments, i.e. the codes, for different image views. However, each cluster assignment step requires a forward pass over the entire dataset.
  • Inspired by contrastive learning, SwAV does not treat the codes as a target. Instead, it only enforces a consistent mapping between views of the same image.

The method can be interpreted as a way of contrasting between multiple image views by comparing their cluster assignments instead of their features.

1.2. SwAV Overall Idea

Contrastive instance learning (left) vs. Proposed SwAV (right)
  • A code from an augmented version of the image is computed and this code is predicted from other augmented versions of the same image.

Given two image features z_t and z_s from two different augmentations of the same image, their codes q_t and q_s are computed by matching these features to a set of K prototypes {c_1, …, c_K}.

  • Then, a “swapped” prediction problem is set up with the following loss function (Eq. 1):
      $$L(z_t, z_s) = \ell(z_t, q_s) + \ell(z_s, q_t)$$
  • where the function ℓ(z, q) measures the fit between the features z and a code q.
  1. In SwAV, first the “codes” are obtained by assigning features to prototype vectors.
  2. Then, a “swapped” prediction problem is solved wherein the codes obtained from one data augmented view are predicted using the other view.
  • Thus, SwAV does not directly compare image features.
  • Prototype vectors are learned along with the ConvNet parameters by backpropagation.

Intuitively, the proposed method compares the features z_t and z_s using the intermediate codes q_t and q_s. If these two features capture the same information, it should be possible to predict the code from the other feature.

2. SwAV Online Clustering

2.1. Code Computation

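  • Each image x_n is transformed into an augmented view x_nt by applying a transformation t sampled from the set T of image transformations.
  • The augmented view is mapped to a vector representation by applying a non-linear mapping f_θ to x_nt.
  • The feature is then projected to the unit sphere:
      $$z_{nt} = \frac{f_\theta(x_{nt})}{\|f_\theta(x_{nt})\|_2}$$

A code q_nt is then computed from this feature by mapping z_nt to a set of K trainable prototype vectors, {c_1, …, c_K}. C denotes the matrix whose columns are c_1, …, c_K.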
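
The two steps above (projection to the unit sphere, then comparison with the prototypes) can be sketched in a few lines of plain Python. This is only a minimal illustration: the 2-dimensional feature, the two prototypes, and the helper names `l2_normalize` and `prototype_scores` are hypothetical and not taken from the paper or its code. Prototypes are stored as rows here for convenience.

```python
import math

def l2_normalize(vec):
    """Project a feature vector onto the unit sphere (divide by its L2 norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def prototype_scores(z, prototypes):
    """Dot product between a normalized feature z and every prototype c_k."""
    return [sum(zi * ci for zi, ci in zip(z, c)) for c in prototypes]

# Hypothetical encoder output f_theta(x_nt) and K = 2 hypothetical prototypes.
feature = [3.0, 4.0]
prototypes = [[1.0, 0.0], [0.0, 1.0]]

z = l2_normalize(feature)                  # unit-norm feature z_nt
scores = prototype_scores(z, prototypes)   # one similarity score per prototype
```

These scores are the dot products z^T c_k that feed both the code computation and the softmax in the swapped prediction loss.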

2.2. Swapped Prediction Problem

  • The Swapped Prediction Problem has two terms as shown in Eq. 1.

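Each term is the cross-entropy loss between the code and the probability obtained by taking a softmax of the dot products of z_i and all prototypes in C (Eq. 2):

$$\ell(z_t, q_s) = -\sum_{k} q_s^{(k)} \log p_t^{(k)}, \quad p_t^{(k)} = \frac{\exp(z_t^\top c_k / \tau)}{\sum_{k'} \exp(z_t^\top c_{k'} / \tau)}$$

  • where τ is a temperature parameter.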

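Taking this loss over all the images and pairs of data augmentations leads to the following loss function for the swapped prediction problem:

$$-\frac{1}{N} \sum_{n=1}^{N} \sum_{s,t \sim \mathcal{T}} \left[ \frac{1}{\tau} z_{nt}^\top C q_{ns} + \frac{1}{\tau} z_{ns}^\top C q_{nt} - \log \sum_{k=1}^{K} \exp\left(\frac{z_{nt}^\top c_k}{\tau}\right) - \log \sum_{k=1}^{K} \exp\left(\frac{z_{ns}^\top c_k}{\tau}\right) \right]$$

  • This loss function is jointly minimized with respect to the prototypes C and the parameters θ of the image encoder f_θ.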
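
To make the swapped prediction concrete, here is a minimal sketch of the two-term loss for a single pair of views, written in plain Python. The function names, the toy scores, and the temperature value of 0.1 are assumptions chosen for illustration. The codes are passed in as given inputs; how they are computed is a separate step.

```python
import math

def softmax(scores, temperature):
    """Softmax of prototype scores divided by the temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(code, probs):
    """Cross entropy between a code q and predicted probabilities p."""
    return -sum(q * math.log(p) for q, p in zip(code, probs) if q > 0)

def swapped_loss(scores_t, scores_s, q_t, q_s, temperature=0.1):
    """l(z_t, q_s) + l(z_s, q_t): each view predicts the code of the other view."""
    p_t = softmax(scores_t, temperature)
    p_s = softmax(scores_s, temperature)
    return cross_entropy(q_s, p_t) + cross_entropy(q_t, p_s)

# Toy prototype scores (z^T c_k) for two views of one image, K = 2 prototypes.
scores_t, scores_s = [0.9, 0.1], [0.8, 0.2]
loss_agree = swapped_loss(scores_t, scores_s, q_t=[1.0, 0.0], q_s=[1.0, 0.0])
loss_disagree = swapped_loss(scores_t, scores_s, q_t=[0.0, 1.0], q_s=[0.0, 1.0])
```

When the codes point to the prototype that both views already score highest, the loss is small. When they point elsewhere, it is large, which is the signal that pulls the two views towards consistent assignments.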

2.3. Computing Codes Online

Intuitively, as the prototypes C are used across different batches, SwAV clusters multiple instances to the prototypes.

  • Codes are computed using the prototypes C such that all the examples in a batch are equally partitioned by the prototypes. This equipartition constraint ensures that the codes for different images in a batch are distinct, thus preventing the trivial solution where every image has the same code.
  • Given B feature vectors Z = [z_1, …, z_B], we are interested in mapping them to the prototypes C = [c_1, …, c_K].
  • This mapping, i.e. the codes, is denoted by Q = [q_1, …, q_B], and Q is optimized to maximize the similarity between the features and the prototypes (Eq. 3):
      $$\max_{Q \in \mathcal{Q}} \operatorname{Tr}(Q^\top C^\top Z) + \varepsilon H(Q)$$
  • where H(Q) = −Σ_ij Q_ij log Q_ij is the entropy function and ε is a parameter that controls the smoothness of the mapping.
  • The authors adapt the SeLA approach, which is based on the Sinkhorn distance, to work on minibatches by restricting the transportation polytope to the minibatch:
      $$\mathcal{Q} = \left\{ Q \in \mathbb{R}_+^{K \times B} \;\middle|\; Q \mathbf{1}_B = \tfrac{1}{K} \mathbf{1}_K, \; Q^\top \mathbf{1}_K = \tfrac{1}{B} \mathbf{1}_B \right\}$$
  • where 1_K denotes the vector of ones in dimension K. These constraints enforce that, on average, each prototype is selected at least B/K times in the batch.
  • The soft codes Q* are the solution of Eq. 3 over this set and take the form of a normalized exponential matrix:
      $$Q^* = \operatorname{Diag}(u) \exp\left(\frac{C^\top Z}{\varepsilon}\right) \operatorname{Diag}(v)$$
  • where u and v are renormalization vectors of size K and B respectively.
  • In practice, when working with small batches, features from previous batches are used to augment the size of Z. Around 3K features are stored: with a batch size of 256, only features from the last 15 batches are kept. In contrast, contrastive methods typically need to store the last 65K instances obtained from the last 250 batches.
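
The renormalization can be carried out by alternately rescaling the rows and columns of exp(C^T Z / ε), which is the Sinkhorn-Knopp iteration. Below is a minimal plain-Python sketch under stated assumptions: the function name `sinkhorn_codes`, the default values ε = 0.05 and 3 iterations, and the toy scores are illustrative choices, not a copy of the authors' implementation.

```python
import math

def sinkhorn_codes(scores, epsilon=0.05, n_iters=3):
    """Compute soft codes from scores[b][k] = z_b . c_k (B samples, K prototypes).

    Returns codes[b][k]; each sample's code sums to 1, and the mass given to
    each prototype over the batch is pushed towards B / K.
    """
    B, K = len(scores), len(scores[0])
    m = max(max(row) for row in scores)  # constant shift, cancels after normalization
    # Q is K x B, proportional to exp(C^T Z / epsilon).
    Q = [[math.exp((scores[b][k] - m) / epsilon) for b in range(B)] for k in range(K)]
    for _ in range(n_iters):
        # Row step: every prototype receives total mass 1 / K.
        for k in range(K):
            row_sum = sum(Q[k])
            Q[k] = [v / (row_sum * K) for v in Q[k]]
        # Column step: every sample receives total mass 1 / B.
        for b in range(B):
            col_sum = sum(Q[k][b] for k in range(K))
            for k in range(K):
                Q[k][b] /= col_sum * B
    # Rescale so that each sample's code sums to 1.
    return [[Q[k][b] * B for k in range(K)] for b in range(B)]

# Toy batch: all four samples score prototype 0 higher than prototype 1.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.6], [0.6, 0.5]]
codes = sinkhorn_codes(scores)
usage = [sum(code[k] for code in codes) for k in range(2)]  # mass per prototype
```

In this toy batch a plain softmax would send every sample to prototype 0. The equipartition constraint instead moves the two least confident samples to prototype 1, so the trivial solution with a single shared code is avoided.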

3. SwAV Multi-Crop

Multi-crop: the image xn is transformed into V + 2 views: two global views and V small resolution zoomed views
  • A multi-crop strategy is proposed where two standard-resolution crops are used, together with V additional low-resolution crops that cover only small parts of the image. Using low-resolution images ensures only a small increase in the compute cost.
  • The loss is generalized to:
      $$L(z_{t_1}, z_{t_2}, \ldots, z_{t_{V+2}}) = \sum_{i \in \{1, 2\}} \sum_{v=1}^{V+2} \mathbf{1}_{v \neq i} \, \ell(z_{t_v}, q_{t_i})$$
  • Codes are only computed for the full-resolution crops.
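
The generalized loss pairs each of the two full-resolution codes with every other view. A minimal sketch of that double loop in plain Python follows. The function names, the temperature of 0.1, and the toy inputs are assumptions for illustration. The sketch sums the terms without any averaging, which is an implementation choice.

```python
import math

def cross_entropy_from_scores(code, scores, temperature):
    """l(z, q): cross entropy between code q and softmax(scores / temperature)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))  # log-sum-exp
    return -sum(q * (s - log_norm) for q, s in zip(code, scaled))

def multicrop_loss(view_scores, global_codes, temperature=0.1):
    """Sum of l(z_v, q_i) over the two global codes i and all views v != i.

    view_scores: prototype scores for all V + 2 views, global views first.
    global_codes: codes of the two full-resolution views only.
    """
    total = 0.0
    for i, code in enumerate(global_codes):
        for v, scores in enumerate(view_scores):
            if v != i:  # a view never predicts its own code
                total += cross_entropy_from_scores(code, scores, temperature)
    return total

# Toy setup: V = 2 low-resolution views, K = 2 prototypes, uninformative scores.
view_scores = [[0.0, 0.0]] * 4
global_codes = [[0.5, 0.5], [0.5, 0.5]]
loss = multicrop_loss(view_scores, global_codes)
```

With V + 2 views there are 2 × (V + 1) terms, so the low-resolution crops add extra prediction targets without requiring any extra code computation.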

4. Experimental Results

4.1. Linear Classification

Linear classification on ImageNet.

When using frozen features (left), SwAV outperforms the state of the art by +4.2% top-1 accuracy and is only 1.2% below the performance of a fully supervised model.

4.2. Semi-Supervised Learning

On semi-supervised learning, SwAV outperforms other self-supervised methods and is on par with state-of-the-art semi-supervised models.

4.3. Downstream Tasks

Transfer learning on downstream tasks.

Left: For linear classification performance on the Places205, VOC07, and iNaturalist2018 datasets, SwAV outperforms supervised features on all three datasets. Note that SwAV is the first self-supervised method to surpass ImageNet supervised features on these datasets.

Right: For object detection fine-tuning, SwAV outperforms the supervised pretrained model on both VOC07+12 and COCO datasets.

4.4. Small-Batch Setting

Training in small-batch setting

Although SwAV only stores a queue of 3,840 features (15 batches of 256), it maintains state-of-the-art performance even when trained in the small-batch setting.

SwAV learns much faster and reaches higher performance in 4× fewer epochs.
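
The queue size of 3,840 follows from keeping the features of the last 15 batches at a batch size of 256. A bounded first-in-first-out buffer is one simple way to sketch this. The names below are hypothetical, and placeholder tuples stand in for real feature vectors.

```python
from collections import deque

BATCH_SIZE = 256
NUM_BATCHES_KEPT = 15
QUEUE_SIZE = BATCH_SIZE * NUM_BATCHES_KEPT  # 15 * 256 = 3840 features

def enqueue_batch(queue, batch_features):
    """Append a batch; a deque with maxlen drops the oldest entries automatically."""
    queue.extend(batch_features)

feature_queue = deque(maxlen=QUEUE_SIZE)
for batch_idx in range(20):
    # Placeholder "features": (batch index, position in batch) tuples.
    batch = [(batch_idx, i) for i in range(BATCH_SIZE)]
    enqueue_batch(feature_queue, batch)
```

After 20 batches the queue holds only the 15 most recent ones, so the stored features stay close to the current state of the encoder.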

4.5. Multi-Crop Setting

Top-1 accuracy on ImageNet with a linear classifier trained on top of frozen features from a ResNet-50. (left) Comparison between clustering-based and contrastive instance methods and impact of multi-crop. Self-supervised methods are trained for 400 epochs and supervised models for 200 epochs. (right) Performance as a function of epochs.

The multi-crop strategy consistently improves the performance of all the considered methods by a significant margin of 2–4% top-1 accuracy.

4.7. Uncurated Data Pretraining

Pretraining on uncurated data.
  • SwAV is pretrained on an uncurated dataset of 1 billion random public non-EU images from Instagram.

SwAV outperforms training from scratch by a significant margin.

4.8. Larger Architectures

Large architectures.

SwAV benefits from training on large architectures.


