Review — MoCo v3: An Empirical Study of Training Self-Supervised Vision Transformers

Instability Study of ViT for Self-Supervised Learning

(Figure from ViT)

An Empirical Study of Training Self-Supervised Vision Transformers
MoCo v3, by Facebook AI Research (FAIR)
2021 ICCV, Over 100 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Unsupervised Learning, Contrastive Learning, Representation Learning, Image Classification, Vision Transformer (ViT)

  • MoCo v3 is an incremental improvement of MoCo v1/MoCo v2, studying the instability issue when ViT is used for self-supervised learning.


  1. MoCo v3 Using ResNet (Before Using ViT)
  2. Stability Study for Basic Factors When Using ViT
  3. Trick for Improving Stability
  4. Experimental Results

1. MoCo v3 Using ResNet (Before Using ViT)

  • As in MoCo v2, the InfoNCE loss from CPC is used as the training loss. A large batch is used to include more negative samples.
  • Unlike MoCo v2, MoCo v3 uses keys that naturally co-exist in the same batch, so the memory queue (memory bank) is abandoned. In this respect, the setting is the same as SimCLR.
  • The encoder fq consists of a backbone (e.g., ResNet, ViT), a projection head [10], and an extra prediction head [18].
  • The encoder fk has the backbone and projection head, but not the prediction head. As in MoCo, fk is updated as a moving average of fq, excluding the prediction head.
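The loss described in this section can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' implementation: the function names and the temperature value tau=0.2 are assumptions made for the example, and vectors are plain lists instead of tensors. For each query, the key at the same batch index is the positive, and all other keys in the batch act as negatives.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, tau=0.2):
    """Mean InfoNCE loss over a batch.

    keys[i] is the positive for queries[i]; every other key in the same
    batch is a negative. tau is an assumed temperature for illustration.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / tau for kj in k]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(q)

# Toy batch: matched query/key pairs give a much lower loss than mismatched ones.
queries = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
```

A larger batch simply adds more entries to each row of logits, which is why a large batch supplies more negatives without a memory queue.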

By using ResNet-50, the improvement here is mainly due to the extra prediction head and large-batch (4096) training.

2. Stability Study for Basic Factors When Using ViT

  • Replacing a ResNet backbone with a ViT backbone is straightforward. In practice, however, the main challenge is training instability.

2.1. Batch Size

  • A large batch is also beneficial for accuracy. Batches of 1k and 2k produce reasonably smooth curves, with 71.5% and 72.6% linear probing accuracy, respectively.

The curve of a 4k batch becomes noticeably unstable: see the “dips”. The curve of a 6k batch has worse failure patterns.

2.2. Learning Rate

  • When lr is smaller, the training is more stable, but it is prone to under-fitting.

The curve for lr=1.5e-4 in this setting has more dips, and its accuracy is lower. In this regime, the accuracy is determined by stability.
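A value such as lr=1.5e-4 looks small for large-batch training. The sketch below assumes the common linear scaling rule, where the quoted value is a base learning rate multiplied by BatchSize/256. That rule is not stated in this review, so treat it as an assumption; the function name and the batch size of 4096 are also chosen only for illustration.

```python
def scaled_lr(base_lr, batch_size, reference_batch=256):
    """Linear scaling rule (assumed here): effective lr grows with batch size."""
    return base_lr * batch_size / reference_batch

# With a base lr of 1.5e-4 and a 4096 batch, the effective lr is about 2.4e-3.
effective = scaled_lr(1.5e-4, 4096)
```

Under this assumption, raising either the base lr or the batch size raises the effective step size, which fits the observation that both larger batches and larger lr make training less stable.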

2.3. Optimizer

  • AdamW is the default optimizer; LAMB is also studied.
  • Although LAMB can avoid sudden changes in the gradients, the negative impact of unreliable gradients accumulates.

As a result, the authors opt to use AdamW.

3. Trick for Improving Stability

3.1. Random Patch Projection

  • It is found that a sudden change of gradients (a “spike”) causes a “dip” in the training curve.

The gradient spikes happen earlier in the first layer (patch projection), and are delayed by a couple of iterations in the last layers.

  • The instability happens earlier in the shallower layers.
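One way to make the "spike first appears in the shallow layer" observation concrete is to log a gradient magnitude per layer at every iteration and look for the first sudden jump. The sketch below is a hypothetical helper, not something taken from the paper: the threshold factor, the window length, and the two traces are all made up for illustration.

```python
def first_spike(grad_norms, factor=3.0, window=5):
    """Return the first iteration whose gradient magnitude exceeds `factor`
    times the mean of the preceding `window` iterations, or None if no
    such iteration exists."""
    for t in range(window, len(grad_norms)):
        baseline = sum(grad_norms[t - window:t]) / window
        if grad_norms[t] > factor * baseline:
            return t
    return None

# Made-up traces: the first layer spikes at iteration 6, the last layer at 9.
first_layer = [1.0, 1.1, 0.9, 1.0, 1.0, 1.0, 6.0, 1.2, 1.0, 1.0, 1.0, 1.0]
last_layer = [1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 1.1, 1.0]

delay = first_spike(last_layer) - first_spike(first_layer)  # 3 iterations
```

A positive delay between the first and last layers is the pattern described above: the instability starts in the shallow layers and shows up in the deeper ones later.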

Since the instability starts in the patch projection, the fix is to freeze this layer during training. In other words, a fixed random patch projection layer is used to embed the patches.

Random patch projection stabilizes training, with smoother and better training curves. This stability benefits the final accuracy, boosting the accuracy by 1.7% to 73.4%.
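The idea of a fixed random patch projection can be sketched as follows: the layer is randomly initialized once and then excluded from every update. This is only a toy illustration. Plain SGD is swapped in for AdamW to keep it short, parameters are small lists instead of tensors, and the names `init_params`, `sgd_step`, `patch_proj` and `block` are hypothetical.

```python
import random

def init_params(seed=0):
    """Randomly initialize two toy parameter groups."""
    rng = random.Random(seed)
    return {
        "patch_proj": [rng.gauss(0.0, 0.02) for _ in range(4)],
        "block": [rng.gauss(0.0, 0.02) for _ in range(4)],
    }

def sgd_step(params, grads, lr, frozen=("patch_proj",)):
    """One SGD step that leaves frozen groups untouched, so the patch
    projection keeps its random initial weights for the whole run."""
    return {
        name: list(values) if name in frozen
        else [w - lr * g for w, g in zip(values, grads[name])]
        for name, values in params.items()
    }

params = init_params()
grads = {name: [1.0] * len(values) for name, values in params.items()}
updated = sgd_step(params, grads, lr=0.1)
```

In a deep learning framework, the same effect is usually obtained by stopping the gradient right after the patch projection or by excluding its weights from the optimizer.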

3.2. Random Patch Projection on SimCLR and BYOL

Random patch projection improves stability in both SimCLR and BYOL, and increases the accuracy by 0.8% and 1.3%, respectively.

4. Experimental Results

4.1. Models

4.2. Training Time

  • Training ViT-B for 100 epochs takes 2.1 hours. ViT-H takes 9.8 hours per 100 epochs using 512 TPUs.
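To put these numbers in perspective, here is a small back-of-the-envelope calculation. It assumes wall-clock time scales linearly with the number of epochs, which is an assumption made here rather than a claim from the paper, and the function names are only for illustration.

```python
def device_hours(wall_clock_hours, num_devices):
    """Total compute in device-hours for one run."""
    return wall_clock_hours * num_devices

def scaled_wall_clock(hours_per_100_epochs, epochs):
    """Wall-clock hours for a longer schedule, assuming linear scaling."""
    return hours_per_100_epochs * epochs / 100

# ViT-H: 9.8 hours on 512 TPUs is roughly 5,000 TPU-hours per 100 epochs.
vit_h_tpu_hours = device_hours(9.8, 512)

# ViT-B: a 300-epoch schedule would take about 6.3 hours under this assumption.
vit_b_300_epochs = scaled_wall_clock(2.1, 300)
```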

4.3. Self-Supervised Learning Frameworks

MoCo v3 has better accuracy on ViT than other frameworks.

MoCo v3 and SimCLR are more favorable for ViT-B than for R50, i.e., they obtain higher accuracy with ViT-B.

4.4. Ablations of ViT + MoCo v3

Surprisingly, the model works decently even with no position embedding (74.9%). The capability to encode positions contributes only 1.6%.

Class token is not essential for the system to work.

BN is not necessary for contrastive learning to work, yet appropriate usage of BN can improve accuracy.

Removing the prediction MLP head still gives a decent result of 75.5%.

The optimal value of the momentum coefficient is m=0.99 (default). The case of m=0 is analogous to SimCLR.
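The role of m can be seen from the moving-average update of the key encoder fk. Below is a minimal sketch with parameters as flat lists; the function name is hypothetical, and only the backbone and projection head parameters would take part in this update, since fk has no prediction head.

```python
def momentum_update(key_params, query_params, m=0.99):
    """Moving-average update: fk <- m * fk + (1 - m) * fq, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

key = [1.0, 2.0]
query = [0.0, 4.0]

slow = momentum_update(key, query, m=0.99)  # fk moves only slightly toward fq
copy = momentum_update(key, query, m=0.0)   # fk becomes an exact copy of fq
```

With m=0 the key encoder simply mirrors the query encoder at every step, which is why that case is analogous to SimCLR.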

The smaller ViT-S enjoys the benefit of training longer, and improves by 0.9% when extending to 600 epochs.

4.5. Comparisons with Prior Art

MoCo-based ViT has higher accuracy and smaller models than iGPT, under the same linear probing protocol and training data.

  • ViT models have higher accuracy when the models are bigger.

“ViT, MoCo v3” is slightly better than “ResNet, SimCLR v2” in the small-model regime, but the envelopes become just comparable for larger models.

  • SimCLR v2 with SK-ResNet (Selective Kernel [29], a form of attention) has a higher envelope. BYOL also has a higher envelope with wider ResNets (1–4×), and has an outstanding point with a deeper ResNet (R200–2×).

Using the DeiT codebase, MoCo v3 achieves 83.2% with ViT-B under 150-epoch fine-tuning. In addition, MoCo v3 achieves 84.1% with ViT-L when fine-tuned for only 100 epochs with a drop path rate of 0.5.

4.6. Transfer Learning

  • The proposed self-supervised ViT has better transfer learning accuracy when the model size increases from ViT-B to ViT-L, yet it gets saturated when increased to ViT-H.



