Review — MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

Momentum Update for the Key Encoder, Outperforms Exemplar-CNN, Context Prediction, Jigsaw Puzzles, RotNet/Image Rotations, Colorization, DeepCluster, Instance Discrimination, CPCv1, CPCv2, CMC.

Sik-Ho Tsang
9 min read · Feb 6, 2022
Momentum Contrast (MoCo)

Momentum Contrast for Unsupervised Visual Representation Learning
MoCo, by Facebook AI Research (FAIR)
2020 CVPR, Over 2400 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Contrastive Learning, Image Classification, Object Detection, Segmentation

  • A dynamic dictionary is built with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.


  1. Contrastive Learning as Dictionary Lookup
  2. Momentum Contrast (MoCo)
  3. Ablation & ImageNet Results
  4. Transferring Features Results

1. Contrastive Learning as Dictionary Lookup

Left: Contrastive Learning Without Dictionary Lookup, Right: Contrastive Learning With Dictionary Lookup
  • Contrastive learning since DrLIM, and its recent developments, can be thought of as training an encoder for a dictionary look-up task.
  • Consider an encoded query q and a set of encoded samples {k0, k1, k2, …} that are the keys of a dictionary.
  • Assume that there is a single key (denoted as k+) in the dictionary that q matches. A contrastive loss in DrLIM is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys (considered negative keys for q).
  • With similarity measured by dot product, a form of contrastive loss function called InfoNCE (CPCv1) is considered in this paper:

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$$

  • where τ=0.07 is a temperature hyper-parameter, as in Instance Discrimination.
  • The sum is over one positive and K negative samples. Intuitively, this loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify q as k+.
  • In general, the query representation is q=fq(xq) where fq is an encoder network and xq is a query sample.
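As a minimal sketch of the loss described above, the following pure-Python function evaluates InfoNCE for a single query. The function name `info_nce` and the toy 2-D vectors are illustrative assumptions, not code from the paper, and the vectors are assumed to be L2-normalized already.

```python
import math

def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE loss for one query (illustrative sketch, not the authors' code).

    q, k_pos: lists of floats, assumed L2-normalized.
    k_negs:   list of K negative key vectors.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Logits: the positive key first, then the K negatives, all divided by tau.
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    # Numerically stable log-sum-exp over the (K+1) logits.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    # Cross-entropy of a (K+1)-way softmax with the positive key as the target.
    return log_sum - logits[0]

q = [1.0, 0.0]
# Positive key aligned with q: the loss should be close to zero.
loss_match = info_nce(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# "Positive" key orthogonal to q while a negative matches: the loss is large.
loss_mismatch = info_nce(q, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

With a small temperature such as τ=0.07, the logits are scaled up by roughly 14×, so even moderate differences in similarity produce a sharply peaked softmax.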

1.1. (a) End-to-End

  • The end-to-end update by back-propagation is a natural mechanism.
  • The keys are consistently encoded (by the same set of encoder parameters). But the dictionary size is coupled with the mini-batch size, limited by the GPU memory size. It is also challenged by large mini-batch optimization.

1.2. (b) Memory Bank

  • The dictionary for each mini-batch is randomly sampled from the memory bank with no back-propagation, so it can support a large dictionary size.
  • Each sample's representation in the memory bank was updated when that sample was last seen, so the sampled keys come from encoders at many different steps over the past epoch and are thus less consistent.
  • A momentum update is adopted on the memory bank in Instance Discrimination. Its momentum update is on the representations of the same sample, not the encoder.
  • (Please read NCE, CPCv1 and Instance Discrimination for more details.)

2. Momentum Contrast (MoCo)

Momentum Contrast (MoCo)
  • The dictionary is dynamic in the sense that the keys are randomly sampled, and that the key encoder evolves during training.

The hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution.

2.1. Dictionary as a Queue

  • MoCo maintains the dictionary as a queue of data samples.
  • This allows reusing the encoded keys from the immediately preceding mini-batches.
  • The dictionary size can be much larger than a typical mini-batch size.

The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed.

  • Removing the oldest mini-batch can be beneficial, because its encoded keys are the most outdated.
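The queue behaviour described above can be sketched with a bounded `collections.deque`. The class name `KeyQueue`, the toy sizes, and the string stand-ins for encoded keys are assumptions made for illustration only.

```python
from collections import deque

class KeyQueue:
    """FIFO dictionary of encoded keys (illustrative sketch)."""

    def __init__(self, max_keys):
        # A deque with maxlen discards the oldest items automatically when full.
        self.keys = deque(maxlen=max_keys)

    def enqueue(self, batch_keys):
        # Append the newest mini-batch; the oldest keys drop off the left end.
        self.keys.extend(batch_keys)

queue = KeyQueue(max_keys=6)          # toy dictionary size K = 6
for step in range(4):                 # 4 mini-batches of N = 2 keys each
    queue.enqueue([f"k{step}_{i}" for i in range(2)])
contents = list(queue.keys)           # keys from the 3 most recent batches
```

Because the queue size is set independently of the mini-batch size, the dictionary can hold many mini-batches' worth of keys at once.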

2.2. Momentum Update

  • Using a queue makes it intractable to update the key encoder by back-propagation.
  • A naïve solution is to copy the key encoder fk from the query encoder fq, ignoring this gradient. But this solution yields poor results. It is hypothesized that such failure is caused by the rapidly changing encoder that reduces the key representations’ consistency.

Formally, denoting the parameters of fk as θk and those of fq as θq, θk is updated by:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$

Here m ∈ [0, 1) is a momentum coefficient. Only the parameters θq are updated by back-propagation.

  • As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small.
  • In experiments, a relatively large momentum (e.g., m=0.999, default) works much better than a smaller value (e.g., m=0.9), suggesting that a slowly evolving key encoder is core to making use of a queue.
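A minimal sketch of the momentum update, treating each encoder's parameters as a flat list of floats. The function name `momentum_update` and the toy parameter values are assumptions for illustration; in a real setup the query parameters would also change at every step through back-propagation.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Return the updated key-encoder parameters: m*theta_k + (1-m)*theta_q."""
    return [m * pk + (1.0 - m) * pq for pk, pq in zip(theta_k, theta_q)]

theta_q = [1.0, -2.0]   # toy query-encoder parameters (held fixed in this demo)
theta_k = [0.0, 0.0]    # toy key-encoder parameters
for _ in range(1000):
    # No gradient flows into theta_k; it only tracks theta_q slowly.
    theta_k = momentum_update(theta_k, theta_q, m=0.999)
```

In this toy run, after 1000 steps with m=0.999 the key parameters have moved only about 63% of the way toward the query parameters, which illustrates how slowly the key encoder drifts.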

2.3. Some Other Details

MoCo Algorithm
  • Query x_q and key x_k are two augmented versions of x, i.e. two random “views” of the same image under random data augmentation, which form a positive pair.
  • The data augmentation setting follows: a 224×224-pixel crop is taken from a randomly resized image, and then undergoes random color jittering, random horizontal flip, and random grayscale conversion.
  • The queries and keys are respectively encoded by their encoders, f_q and f_k.
  • Similar to Instance Discrimination, a ResNet is used as the encoder, whose last fully-connected layer (after global average pooling) has a fixed-dimensional output (128-D). This output vector is normalized by its L2-norm. This is the representation of the query or key.
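The L2 normalization mentioned above can be sketched as follows. The helper name `l2_normalize` and the `eps` guard against division by zero are assumptions added for illustration. After normalization, the dot product between a query and a key equals their cosine similarity.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

feat = [3.0, 4.0]            # toy 2-D stand-in for the 128-D encoder output
unit = l2_normalize(feat)    # unit-length representation
unit_norm = math.sqrt(sum(x * x for x in unit))
```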

2.4. Shuffling BN

  • Using BN prevents the model from learning good representations. The model appears to “cheat” the pretext task and easily finds a low-loss solution.
  • Multiple GPUs are used to train the model and BN is performed on the samples independently for each GPU (as done in common practice).
  • For the key encoder fk, the sample order in the current mini-batch is shuffled before distributing it among GPUs (and shuffled back after encoding); the sample order of the mini-batch for the query encoder fq is not altered.
  • This ensures the batch statistics used to compute a query and its positive key come from two different subsets. This effectively tackles the cheating issue and allows training to benefit from BN.
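The shuffle-and-restore bookkeeping can be sketched on a single machine as below. The function names, the fixed seed, and the string-tagging stand-in for the key encoder are all assumptions for illustration; the actual multi-GPU distribution step is not modelled here.

```python
import random

def shuffle_for_key_encoder(batch, rng):
    """Shuffle a batch and return it together with the permutation used."""
    perm = list(range(len(batch)))
    rng.shuffle(perm)
    return [batch[i] for i in perm], perm

def unshuffle(encoded, perm):
    """Restore the original sample order after encoding."""
    restored = [None] * len(encoded)
    for new_pos, old_pos in enumerate(perm):
        restored[old_pos] = encoded[new_pos]
    return restored

rng = random.Random(0)
batch = [f"x{i}" for i in range(8)]
shuffled, perm = shuffle_for_key_encoder(batch, rng)
# Stand-in for the per-GPU key encoder: it simply tags each sample.
encoded = [f"k({x})" for x in shuffled]
keys = unshuffle(encoded, perm)   # keys[i] again corresponds to batch[i]
```

After unshuffling, each key lines up with its query again, even though it was encoded alongside a different subset of samples.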

3. Ablation & ImageNet Results

3.1. Datasets

  • ImageNet-1M (IN-1M): 1.28 million images in 1000 classes (often called ImageNet-1K).
  • Instagram-1B (IG-1B): Following WSL, this is a dataset of about 1 billion (940M) public images from Instagram.
  • For IN-1M, a mini-batch size of 256 (N in Algorithm 1) is used on 8 GPUs. It takes about 53 hours to train ResNet-50.
  • For IG-1B, a mini-batch size of 1024 is used on 64 GPUs. It takes about 6 days to train ResNet-50.
  • Linear Classification Protocol: Following a common protocol, the features are frozen and a supervised linear classifier is trained on the global average pooling features of a ResNet for 100 epochs. 1-crop, top-1 classification accuracy on the ImageNet validation set is reported.

3.2. Ablation: Contrastive Loss Mechanisms

Comparison of three contrastive loss mechanisms under the ImageNet linear classification protocol
  • Overall, all three mechanisms benefit from a larger K.
  • The end-to-end mechanism performs similarly to MoCo when K is small. But the largest mini-batch a high-end machine (8 Volta 32GB GPUs) can afford is 1024 for the end-to-end mechanism.
  • The memory bank mechanism in Instance Discrimination can support a larger dictionary size. But it is 2.6% worse than MoCo.

3.3. Ablation: Momentum

Study of Momentum m
  • It performs reasonably well when m is in 0.99~0.9999, showing that a slowly progressing (i.e., relatively large momentum) key encoder is beneficial.
  • When m is too small (e.g., 0.9), the accuracy drops considerably.
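One way to read these momentum values (an interpretation added here, not a claim from the paper) is through the rough averaging horizon of an exponential moving average, about 1/(1−m) steps. The helper name `ema_horizon` is hypothetical.

```python
def ema_horizon(m):
    """Approximate number of steps an EMA with momentum m averages over."""
    return round(1.0 / (1.0 - m))

# Horizons for the momentum values discussed in the ablation.
horizons = {m: ema_horizon(m) for m in (0.9, 0.99, 0.999, 0.9999)}
```

Under this reading, m=0.9 lets the key encoder follow the query encoder within about 10 iterations, whereas the default m=0.999 spreads the change over about 1000 iterations.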

3.4. SOTA Comparison

Comparison under the linear classification protocol on ImageNet
  • Besides ResNet-50 (R50) [33], its 2× and 4× wider (more channels) variants are also tested.
  • K=65536 and m=0.999.
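A quick arithmetic check, using only numbers quoted in this review (K=65536, the IN-1M mini-batch size of 256, and roughly 1.28 million images), shows how many mini-batches the queue spans. The variable names are illustrative.

```python
K = 65536            # dictionary (queue) size
N = 256              # IN-1M mini-batch size
images = 1_280_000   # approximate IN-1M size (1.28 million images)

batches_in_queue = K // N                # mini-batches held in the queue
iters_per_epoch = images // N            # iterations per epoch
fraction_of_epoch = batches_in_queue / iters_per_epoch
```

So the oldest key in the queue is about 256 iterations old, roughly 5% of an epoch, compared with up to a full epoch for a memory bank.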

MoCo with R50 performs competitively and achieves 60.6% accuracy, better than all competitors of similar model sizes (24M).

MoCo benefits from larger models and achieves 68.6% accuracy with R50×4, outperforming methods such as Exemplar-CNN, Relative Context Prediction, Jigsaw Puzzles, RotNet/Image Rotations, Colorization, DeepCluster, Instance Discrimination, LocalAgg, CPCv1, CPCv2, and CMC.

  • “MoCo v2” [8], an extension of a preliminary version of this manuscript, achieves 71.1% accuracy with R50 (up from 60.6%), given small changes on the data augmentation and output projection head [7].
  • (Hope I can review MoCo v2 in the future.)

4. Transferring Features Results

  • Features produced by unsupervised pre-training can have different distributions compared with ImageNet supervised pre-training.
  • Feature normalization is adopted during fine-tuning: BN is trained during fine-tuning rather than frozen, and BN is also used in the newly initialized layers (e.g., FPN).

4.1. PASCAL VOC Object Detection

  • The detector is Faster R-CNN with a backbone of R50-dilated-C5 or R50-C4. All layers are fine-tuned end-to-end.
Object detection fine-tuned on PASCAL VOC trainval07+12. In the brackets are the gaps to the ImageNet supervised pre-training counterpart.
  • (a): For R50-dilated-C5, MoCo pre-trained on IN-1M is comparable to the supervised pre-training counterpart, and MoCo pretrained on IG-1B surpasses it.
  • (b): For R50-C4, MoCo with IN-1M or IG-1B is better than the supervised counterpart: up to +0.9 AP50, +3.7 AP, and +4.9 AP75.
Comparison with previous methods on object detection fine-tuned on PASCAL VOC trainval2007
  • For the AP50 metric, no previous method can catch up with its respective supervised pre-training counterpart.

MoCo pre-trained on any of IN-1M, IN-14M (full ImageNet), YFCC-100M [55], and IG-1B can outperform the supervised baseline.

4.2. COCO Object Detection and Segmentation

Object detection and instance segmentation fine-tuned on COCO
  • Mask R-CNN [32] with the FPN [41] or C4 backbone is used, with BN tuned.
  • With the 1× schedule, all models (including the ImageNet supervised counterparts) are heavily under-trained, as indicated by the roughly 2-point gaps to the 2× schedule cases.

With the 2× schedule, MoCo is better than its ImageNet supervised counterpart in all metrics in both backbones.

4.3. More Downstream Tasks

MoCo vs. ImageNet supervised pre-training, finetuned on various tasks
  • COCO keypoint detection: Supervised pre-training has no clear advantage over random initialization, whereas MoCo outperforms both in all metrics.
  • COCO dense pose estimation: MoCo substantially outperforms supervised pre-training, e.g., by 3.7 points in APdp75, in this highly localization-sensitive task.
  • LVIS v0.5 instance segmentation: MoCo with IG-1B surpasses supervised pre-training in all metrics.
  • Cityscapes instance segmentation: MoCo with IG-1B is on par with its supervised pre-training counterpart in APmk, and is higher in APmk50.
  • Semantic segmentation: On Cityscapes, MoCo outperforms its supervised pre-training counterpart by up to 0.9 point. But on VOC semantic segmentation, MoCo is worse by at least 0.8 point, a negative case the authors observed.

In sum, MoCo can outperform its ImageNet supervised pre-training counterpart in 7 detection or segmentation tasks.

  • Remarkably, in all these tasks, MoCo pre-trained on IG-1B is consistently better than MoCo pre-trained on IN-1M. This shows that MoCo can perform well on this large-scale, relatively uncurated dataset. This represents a scenario towards real-world unsupervised learning.
  • (There are still many results in the appendix; please feel free to read the paper if interested.)

There are MoCo v2 and MoCo v3, hope I can review them in the coming future. :)


