Review — CPCv2: Data-Efficient Image Recognition with Contrastive Predictive Coding

Improves CPCv1, Outperforms MoCo, Instance Discrimination, RotNet/Image Rotations, DeepCluster, etc.

Sik-Ho Tsang
Feb 12, 2022
CPCv2 Framework Overview

Data-Efficient Image Recognition with Contrastive Predictive Coding
CPCv2, by DeepMind and University of California
2020 ICML, Over 600 Citations (Sik-Ho Tsang @ Medium)
Contrastive Learning, Self-Supervised Learning, Image Classification, Object Detection

  • Contrastive Predictive Coding (CPC/CPCv1) is revisited and improved, obtaining SOTA classification accuracy on ImageNet.

Outline

  1. CPC Self-Supervised Pre-Training
  2. Evaluation Protocol
  3. Experimental Results

1. CPC Self-Supervised Pre-Training

CPC Self-Supervised Pre-Training
  • Each input image is first divided into a grid of overlapping patches xi,j. Each patch is encoded with a neural network fθ into a single feature vector zi,j = fθ(xi,j).
  • To make predictions, a masked convolutional network gφ is then applied to the grid of feature vectors. The masks are such that the receptive field of each resulting context vector ci,j only includes feature vectors that lie above it in the image.
  • The prediction task then consists of predicting ‘future’ feature vectors zi+k,j from current context vectors ci,j, where k>0.
  • The predictions are made linearly: given a context vector ci,j, a prediction length k>0, and a prediction matrix Wk, the predicted feature vector is ẑi+k,j = Wk ci,j.

The goal is to correctly recognize the target zi+k,j among a set of randomly sampled feature vectors {zl} from the dataset.

  • The CPC objective, which is based on a softmax over the candidate feature vectors, is: LCPC = −Σi,j,k log [ exp(ẑi+k,jᵀ zi+k,j) / ( exp(ẑi+k,jᵀ zi+k,j) + Σl exp(ẑi+k,jᵀ zl) ) ]
  • The negative samples {zl} are taken from other locations in the image and from other images in the mini-batch. This loss is called InfoNCE, as it is inspired by Noise-Contrastive Estimation (NCE).
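
To make the objective concrete, below is a minimal PyTorch-style sketch of the patch-level prediction and the InfoNCE loss. It assumes the feature grid z (from fθ) and the context grid c (from gφ) have already been computed; the tensor shapes and the cpc_loss helper are illustrative stand-ins, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cpc_loss(z, c, W_k, k):
    """z, c: [B, H, W, D] feature / context grids; W_k: [D, D] linear predictor for offset k."""
    B, H, W, D = z.shape
    c_valid = c[:, : H - k]                  # contexts that have a target k rows below
    z_target = z[:, k:]                      # the corresponding 'future' feature vectors
    z_hat = c_valid.reshape(-1, D) @ W_k     # linear predictions: z_hat_{i+k,j} = W_k c_{i,j}
    targets = z_target.reshape(-1, D)

    # Every other target vector in the batch serves as a negative sample
    # (other locations in the same image and other images in the mini-batch).
    logits = z_hat @ targets.t()             # [N, N] dot-product scores
    labels = torch.arange(logits.size(0))    # the positive sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: in CPCv2 this loss is summed over several offsets k and over
# four spatial directions (with separate context networks and predictors).
B, H, W, D = 2, 7, 7, 64
z, c = torch.randn(B, H, W, D), torch.randn(B, H, W, D)
W_k = torch.randn(D, D, requires_grad=True)
loss = cpc_loss(z, c, W_k, k=2)
```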

2. Evaluation Protocol

  • Having trained an encoder network fθ, a context network gφ, and a set of linear predictors {Wk} using the CPC objective, only the encoder is kept: it is used to form a representation z = fθ(x) of new observations x, and the rest is discarded.
  • While pre-training required that the encoder be applied to patches, for downstream recognition tasks we can apply it directly to the entire image.
  • Given a dataset of N unlabeled images Du={xn}, and a (potentially much smaller) dataset of M labeled images Dl={(xm, ym)}:
  • In all cases, the dataset of unlabeled images Du is the full ImageNet ILSVRC 2012 training set.
  • There are different evaluation protocols as follows:

2.1. Linear Classification

Linear Classification
  • The classification network is restricted to mean pooling followed by a single linear layer, and the parameters of fθ are kept fixed.
  • The labeled dataset Dl is ImageNet. The supervised loss LSup is standard cross-entropy.
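
A minimal sketch of this linear-evaluation protocol, assuming a pre-trained encoder module is available (encoder, feature_dim and the LinearProbe class below are hypothetical placeholders, not the paper's code):

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen encoder + mean pooling + one linear layer, trained with cross-entropy."""
    def __init__(self, encoder, feature_dim, num_classes=1000):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep the pre-trained encoder fixed
            p.requires_grad = False
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)      # [B, D, H', W'] spatial feature map
        feats = feats.mean(dim=(2, 3))        # mean pooling over spatial positions
        return self.fc(feats)                 # logits for the supervised loss LSup
```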

2.2. Efficient Classification

Efficient Classification
  • The classification network is an arbitrary deep neural network (an 11-block ResNet with 4096-dimensional feature maps and 1024-dimensional bottleneck layers). The whole network is fine-tuned.
  • The labeled dataset Dl is a random subset (1%, 2%, 5%, 10%, 20%, 50% and 100%) of the ImageNet dataset.
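
A minimal sketch of how the labeled subset and end-to-end fine-tuning could be set up in PyTorch; dataset, encoder and classifier_head are hypothetical placeholders, not the paper's code:

```python
import torch
from torch.utils.data import Subset

def labeled_subset(dataset, fraction=0.01, seed=0):
    """Sample a random fraction (e.g. 1%) of a labeled dataset."""
    g = torch.Generator().manual_seed(seed)
    n = int(len(dataset) * fraction)
    idx = torch.randperm(len(dataset), generator=g)[:n]
    return Subset(dataset, idx.tolist())

# Unlike the linear probe, the whole network (encoder + classifier head) is
# fine-tuned end to end on this subset, e.g.:
# params = list(encoder.parameters()) + list(classifier_head.parameters())
# optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9)
```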

2.3. Transfer Learning

Transfer Learning
  • Object detection on the PASCAL VOC 2007 dataset is used.
  • The classification network and supervised loss LSup are the Faster R-CNN architecture and loss. In addition to color-dropping, the scale augmentation from Context Prediction is used during training. The whole network is fine-tuned.
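
A minimal sketch of this transfer setup using torchvision's Faster R-CNN, with a toy backbone standing in for the pre-trained CPC encoder; the VOC data loading, color-dropping and scale augmentation are omitted:

```python
import torch.nn as nn
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Toy single-level backbone standing in for the pre-trained CPC encoder;
# FasterRCNN only requires the backbone to expose an `out_channels` attribute.
backbone = nn.Sequential(
    nn.Conv2d(3, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
backbone.out_channels = 256

# One feature level, so one tuple of anchor sizes and one pooled feature map.
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

# 20 PASCAL VOC classes + 1 background class; the whole detector is fine-tuned.
detector = FasterRCNN(backbone, num_classes=21,
                      rpn_anchor_generator=anchor_generator,
                      box_roi_pool=roi_pooler)
```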

2.4. Supervised Training as Baseline

Supervised Training as Baseline
  • Supervised training is used as baseline.

3. Experimental Results

3.1. From CPCv1 to CPCv2

Linear classification performance of new variants of CPC (Top-1 Acc) (MC: model capacity. BU: bottom-up spatial predictions. LN: layer normalization. RC: random color-dropping. HP: horizontal spatial predictions. LP: larger patches. PA: further patch-based augmentation.)
  • The original CPC model used only the first 3 stacks of a ResNet-101. Model capacity (MC) is now increased by enlarging the third residual stack of the ResNet-101 (which originally contains 23 blocks); the resulting network is called ResNet-161. This new architecture delivers better performance without any further modifications (+5% accuracy).
  • The model’s expressivity is increased by increasing the size of its receptive field with larger patches (from 64×64 to 80×80 pixels; +2% accuracy).
  • Layer normalization (LN) is used instead of batch normalization (BN) (+2% accuracy).
  • Prediction lengths and directions: CPCv1 predicts each patch using only context from above; CPCv2 repeatedly predicts the same patch using context from below, the right, and the left (using separate context networks), resulting in up to four times as many prediction tasks. Additional prediction tasks incrementally increase accuracy (adding bottom-up predictions: +2% accuracy; using all four spatial directions: +2.5% accuracy).
  • Patch-based augmentation: randomly dropping two of the three color channels in each patch gives +3% accuracy; adding shearing, rotation, random elastic deformations and color transforms gives +4.5% in total (see the sketch below).
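
As referenced above, a minimal sketch of the color-dropping augmentation: the function below is an assumed implementation of the idea (randomly keep one of the three color channels and zero the other two), not the authors' code.

```python
import torch

def random_color_drop(patch):
    """patch: [3, H, W] tensor; zero out two of the three color channels at random."""
    keep = torch.randint(0, 3, (1,)).item()   # index of the channel to keep
    out = torch.zeros_like(patch)
    out[keep] = patch[keep]
    return out

# Example: augment an 80x80 patch (CPCv2's patch size) before encoding it.
patch = torch.rand(3, 80, 80)
augmented = random_color_drop(patch)
```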

3.2. Linear Classification

Linear classification accuracy, and comparison to other self-supervised methods
  • With the above enhancements, CPCv2 sets a new state of the art in linear classification: 71.5% Top-1 accuracy, compared to 48.7% for the original CPCv1.
  • A ResNet-50 is also trained with the CPCv2 objective, arriving at 63.8% linear classification accuracy. This model outperforms methods which use the same architecture, as well as many recent approaches which at times use substantially larger ones.

3.3. Efficient Classification

Data-efficient image recognition with Contrastive Predictive Coding
Data-efficient image classification
  • The accuracy of the best supervised model drops to 44.1% Top-5 when trained on 1% of the dataset (compared to 95.2% when trained on the entire dataset).
  • Surprisingly, when given the entire dataset, the classifier trained on top of CPCv2 features reaches 83.4%/96.5% Top-1/Top-5 accuracy, surpassing the supervised baseline.
  • With AutoAugment: 80.0%/95.0%.

With only 50% of the labels CPCv2 surpasses the supervised baseline given the entire dataset, representing a 2× gain in data-efficiency (blue boxes).

Similarly, with only 1% of the labels, CPCv2 surpasses the supervised baseline given 5% of the labels (i.e. a 5× gain in data-efficiency, red boxes).

3.4. Other Unsupervised Representations

Comparison to other methods for semi-supervised learning

CPCv2 provides gains in data-efficiency that were previously unseen from representation learning methods, and rivals the performance of more elaborate label-propagation algorithms.

3.5. Transfer Learning: Object Detection on PASCAL VOC 2007

Comparison of PASCAL VOC 2007 object detection accuracy to other transfer methods
  • This dataset also tests the efficiency of the representation as it only contains 5011 labeled images to train from.
  • The standard protocol in this setting is to train an ImageNet classifier in a supervised manner, and use it as a feature extractor for a Faster R-CNN object detection architecture.
  • Following this procedure, 74.7% mAP is obtained with a ResNet-152.

In contrast, if the CPCv2 encoder is used as a feature extractor in the same setup, 76.6% mAP is obtained.

