Review: Unsupervised Embedding Learning via Invariant and Spreading Instance Feature

An instance-based softmax embedding method that directly optimizes the ‘real’ instance features on top of the softmax function

Sik-Ho Tsang
5 min read · Mar 6, 2022


The features of the same instance under different data augmentations should be invariant, while features of different image instances should be separated.

Unsupervised Embedding Learning via Invariant and Spreading Instance Feature, Ye CVPR’19, by Hong Kong Baptist University, and Columbia University
2019 CVPR, Over 200 Citations
(Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Unsupervised Learning, Contrastive Learning, Representation Learning, Image Classification

  • A novel instance-based softmax embedding method is proposed, which directly optimizes the ‘real’ instance features on top of the softmax function.
  • It learns significantly faster and achieves higher accuracy than all the competing methods.


  1. Instance-wise Softmax Embedding
  2. Experimental Results

1. Instance-wise Softmax Embedding

The framework of the proposed unsupervised learning method with Siamese network
  • For each iteration, m instances {x1, x2, x3, …} are randomly sampled.
  • For each instance, a random data augmentation operation T(·) is applied to slightly modify the original image. The augmented sample T(xi) is denoted by ˆxi, and its embedding feature fθ(ˆxi) is denoted by ˆfi.
  • The probability of ˆxi being recognized as instance i is defined by:
  • The above equation can be rewritten as:
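The two equations referred to in the bullets above were images in the original post. What follows is a reconstruction from the surrounding definitions, assuming a softmax over the m features in the batch with temperature τ, where f_k is the feature of the un-augmented instance k. It is worth checking against the paper:

```latex
% Probability that the augmented sample \hat{x}_i is recognized as instance i
P(i \mid \hat{x}_i) = \frac{\exp(f_i^{\top}\hat{f}_i/\tau)}{\sum_{k=1}^{m}\exp(f_k^{\top}\hat{f}_i/\tau)}

% Rewritten by isolating the k = i term in the denominator
P(i \mid \hat{x}_i) = \frac{\exp(f_i^{\top}\hat{f}_i/\tau)}{\exp(f_i^{\top}\hat{f}_i/\tau) + \sum_{k \neq i}\exp(f_k^{\top}\hat{f}_i/\tau)}
```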

Maximizing exp(fiᵀˆfi/τ) requires increasing the inner product (cosine similarity) between fi and ˆfi, resulting in a feature that is invariant to data augmentation.

  • On the other hand, the probability of a different instance xj (j ≠ i) being recognized as instance i is defined by:
  • Similarly, the above equation can be rewritten as:
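These two equations were also images in the original post. The following is a reconstruction under the same assumptions as before (softmax over the m batch features, temperature τ). The split of the denominator shown here isolates the k = j term, which is one valid way to rewrite it:

```latex
% Probability that a different instance x_j is recognized as instance i
P(i \mid x_j) = \frac{\exp(f_i^{\top} f_j/\tau)}{\sum_{k=1}^{m}\exp(f_k^{\top} f_j/\tau)}, \quad j \neq i

% Rewritten by isolating the k = j term in the denominator
P(i \mid x_j) = \frac{\exp(f_i^{\top} f_j/\tau)}{\exp(f_j^{\top} f_j/\tau) + \sum_{k \neq j}\exp(f_k^{\top} f_j/\tau)}
```

With ℓ2-normalized features, f_j^{\top} f_j = 1, so the isolated term is the constant \exp(1/\tau). Lowering P(i | x_j) therefore comes down to lowering the numerator.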

Minimizing exp(fiᵀfj/τ) aims at separating fj from fi, which further enhances the spread-out property.

  • Correspondingly, the probability of xj not being recognized as instance i is 1−P(i|xj).
  • The negative log likelihood is given by:
  • Thus, the sum of the negative log likelihood over all the instances within the batch is minimized:
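The loss equations were images in the original post. Following the bullets above, the per-instance loss reads Ji = −log P(i|ˆxi) − Σ_{j≠i} log(1 − P(i|xj)), and the batch loss is J = Σ_i Ji. Below is a minimal sketch of that sum in plain Python, kept dependency-free for readability rather than speed. The names `batch_loss`, `feats`, and `aug_feats` are illustrative and do not come from the authors' code, and the features are assumed to be ℓ2-normalized already.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def batch_loss(feats, aug_feats, tau=0.1):
    """Sum over i of -log P(i|x_hat_i) - sum_{j != i} log(1 - P(i|x_j)).

    feats[i] is the feature of instance i, aug_feats[i] the feature of its
    augmented view. Both are assumed to be unit-length vectors.
    """
    m = len(feats)
    total = 0.0
    for i in range(m):
        # Invariance term: the augmented view of i should be recognized as i.
        denom = sum(math.exp(dot(feats[k], aug_feats[i]) / tau) for k in range(m))
        p_pos = math.exp(dot(feats[i], aug_feats[i]) / tau) / denom
        total -= math.log(p_pos)

        # Spread-out term: every other instance j should NOT be recognized as i.
        for j in range(m):
            if j == i:
                continue
            denom_j = sum(math.exp(dot(feats[k], feats[j]) / tau) for k in range(m))
            p_neg = math.exp(dot(feats[i], feats[j]) / tau) / denom_j
            total -= math.log(1.0 - p_neg)
    return total


# Toy batch of three unit vectors (hypothetical data).
feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = batch_loss(feats, feats)                    # augmented views match
loss_shuffled = batch_loss(feats, feats[1:] + feats[:1])   # augmented views mismatched
print(loss_aligned, loss_shuffled)
```

In the toy batch, the mismatched case gives the larger loss because each augmented view points at the wrong instance, which is exactly what the invariance term penalizes.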

2. Experimental Results

2.1. Training

  • The first setting is that the training and testing sets share the same categories (seen testing category). This protocol is widely adopted for general unsupervised feature learning.
  • The second setting is that the training and testing sets do not share any common categories (unseen testing category).

2.2. Experiments on Seen Testing Categories

  • ResNet-18 is used.
  • Feature Embedding dimension is 128.
  • The training batch size is set to 128 for all competing methods on both datasets.
  • Four kinds of data augmentation are used: RandomResizedCrop, RandomGrayscale, ColorJitter, and RandomHorizontalFlip.
kNN accuracy (%) on CIFAR-10 dataset
  • The proposed method achieves the best performance (83.6%) with a kNN classifier.

Compared to NPSoftmax [46] and NCE [46] in Instance Discrimination [46], which use memorized features for optimization, the proposed method outperforms them by 2.8% and 3.2%, respectively.
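To make the kNN evaluation concrete, here is a minimal sketch of classifying test embeddings by their nearest training embeddings under cosine similarity. The exact protocol (value of k, vote weighting) is not given in this review, so the sketch assumes an unweighted majority vote with a hypothetical k = 3. The function names are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_feats, train_labels, k=3):
    """Label a query by majority vote among its k most similar training features."""
    ranked = sorted(
        range(len(train_feats)),
        key=lambda idx: cosine(query, train_feats[idx]),
        reverse=True,
    )
    votes = Counter(train_labels[idx] for idx in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_feats, test_labels, train_feats, train_labels, k=3):
    """Percentage of test features whose kNN prediction matches the label."""
    correct = sum(
        knn_predict(q, train_feats, train_labels, k) == y
        for q, y in zip(test_feats, test_labels)
    )
    return 100.0 * correct / len(test_labels)


# Hypothetical 2-D embeddings for two classes.
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["a", "a", "b", "b"]
print(knn_accuracy([[1.0, 0.05], [0.05, 1.0]], ["a", "b"], train_feats, train_labels))
```

No labels are used during training. They are only needed here, to score how well the learned embedding groups same-class images.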

Evaluation of the training efficiency on CIFAR-10 dataset. kNN accuracy (%) at each epoch is reported

The proposed method learns much faster than the competitors.

Classification accuracy (%) with linear classifier and kNN classifier on STL-10 dataset
  • When only using 5K training images for learning, the proposed method achieves the best accuracy with both classifiers (kNN: 74.1%, Linear: 69.5%).

When the full 105K training images are used, kNN accuracy increases from 74.1% to 81.6%. The classification accuracy with the linear classifier also increases from 69.5% to 77.9%.

2.3. Experiments on Unseen Testing Categories

  • The pre-trained GoogLeNet / Inception-v1 on ImageNet is used.
  • A 128-dim fully connected layer with ℓ2 normalization is added after the pool5 layer as the feature embedding layer.
  • All the input images are first resized to 256×256. For data augmentation, the images are randomly cropped to 227×227 with random horizontal flipping.
  • The temperature parameter τ is set to 0.1. The training batch size is set to 64.
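Because the embedding layer ends in ℓ2 normalization, the inner product between two features equals their cosine similarity and lies in [−1, 1]. Dividing by τ = 0.1 stretches that range to [−10, 10] before the softmax. A small sketch of both steps follows. The helper names and example vectors are illustrative, not taken from the paper.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


tau = 0.1  # temperature used in the unseen-category experiments

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

similarity = dot(a, b)   # cosine similarity, since both vectors have unit length
logit = similarity / tau  # value fed to the softmax

print(similarity, logit)
```

A smaller τ sharpens the softmax, so small differences in cosine similarity turn into large differences in probability.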
Results (%) on CUB200 dataset
Results (%) on Car196 dataset
  • Generally, the instance-wise feature learning methods (NCE [46], Exemplar [8], Proposed) outperform non-instance-wise feature learning methods (DeepCluster [3], MOM [21]), especially on Car196 and Product datasets.

This indicates instance-wise feature learning methods have good generalization ability on unseen testing categories.

Results (%) on Product dataset using network without pre-trained parameters
  • ResNet-18 without pre-training is used.

The proposed method is also a clear winner.

4NN retrieval results of some example queries on CUB200-2011 dataset
  • Green: Correct; Red: Incorrect.

Although there are some wrongly retrieved samples from other categories, most of the top retrieved samples are visually similar to the query.

2.4. Ablation Study

Effects of each data augmentation operation on CIFAR-10 dataset

RandomResizedCrop contributes the most.

Different sampling strategies on CIFAR-10 dataset.

Without data augmentation (DA), performance drops from 83.6% to 37.4%.

2.5. Understanding of the Learned Embedding

The cosine similarity distributions on CIFAR-10

The proposed method separates positive and negative samples best.

The cosine similarity distributions of randomly initialized network (left column) and our learned model (right column) with different attributes on CIFAR-10

The proposed method also separates samples by other attributes well.

This paper and Instance Discrimination provided a prototype for the contrastive learning framework. MoCo, PIRL, SimCLR, MoCo v2, and others followed.


