Review — Unsupervised Learning of Visual Representations using Videos

Contrastive Learning Using Unsupervised Tracking in Videos

Sik-Ho Tsang
7 min read · Jan 4, 2022
Unsupervised Tracking in Videos

Unsupervised Learning of Visual Representations using Videos
Wang ICCV’15, by Robotics Institute, Carnegie Mellon University
2015 ICCV, Over 900 Citations (Sik-Ho Tsang @ Medium)
Contrastive Learning, Unsupervised Learning, Object Detection

The authors open the paper by asking:

  • Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a Convolutional Neural Network (CNN)?

In this paper:

  • Hundreds of thousands of unlabeled videos are crawled from the web to learn visual representations.
  • Two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part.
  • A Siamese-triplet network using AlexNet with a ranking loss function is used to train this CNN representation.


  1. Approach Overview
  2. Patch Mining in Videos
  3. Siamese Triplet Network
  4. Model Fine-Tuning
  5. Experimental Results

1. Approach Overview

Approach Overview
  • (a): Given unlabeled videos, unsupervised tracking is performed on patches within them.
  • (b): Triplets of patches are fed into the Siamese-triplet network for training: the query patch from the first frame of the track, the tracked patch from the last frame, and a random patch from another video.
  • (c): The learning objective: the distance between the query and tracked patches in feature space should be smaller than the distance between the query and random patches.

2. Patch Mining in Videos

Patch Mining in Videos
  • Before training, the patches must be mined. A two-step approach is used.
  • In the first step, SURF [1] interest points are obtained, and Improved Dense Trajectories (IDT) [50] is used to obtain the motion of each SURF point. Because IDT applies homography estimation (video stabilization), it reduces the problems caused by camera motion.
  • Given the trajectories of the SURF interest points, a point is classified as moving if its flow magnitude is more than 0.5 pixels.
  • Frames are rejected if (a) very few (< 25%) SURF interest points are classified as moving, since the motion might just be noise; or (b) the majority (> 75%) of SURF interest points are classified as moving, since this corresponds to camera motion.
  • In the second step, a sliding window is used within the frame to find the bounding box that contains the most moving SURF interest points. The size of the bounding box is set as h×w (227×227 in a frame of 448×600).
  • Given the initial bounding box, the KCF tracker [19] is used to track it. After tracking for 30 frames in the video, the second patch is obtained.

This patch acts as the similar patch to the query patch in the triplet.
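The frame filter and the sliding-window search described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function names, the `(x, y, flow_magnitude)` point format, and the window stride of 8 pixels are assumptions made for the example. The thresholds (0.5 pixels, 25%, 75%) and the sizes (227×227 box, 448×600 frame) come from the text.

```python
from typing import List, Optional, Tuple

# Hypothetical point format for this sketch: (x, y, flow_magnitude_in_pixels).
Point = Tuple[float, float, float]


def moving_points(points: List[Point], flow_thresh: float = 0.5) -> List[Tuple[float, float]]:
    """Keep the (x, y) of points whose flow magnitude exceeds the threshold."""
    return [(x, y) for x, y, mag in points if mag > flow_thresh]


def keep_frame(points: List[Point], low: float = 0.25, high: float = 0.75) -> bool:
    """Reject frames with too few (< 25%) or too many (> 75%) moving points."""
    if not points:
        return False
    ratio = len(moving_points(points)) / len(points)
    return low <= ratio <= high


def best_box(points: List[Point], frame_h: int = 448, frame_w: int = 600,
             box: int = 227, stride: int = 8) -> Tuple[Optional[Tuple[int, int]], int]:
    """Slide a box over the frame; return (top, left) of the window that
    covers the most moving points, together with that count.
    The stride is an assumption of this sketch."""
    moving = moving_points(points)
    best, best_count = None, -1
    for top in range(0, frame_h - box + 1, stride):
        for left in range(0, frame_w - box + 1, stride):
            count = sum(1 for x, y in moving
                        if left <= x < left + box and top <= y < top + box)
            if count > best_count:
                best, best_count = (top, left), count
    return best, best_count
```

A frame would first pass through `keep_frame`, and only then would `best_box` supply the initial bounding box that is handed to the tracker.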

Examples of patch pairs we obtain via patch mining in the videos

Finally, millions of pairs are generated, as shown above.

  • 100K videos are downloaded from YouTube using the URLs provided by [30]. 8 million image patches are obtained.

Three different networks are trained separately using 1.5M, 5M and 8M training samples.

3. Siamese Triplet Network

Siamese Triplet Network

3.1. Network Architecture

  • A Siamese-triplet network consists of three base networks which share the same parameters.
  • The input image size is 227×227.
  • The base network uses the AlexNet convolutional layers. Two fully connected layers, with 4096 and 1024 neurons respectively, are then stacked on the pool5 outputs.

Thus the final output of each single network is a 1024-dimensional feature f(·).
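To get a feel for the size of the two fully connected layers, here is a small arithmetic sketch. It assumes the standard AlexNet pool5 output for a 227×227 input (256 channels of 6×6), which is not stated in the text above; the 4096 and 1024 layer widths are from the text. The helper names are made up for this example.

```python
# Assumption: standard AlexNet pool5 output for a 227x227 input is 256 x 6 x 6.
POOL5_DIM = 256 * 6 * 6

# pool5 -> 4096 -> 1024, as described in the text.
HEAD_DIMS = [POOL5_DIM, 4096, 1024]


def fc_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def head_params(dims) -> int:
    """Total parameters of a stack of fully connected layers."""
    return sum(fc_params(a, b) for a, b in zip(dims, dims[1:]))


# The three branches share parameters, so this is counted once, not three times.
shared_head_params = head_params(HEAD_DIMS)
print(shared_head_params)
```

Because the three branches share weights, the triplet setup does not increase the parameter count compared with a single base network.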

3.2. Ranking Loss Function

Top response regions for the pool5 neurons of our unsupervised-CNN
  • Given an image X as input to the network, its feature in the final layer is f(X). The distance between two image patches X1 and X2 is then defined as the cosine distance in feature space:

$$D(X_1, X_2) = 1 - \frac{f(X_1) \cdot f(X_2)}{\|f(X_1)\| \, \|f(X_2)\|}$$

The distance between the query patch and the tracked patch is encouraged to be small, while the distance between the query patch and random patches is encouraged to be larger.

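  • Formally, Xi is the original query patch (the first patch in the tracked frames), X+i is the tracked patch, and X-i is a random patch from a different video. The goal is to enforce:

$$D(X_i, X_i^-) > D(X_i, X_i^+)$$

  • A hinge loss is used:

$$L(X_i, X_i^+, X_i^-) = \max\{0,\ D(X_i, X_i^+) - D(X_i, X_i^-) + M\}$$

  • where M=0.5 is the gap (margin) parameter between the two distances.
  • The objective function is:

$$\min_W \ \frac{\lambda}{2}\|W\|_2^2 + \sum_{i=1}^{N} \max\{0,\ D(X_i, X_i^+) - D(X_i, X_i^-) + M\}$$

  • where W denotes the network parameters and N is the number of triplets of samples. λ is a constant representing weight decay, which is set to λ=0.0005.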
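The ranking loss can be written out in plain Python to make the roles of M and λ concrete. This is a minimal sketch that operates on already-computed feature vectors (plain lists of floats) rather than on a real network; the function names and the flat `weights` list are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def cosine_distance(a: Vec, b: Vec) -> float:
    """1 minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_hinge_loss(query: Vec, positive: Vec, negative: Vec,
                       margin: float = 0.5) -> float:
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, cosine_distance(query, positive)
               - cosine_distance(query, negative) + margin)


def objective(triplets: List[Tuple[Vec, Vec, Vec]], weights: Vec,
              margin: float = 0.5, weight_decay: float = 0.0005) -> float:
    """Weight-decay term plus the summed hinge loss over all triplets."""
    reg = 0.5 * weight_decay * sum(w * w for w in weights)
    return reg + sum(triplet_hinge_loss(q, p, n, margin) for q, p, n in triplets)
```

A triplet whose negative is already far from the query contributes nothing to the sum, which is why the choice of negatives matters for training.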

3.3. Hard Negative Mining for Triplet Sampling

  • For negative sample selection, the authors first select negative patches randomly and then mine hard examples.
  • After 10 epochs of training using negative data selected randomly, the negative patch is selected such that the loss is maximum.
  • Specifically, for each pair {Xi, X+i}, the loss of all other negative patches in batch B=100 is calculated, and the top K=4 negative patches with highest losses are selected.
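The top-K selection can be sketched as follows. This is an illustrative sketch on precomputed feature vectors, not the authors' implementation: the function names are hypothetical, and `candidates` stands in for the other patches in the batch. K=4 and M=0.5 are from the text.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def cosine_distance(a: Vec, b: Vec) -> float:
    """1 minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a))
                        * math.sqrt(sum(y * y for y in b)))


def hardest_negatives(query: Vec, positive: Vec, candidates: List[Vec],
                      k: int = 4, margin: float = 0.5) -> List[int]:
    """Return indices of the k candidate negatives with the highest hinge loss."""
    d_pos = cosine_distance(query, positive)
    scored = [(max(0.0, d_pos - cosine_distance(query, c) + margin), idx)
              for idx, c in enumerate(candidates)]
    # Highest loss first; these are the negatives closest to the query.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```

In effect, the hardest negatives are the random patches that the current network still places close to the query.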

4. Model Fine-Tuning

After fine-tuning on PASCAL VOC 2012, these filters become quite strong

4.1. Straightforward Way

  • One straightforward approach is to use the ranking model directly as a pre-trained network for the target task. The fully connected layers are initialized randomly.

4.2. Iterative Fine-Tuning Scheme

  • Given the initial unsupervised network, it is first fine-tuned using the PASCAL VOC data.
  • The new fine-tuned network is then re-adapted to the ranking triplet task.
  • Here, only the convolutional parameters are transferred for re-adapting.
  • Finally, this re-adapted network is fine-tuned on the VOC data yielding a better trained model.
  • After two iterations of this approach, the network converges.

4.3. Model Ensemble

  • There are billions of videos on YouTube, which opens up the possibility of training multiple CNNs using different sets of data.
  • Once these CNNs are trained, the fc7 features from each of them are appended together to train the final SVM.
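Appending features from several CNNs amounts to concatenating, for each region, the fc7 vectors produced by every model. A minimal sketch, assuming the features are already extracted and stored as nested lists indexed `[model][region]` (this layout and the function name are assumptions for illustration; the SVM training itself is not shown):

```python
from typing import List, Sequence


def concat_ensemble(features_by_model: List[List[Sequence[float]]]) -> List[List[float]]:
    """Concatenate per-region features across models.

    features_by_model[m][r] is the feature vector of region r from model m.
    Returns one longer vector per region, ready to be fed to a classifier.
    """
    n_regions = len(features_by_model[0])
    if any(len(model) != n_regions for model in features_by_model):
        raise ValueError("every model must provide features for the same regions")
    return [[value for model in features_by_model for value in model[r]]
            for r in range(n_regions)]
```

The combined dimensionality is the sum of the individual feature sizes, so the classifier sees all models' views of each region at once.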

5. Experimental Results

mean Average Precision (mAP) on VOC 2012. “external” column shows the number of patches used to pre-train unsupervised-CNN.

5.1. Single Model

  • The detection pipeline introduced in R-CNN is followed: a CNN pre-trained on another dataset is fine-tuned on the VOC data. The fine-tuned CNN is then used to extract features, and an SVM is trained for each object class.
  • As a baseline, the network is trained from scratch on the VOC 2012 dataset and obtains 44% mAP.
  • Using the proposed unsupervised network pre-trained with 1.5M pairs of patches and then fine-tuned on VOC 2012, an mAP of 46.2% is obtained (unsup+ft, external data = 1.5M).
  • With more data, using 5M and 8M patches in pre-training followed by fine-tuning, 47% and 47.5% mAP are achieved, respectively.

A significant boost is observed as compared to the scratch network. More importantly, when more unlabeled data is applied, we can get better performance.

5.2. Model Ensemble

  • By ensembling two fine-tuned networks, pre-trained using 1.5M and 5M patches, a boost of 3.5 percentage points over the single model is obtained, reaching 50.5% mAP (unsup+ft (2 ensemble)).
  • Finally, all three networks, pre-trained with data sets of size 1.5M, 5M and 8M respectively, are ensembled. Another boost is obtained, reaching 52% mAP (unsup+ft (3 ensemble)).

5.3. ImageNet Pretrained Model

  • Note that the ImageNet dataset is labelled data.
  • When an ImageNet pre-trained model is used, 50.1% mAP (RCNN 70K) is obtained.
  • The result of ensembling two of these networks is 53.6% mAP (RCNN 70K (2 ensemble)).
  • If three networks are ensembled, an mAP of 54.4% is obtained.

Diminishing returns are observed with ensembling since the training data remains similar.
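The size of each ensembling step can be checked directly from the numbers quoted in Sections 5.1 to 5.3. The short sketch below only does that arithmetic; the dictionary layout and function name are made up for the example, and the single-model unsupervised entry is taken as the 47% (5M) model, since the text reports the 50.5% ensemble as a 3.5-point boost over it.

```python
# mAP (%) on VOC 2012 as quoted in the text, keyed by number of ensembled networks.
reported_map = {
    "unsup+ft": {1: 47.0, 2: 50.5, 3: 52.0},
    "ImageNet+ft": {1: 50.1, 2: 53.6, 3: 54.4},
}


def ensemble_gains(series: dict) -> list:
    """Gain in percentage points from adding one more network to the ensemble."""
    sizes = sorted(series)
    return [round(series[n] - series[prev], 1) for prev, n in zip(sizes, sizes[1:])]


gains = {name: ensemble_gains(series) for name, series in reported_map.items()}
print(gains)
```

For both initializations, the second network adds more than the third, which matches the diminishing-returns observation above.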

5.4. Iterative Fine-Tuning Scheme

  • The proposed fine-tuned model using 5M patches in pre-training (unsup+ft, external = 5M) is re-adapted to the unsupervised triplet task. After that, the network is fine-tuned on VOC 2012 again. The final result for this single model is 48% mAP (unsup + iterative ft), which is 1 percentage point better than the initial fine-tuned network.

5.5. Unsupervised Network Without Fine-Tuning

  • pool5 features are extracted using the proposed unsupervised-CNN and an SVM is trained on top of them. An mAP of 26.1% is obtained using the proposed unsupervised network (trained with 8M data).
  • The ensemble of two unsupervised networks (trained with 5M and 8M data) gets an mAP of 28.2%.
  • As a comparison, the ImageNet pre-trained network without fine-tuning gets an mAP of 40.4%.
  • The successful implementation opens up a new space for designing unsupervised learning algorithms for CNN training. (There are also results for surface normal estimation; please feel free to read the paper if interested.)


