Brief Review — Representation Learning by Learning to Count

Self-Supervised Learning by Counting the Number of Visual Features

Sik-Ho Tsang
3 min read · Oct 11, 2022


Representation Learning by Learning to Count,
Counting, by University of Bern and University of Maryland
2017 ICCV, Over 300 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification

  • Self-supervised learning is achieved by counting the number of visual features.


  1. Counting
  2. Results

1. Counting

1.1. Conceptual Idea

The number of visual primitives in the whole image should match the sum of the number of visual primitives in each tile
  • To obtain a supervision signal useful to learn to count, an image is partitioned into non-overlapping regions. The number of visual primitives in each region should sum up to the number of primitives in the original image.

It is hypothesized that the model needs to disentangle the image into high-level factors of variation, such that the complex relation between the original image and its regions is translated to a simple arithmetic operation.
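The arithmetic constraint itself is easy to verify on a toy example. The sketch below (not the paper's CNN; just an illustration of the supervision signal) treats an "image" as a grid of primitive labels and checks that the counts over 4 non-overlapping tiles sum to the count over the whole image.

```python
import numpy as np

def count_primitives(region):
    """Toy 'counting' function: number of nonzero primitive labels."""
    return int(np.count_nonzero(region))

def tile(image):
    """Split an image into 4 non-overlapping quadrants (the tiling operator)."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return [image[:h, :w], image[:h, w:], image[h:, :w], image[h:, w:]]

rng = np.random.default_rng(0)
img = rng.integers(0, 2, size=(8, 8))  # random binary primitive map

whole = count_primitives(img)
parts = sum(count_primitives(t) for t in tile(img))
assert whole == parts  # tile counts sum to the whole-image count
```

A learned counting network is trained to satisfy exactly this relation, without ever being told what the primitives are.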

1.2. Contrastive Loss

Training AlexNet to learn to count
  • Let x be the input color image, D the downsampling operator, and T_j the tiling operator that divides x into 4 non-overlapping tiles. The naïve way to train the network is to minimize:

    ℓ(x) = |φ(D∘x) − Σ_{j=1..4} φ(T_j∘x)|²

  • where φ is the CNN that is learnt to count the visual features.
  • However, this loss has a trivial solution: the network can output φ = 0 for every image, making the loss exactly zero without learning anything.

To avoid such a scenario, a contrastive loss is used to enforce that the counting feature should be different between two randomly chosen different images.

  • Therefore, for any pair of different images x ≠ y, we would like to minimize:

    ℓ(x, y) = |φ(D∘x) − Σ_{j=1..4} φ(T_j∘x)|² + max{0, M − |φ(D∘y) − Σ_{j=1..4} φ(T_j∘x)|²}

  • where M = 10 is a constant margin.
  • The contrastive term will introduce a tradeoff that will push features towards counting as many primitives as is needed to differentiate images from each other.
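The loss above can be sketched numerically. This is a minimal NumPy illustration (not the paper's actual training code): it takes precomputed counting feature vectors for the downsampled image, the 4 tiles, and a different image, and combines the counting term with the contrastive margin term.

```python
import numpy as np

M = 10.0  # margin constant M from the paper

def counting_loss(phi_d_x, phi_tiles_x, phi_d_y):
    """Counting loss with contrastive term (toy sketch).

    phi_d_x:     counting feature of the downsampled image x, i.e. phi(D∘x)
    phi_tiles_x: list of 4 counting features, one per tile, i.e. phi(T_j∘x)
    phi_d_y:     counting feature of a downsampled *different* image y
    """
    # Counting term: the whole-image count should equal the sum of tile counts.
    diff_x = phi_d_x - sum(phi_tiles_x)
    l_count = float(diff_x @ diff_x)  # squared L2 norm
    # Contrastive term: push counts of different images at least M apart.
    diff_y = phi_d_y - sum(phi_tiles_x)
    l_contrastive = max(0.0, M - float(diff_y @ diff_y))
    return l_count + l_contrastive
```

If x satisfies the counting constraint exactly and y is already far from x in feature space, both terms vanish; if y collapses onto x, the contrastive term contributes the full margin M.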

1.3. Network Architecture

  • The input is a 114×114 image.
  • AlexNet is used, with a ReLU at the end so that the outputs (counts) are non-negative.
  • The final layer has 1000 outputs, one count per visual feature.
  • The counting network is trained on the 1.3M images of the ImageNet training set.
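The shapes work out because both branches feed the same network. Assuming the original crop is 228×228 (so that a 2× downsampled image and each of the 4 tiles are all 114×114, matching the stated input size), the data pipeline can be sketched as:

```python
import numpy as np

def downsample2x(image):
    """Naive 2x downsampling (the D operator); a real pipeline would filter first."""
    return image[::2, ::2]

def tile4(image):
    """Split into 4 non-overlapping tiles (the T operator)."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return [image[:h, :w], image[:h, w:], image[h:, :w], image[h:, w:]]

crop = np.zeros((228, 228))  # assumed crop size, chosen so both branches yield 114x114
assert downsample2x(crop).shape == (114, 114)
assert all(t.shape == (114, 114) for t in tile4(crop))
```

This way a single shared 114×114-input network produces the whole-image count and all four tile counts.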

2. Results

Evaluation of transfer learning on PASCAL

The proposed method either outperforms previous methods or achieves the second-best performance.

ImageNet classification with a linear classifier
Places classification with a linear classifier

The proposed method achieves a performance comparable to the other state-of-the-art methods on the ImageNet dataset and shows a significant improvement on the Places dataset.


[2017 ICCV] [Counting]
Representation Learning by Learning to Count

1.2. Unsupervised/Self-Supervised Learning

1993 … 2017 [Counting] … 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR] [W-MSE] [SimSiam+AL] [BYOL+LP] 2022 [BEiT] [BEiT V2]
