Brief Review — Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces

TextTopicNet, Train CNN Using Text Topics

Sik-Ho Tsang
4 min readOct 9, 2022
Illustrated Wikipedia articles about specific entities

Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces,
, by UAB, and IIIT Hyderabad
2017 CVPR, Over 80 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Object Detection

  • Visual features are learnt by mapping between topics and images.


  1. TextTopicNet
  2. Results

1. TextTopicNet

TextTopicNet Framework

1.1. Latent Dirichlet Allocation (LDA)

  • LDA algorithm is a generative statistical model of a text corpus to extract the topics.
  • LDA can be represented as a three level hierarchical Bayesian model.

Each topic is characterized by a probability distribution over words.

1.2. Convolutional Neural Network (CNN)

  • Two CNNs are tried.
  • One is the 8 layers CNN CaffeNet, a modified version of AlexNet.
  • The other architecture is a 6 layers CNN resulting from removing the 2 first convolutional layers from CaffeNet.
  • For learning to predict the target topic probability distributions, a sigmoid cross-entropy loss is minimized on the image dataset.

1.3. Pretraining Dataset

  • The ImageCLEF 2010Wikipedia collection consists of 237,434 Wikipedia images and the Wikipedia articles that contain these images.
  • After filtering, the training data is composed of 100,785 images and 35,582 unique articles.
Top-5 most relevant words for 3 of the discovered topics by LDA analysis (left), and top-5 most relevant images for the same topics (right). Overall word frequency is shown in blue, and estimated word frequency within the topic in red

We can observe that the most representative images for each topic present some regularities and thus allow the CNN to learn discriminative features.

2. Results

2.1. Image Classification

One vs. Rest linear SVM validation %mAP on PASCAL VOC2007 by varying number of topics of LDA
  • After pretraining, features are extracted from top layers of the CNN and one vs. rest linear SVMs are trained for image classification in PASCAL VOC2007 dataset.

It is found that the best validation performance is obtained for 40 topics.

PASCAL VOC2007 per-class average precision (AP) scores for the classification task with pool5 features

From the above table, the proposed TextTopicNet using pool5 features are substantially more discriminative than the rest for the most difficult classes, see e.g. “bottle”, “pottedplant” or “cow”.

PASCAL VOC2007 %mAP image classification
  • TextTopicNet (COCO): corresponds to a model trained with MS-COCO images and their ground-truth caption annotations as textual content.

TextTopicNet (Wiki) features, learned from a very noisy signal, perform surprisingly well compared with the ones of the TextTopicNet (COCO) model.

2.2. Multimodal Image Retrieval

MAP comparison on Wikipedia dataset [29] with supervised (bottom) and unsupervised (middle) methods
  • For retrieval, images and documents are projected into the learned topic space and the KL-divergence distance of the query (image or text) is computed with all the entities in the database.

The proposed self-supervised TextTopicNet outperforms unsupervised approaches, and has competitive performance to supervised methods without using any labeled data.

Top 4 nearest neighbors for a given query image image (left-most). Each row makes use of features obtained from different layers of TextTopicNet (without fine tuning). From top to bottom: prob, fc7, fc6, pool5.

The results are semantically close, although not necessarily visually similar.

Top 10 nearest neighbors for a given text query (from left to right: “airplane”, “bird”, and “horse”) in the topic space of TextTopicNet

By leveraging textual semantic information, TextTopicNet learns a polysemic representation of images.

(There are also other results such as object detection results, please feel free to read the paper directly.)

Self-supervised learning is done by utilizing text passage and image in Wikipedia.


[2017 CVPR] [TextTopicNet]
Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces

1.2. Unsupervised/Self-Supervised Learning

19932017 [TextTopicNet] … 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR] [W-MSE] [SimSiam+AL] [BYOL+LP] 2022 [BEiT] [BEiT V2]

My Other Previous Paper Readings



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.