Brief Review — Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces
TextTopicNet: Training a CNN Using Text Topics
Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces (TextTopicNet), by UAB and IIIT Hyderabad
2017 CVPR, Over 80 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Object Detection
- Visual features are learned by training a CNN to map images to the topic distributions of the Wikipedia articles in which they appear.
Outline
- TextTopicNet
- Results
1. TextTopicNet
1.1. Latent Dirichlet Allocation (LDA)
- The LDA algorithm is a generative statistical model of a text corpus, used here to extract latent topics.
- LDA can be represented as a three-level hierarchical Bayesian model.
- Each topic is characterized by a probability distribution over words.
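For illustration, here is a minimal sketch of the LDA step using scikit-learn (the paper's exact topic-modeling toolkit is not detailed here); the article texts and topic count below are hypothetical stand-ins:

```python
# A minimal sketch of the LDA step, assuming scikit-learn as the
# topic-model implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-ins for the Wikipedia articles paired with images.
articles = [
    "aircraft wing engine flight airport runway",
    "guitar concert band music stage audience",
    "parliament election vote government minister",
]

# Bag-of-words representation of each article.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)

# Fit LDA with K topics; each topic is a distribution over words.
lda = LatentDirichletAllocation(n_components=40, random_state=0)
lda.fit(X)

# Per-article topic probability distributions: these become the
# regression targets for the CNN.
topic_targets = lda.transform(X)   # shape: (n_articles, 40)
print(topic_targets.sum(axis=1))   # each row sums to ~1
```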
1.2. Convolutional Neural Network (CNN)
- Two CNN architectures are tried.
- One is CaffeNet, an 8-layer CNN that is a modified version of AlexNet.
- The other is a 6-layer CNN obtained by removing the first two convolutional layers from CaffeNet.
- To learn to predict the target topic probability distributions, a sigmoid cross-entropy loss is minimized on the image dataset.
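Below is a minimal training sketch, assuming PyTorch, with torchvision's AlexNet standing in for CaffeNet; the batch, targets, and hyperparameters are dummy values for illustration:

```python
# A minimal training sketch, assuming PyTorch; torchvision's AlexNet
# stands in for CaffeNet, and the batch/targets below are dummy data.
import torch
import torch.nn as nn
from torchvision.models import alexnet

K = 40                                    # number of LDA topics
model = alexnet(weights=None)             # train from scratch
model.classifier[6] = nn.Linear(4096, K)  # replace 1000-way head with K outputs

# Sigmoid cross-entropy: BCEWithLogitsLoss accepts soft targets in [0, 1],
# so the per-image LDA topic distribution can be used directly as the target.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

images = torch.randn(8, 3, 224, 224)      # dummy image batch
targets = torch.rand(8, K)
targets = targets / targets.sum(dim=1, keepdim=True)  # rows sum to 1

optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```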
1.3. Pretraining Dataset
- The ImageCLEF 2010 Wikipedia collection consists of 237,434 Wikipedia images and the Wikipedia articles that contain them.
- After filtering, the training data is composed of 100,785 images and 35,582 unique articles.
- We can observe that the most representative images for each topic present some regularities and thus allow the CNN to learn discriminative features.
2. Results
2.1. Image Classification
- After pretraining, features are extracted from the top layers of the CNN, and one-vs-rest linear SVMs are trained for image classification on the PASCAL VOC 2007 dataset.
- It is found that the best validation performance is obtained with 40 topics.
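A minimal sketch of this evaluation protocol, assuming scikit-learn; the pool5 features and VOC labels below are random stand-ins, not real extracted data:

```python
# A minimal sketch of the transfer evaluation, assuming scikit-learn;
# the features and labels below are random stand-ins.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

n_train, n_test, dim = 500, 100, 9216     # AlexNet pool5: 256 * 6 * 6 = 9216
rng = np.random.default_rng(0)
X_train = rng.standard_normal((n_train, dim))  # stand-in for extracted features
X_test = rng.standard_normal((n_test, dim))
Y_train = rng.integers(0, 2, (n_train, 20))    # multi-label VOC targets, 20 classes

# One linear SVM per class, one-vs-rest, trained on the frozen CNN features.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(X_train, Y_train)
scores = clf.decision_function(X_test)    # per-class scores, e.g. for computing AP
```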
- From the above table, the proposed TextTopicNet pool5 features are substantially more discriminative than the rest for the most difficult classes, e.g. “bottle”, “pottedplant”, or “cow”.
- TextTopicNet (COCO): corresponds to a model trained with MS-COCO images and their ground-truth caption annotations as textual content.
- TextTopicNet (Wiki) features, learned from a very noisy signal, perform surprisingly well compared with those of the TextTopicNet (COCO) model.
2.2. Multimodal Image Retrieval
- For retrieval, images and text documents are projected into the learned topic space, and the KL divergence between the query (image or text) and every entry in the database is computed, as sketched below.
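A minimal sketch of this retrieval step, assuming NumPy arrays of topic distributions (from the CNN for images, from LDA for text); the database and query here are random stand-ins:

```python
# A minimal sketch of topic-space retrieval, assuming NumPy arrays of
# topic distributions; the data below is randomly generated.
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two topic distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def retrieve(query, database, top_k=5):
    """Rank database entries by KL divergence to the query distribution."""
    dists = np.array([kl_divergence(query, d) for d in database])
    return np.argsort(dists)[:top_k]     # indices of the closest entries

# Dummy topic distributions: the query can be an image or a text document,
# since both are projected into the same 40-dimensional topic space.
rng = np.random.default_rng(0)
database = rng.dirichlet(np.ones(40), size=1000)
query = rng.dirichlet(np.ones(40))
print(retrieve(query, database))
```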
- The proposed self-supervised TextTopicNet outperforms unsupervised approaches and achieves competitive performance with supervised methods without using any labeled data.
- The retrieved results are semantically close to the query, although not necessarily visually similar.
- By leveraging textual semantic information, TextTopicNet learns a polysemic representation of images.
(There are also other results, such as object detection; please feel free to read the paper directly.)
Self-supervised learning is performed by utilizing the text passages and images in Wikipedia articles.
Reference
[2017 CVPR] [TextTopicNet] Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces
1.2. Unsupervised/Self-Supervised Learning
1993 … 2017 [TextTopicNet] … 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR] [W-MSE] [SimSiam+AL] [BYOL+LP] 2022 [BEiT] [BEiT V2]