Review — Multi-task Self-Supervised Visual Learning

Pretrain Using Multiple Pretext Tasks to Improve Downstream Task Accuracy


1. Multi-Task Network

Structure of the multi-task network, based on ResNet-101, in which block 3 has 23 residual units. (a) Naïve shared-trunk approach, where each “head” is attached to the output of block 3. (b) The lasso architecture, where each “head” receives a linear combination of unit outputs within block 3, weighted by a matrix that is trained to be sparse.
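To make the lasso idea in the caption concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes each residual unit produces a feature vector of the same length, and that sparsity is encouraged by adding an L1 penalty on the mixing weights to the training loss. The names `head_inputs`, `l1_penalty`, and `strength`, and the toy sizes, are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def head_inputs(unit_outputs: List[Vector], weights: List[List[float]]) -> List[Vector]:
    """Return one mixed feature vector per head.

    unit_outputs: one feature vector per residual unit (all the same length).
    weights: weights[h][u] is how much head h draws from unit u.
    """
    dim = len(unit_outputs[0])
    mixed = []
    for row in weights:
        if len(row) != len(unit_outputs):
            raise ValueError("each weight row needs one entry per unit")
        vec = [0.0] * dim
        # Weighted sum of the unit outputs for this head.
        for w, feat in zip(row, unit_outputs):
            for i in range(dim):
                vec[i] += w * feat[i]
        mixed.append(vec)
    return mixed


def l1_penalty(weights: List[List[float]], strength: float) -> float:
    """L1 (lasso) term that pushes individual mixing weights toward zero."""
    return strength * sum(abs(w) for row in weights for w in row)


# Toy example: 3 units (block 3 in the network has 23), 2 heads, 2-dim features.
units = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights = [
    [1.0, 0.0, 0.0],  # head 0 reads only unit 0
    [0.0, 0.5, 0.5],  # head 1 mixes units 1 and 2
]
mixed = head_inputs(units, weights)
penalty = l1_penalty(weights, strength=0.1)  # hypothetical penalty strength
```

With sparse weights, each head ends up reading from only a few units, which is how the architecture lets different tasks select features from different depths instead of all sharing the final block 3 output.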

1.1. Common Trunk

1.2. Separating Features via Lasso

1.3. Harmonizing Network Inputs

1.4. Distributed Network Training

Distributed training setup

2. Model Fine-Tuning

2.1. ImageNet

2.2. PASCAL VOC 2007 Detection

2.3. NYU V2 Depth Prediction

3. Experimental Results

3.1. Individual Self-Supervised Training Performance

Individual Self-Supervised Training Performance
Comparison of performance for different self-supervised methods over time

3.2. Naïve Multi-Task Combination of Self-Supervision Tasks

Comparison of various combinations of self-supervised tasks. RP: Relative Position (Context Prediction); Col: Colorization; Ex: Exemplar Nets; MS: Motion Segmentation (Motion Masks). Metrics: ImageNet: Recall@5; PASCAL: mAP; NYU: % Pixels below 1.25.

3.3. Harmonization

Comparison of methods with and without harmonization (H: harmonization)

3.4. Lasso

Comparison of performance with and without the lasso technique for factorizing representations, for a network trained on all four self-supervised tasks for 16.8K GPU-hours.
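The comparisons in this section are reported as ImageNet Recall@5 and, for NYU, the percentage of pixels below 1.25. As a reminder of what those two numbers measure, here is a small sketch. It assumes the usual ratio form of the depth metric, where a pixel counts when max(pred/true, true/pred) is below the threshold; the function names and toy data are hypothetical, and this is not the evaluation code used in the paper.

```python
from typing import Sequence


def recall_at_k(ranked_predictions: Sequence[Sequence[int]],
                labels: Sequence[int], k: int = 5) -> float:
    """Percentage of examples whose true label is among the top-k predictions."""
    hits = sum(
        1 for preds, label in zip(ranked_predictions, labels)
        if label in preds[:k]
    )
    return 100.0 * hits / len(labels)


def pct_pixels_below(pred_depth: Sequence[float],
                     true_depth: Sequence[float],
                     threshold: float = 1.25) -> float:
    """Percentage of pixels where max(pred/true, true/pred) < threshold.

    Assumes all depths are strictly positive.
    """
    good = sum(
        1 for p, t in zip(pred_depth, true_depth)
        if max(p / t, t / p) < threshold
    )
    return 100.0 * good / len(true_depth)


# Toy classification data: class ids ranked from most to least likely.
ranked = [[3, 1, 4, 5, 9, 2], [7, 8, 2, 6, 0, 1]]
labels = [9, 1]
r5 = recall_at_k(ranked, labels)

# Toy depth data: flattened per-pixel depths.
pred = [1.0, 2.0, 3.0, 4.0]
true = [1.1, 2.0, 6.0, 3.0]
within = pct_pixels_below(pred, true)
```

Both metrics are "higher is better", so they can be read in the same direction as PASCAL mAP when comparing rows.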



