Review — Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles

Solving Jigsaw Puzzles as Pretext Task for Self-Supervised Learning

Sik-Ho Tsang
4 min read · Sep 12, 2021
Learning image representations by solving Jigsaw puzzles. (a): The image from which the tiles (marked with green lines) are extracted. (b): A puzzle obtained by shuffling the tiles. (c): Determining the relative position of the tiles (the relative location between the central tile and the top-left and top-middle tiles is ambiguous).

In this story, Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles (Jigsaw Puzzles / CFN), by the University of Bern, is reviewed. In this paper:

  • Solving Jigsaw puzzles is treated as a pretext task, which requires no manual labeling. By training the CFN to solve Jigsaw puzzles, both a feature mapping of object parts and their correct spatial arrangement are learnt.
  • Specifically, the Context Free Network (CFN), a siamese-ennead CNN, is designed to take image tiles as input and output their correct spatial arrangement. In this way, a useful feature representation is learnt and then used for several transfer learning benchmarks (downstream tasks).

This is a paper in 2016 ECCV with over 1200 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Feature Learning by Solving Jigsaw Puzzles
  2. Context Free Network (CFN): Network Architecture
  3. Experimental Results

1. Feature Learning by Solving Jigsaw Puzzles

1.1. Conceptual Idea

Most of the shape in each of these two pairs of images is the same
  • Consider two cars with different colors and two dogs with different fur patterns. The features learned to solve puzzles on one (car/dog) image will also apply to the other (car/dog) image, as they are invariant to the appearance differences and capture the shared patterns.

1.2. Why Naïvely Stacking Patches Does NOT Work

  • An immediate approach to solving Jigsaw puzzles is to stack the tiles of the puzzle along the channel dimension (i.e., the input data would have 9×3 = 27 channels) and feed these channels into a CNN to solve the Jigsaw puzzles.
  • The problem with this design is that the network prefers to identify correlations between low-level texture statistics across tiles rather than between high-level primitives.
  • A CNN that learns only low-level features is NOT what we want.

Late fusion is used to force the proposed CFN to learn high-level features.
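
To make the contrast concrete, below is a minimal sketch (assuming PyTorch; the tensor shapes and layer sizes are illustrative, not the paper's exact configuration) of early fusion by channel stacking versus late fusion with a shared-weight per-tile encoder.

```python
# Minimal sketch (PyTorch assumed): early fusion vs. late fusion of the 9 tiles.
import torch
import torch.nn as nn

# Early fusion: stack the 9 RGB tiles along the channel axis (9 x 3 = 27 channels).
# A single CNN sees all tiles at once and can latch onto low-level statistics
# (edges, textures, chromatic aberration) shared across tile borders.
early_input = torch.randn(8, 27, 64, 64)                     # (batch, channels, H, W)
early_conv = nn.Conv2d(27, 96, kernel_size=11, stride=2)

# Late fusion: each tile is encoded independently by a shared-weight CNN; the
# per-tile features only meet at the fully connected stage, so cross-tile
# reasoning has to happen on top of higher-level representations.
tiles = torch.randn(8, 9, 3, 64, 64)                         # (batch, tile, channels, H, W)
shared_encoder = nn.Sequential(                              # toy stand-in for the AlexNet-like branch
    nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
per_tile = [shared_encoder(tiles[:, i]) for i in range(9)]   # same weights for every tile
late_fused = torch.cat(per_tile, dim=1)                      # fusion happens only after encoding

print(early_conv(early_input).shape, late_fused.shape)
```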

2. Context Free Network (CFN): Network Architecture

Context Free Network (CFN): Network Architecture

2.1. Framework

CFN is designed to force the network to learn high-level features.

  • First of all, the tiles of the Jigsaw puzzle are permuted before being fed into an AlexNet-like CNN.
  • Then, each patch/tile goes through the CNN separately. In this way, the tiles do NOT communicate with each other within the network in the early layers.
  • Each branch shares the same weights.
  • There are many possible permutations (9! = 362,880 in total), so a predefined permutation set is used and the task is to predict the index of the chosen permutation (technically, the output is defined as a probability vector with 1 at the location of the chosen permutation index, e.g., the 64th location, and 0 elsewhere). A sketch of this architecture is given below.
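
Below is a minimal sketch of the siamese-ennead design (assuming PyTorch; the simplified layers, the 512-dimensional per-tile embedding, and the permutation-set size of 1000 are illustrative assumptions, not the authors' exact AlexNet configuration).

```python
# Sketch of a siamese-ennead CFN (PyTorch assumed; simplified, not the authors' exact AlexNet):
# nine tiles share one conv branch, each tile gets its own fc6 embedding, and the
# concatenated embeddings feed a classifier over the predefined permutation set.
import torch
import torch.nn as nn

class CFNSketch(nn.Module):
    def __init__(self, num_permutations=1000):               # size of the permutation set (assumed here)
        super().__init__()
        self.branch = nn.Sequential(                          # shared-weight tile encoder (toy version)
            nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(),            # -> 256 * 3 * 3 features per tile
        )
        self.fc6 = nn.Sequential(nn.Linear(256 * 3 * 3, 512), nn.ReLU())
        self.classifier = nn.Sequential(                      # late fusion of the 9 tile embeddings
            nn.Linear(9 * 512, 4096), nn.ReLU(),
            nn.Linear(4096, num_permutations),                # one logit per permutation index
        )

    def forward(self, tiles):                                 # tiles: (batch, 9, 3, 64, 64), already shuffled
        feats = [self.fc6(self.branch(tiles[:, i])) for i in range(9)]
        return self.classifier(torch.cat(feats, dim=1))

model = CFNSketch()
logits = model(torch.randn(2, 9, 3, 64, 64))                  # -> (2, 1000)
labels = torch.randint(0, 1000, (2,))                         # index of the permutation used per sample
loss = nn.CrossEntropyLoss()(logits, labels)                  # cross-entropy over permutation indices
```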

2.2. Training

  • During training, each input image is resized until either the height or the width matches 256 pixels, preserving the original aspect ratio.
  • Then, a random 225×225 region is cropped from the resized image and split into a 3×3 grid of 75×75-pixel tiles.
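
A minimal sketch of this preprocessing (assuming PIL; the make_tiles helper name is hypothetical) could look like this:

```python
# Sketch of the tile extraction (PIL assumed; make_tiles is a hypothetical helper):
# resize so the shorter side is 256 px, take a random 225x225 crop, split into a 3x3 grid.
import random
from PIL import Image

def make_tiles(img: Image.Image):
    w, h = img.size
    scale = 256 / min(w, h)                                   # shorter side -> 256, aspect ratio preserved
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left = random.randint(0, w - 225)                         # random 225x225 crop
    top = random.randint(0, h - 225)
    crop = img.crop((left, top, left + 225, top + 225))
    tiles = []
    for row in range(3):                                      # 3x3 grid of 75x75 tiles
        for col in range(3):
            tiles.append(crop.crop((col * 75, row * 75, (col + 1) * 75, (row + 1) * 75)))
    return tiles                                              # 9 tiles, in row-major order
```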

2.3. Avoid Shortcuts

One important point is to avoid shortcuts, i.e., to prevent the network from relying on low-level cues instead of learning high-level features.

  • A 64×64 region is extracted from each 75×75 tile with a random shift and fed to the network. Thus, there is an average gap of 11 pixels (ranging from 0 to 22) between adjacent tiles. This helps to avoid shortcuts due to edge continuity.
  • To avoid shortcuts due to chromatic aberration, the color channels are jittered.
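
Continuing the sketch above (PIL and NumPy assumed; the exact jitter magnitude is an assumption, not the paper's value), the two anti-shortcut tricks could look like this:

```python
# Sketch of the anti-shortcut tricks (PIL/NumPy assumed; jitter magnitude is an assumption).
import random
import numpy as np
from PIL import Image

def random_64_crop(tile: Image.Image) -> Image.Image:
    # 75 - 64 = 11 px of slack per tile, so the gap between adjacent crops is 0-22 px (avg. 11).
    dx = random.randint(0, 11)
    dy = random.randint(0, 11)
    return tile.crop((dx, dy, dx + 64, dy + 64))

def jitter_channels(tile: Image.Image, max_shift: int = 2) -> Image.Image:
    # Shift each colour channel by a small random offset to break chromatic-aberration cues.
    arr = np.array(tile)                                      # H x W x 3, writable copy
    for c in range(3):
        sx = random.randint(-max_shift, max_shift)
        sy = random.randint(-max_shift, max_shift)
        arr[..., c] = np.roll(np.roll(arr[..., c], sy, axis=0), sx, axis=1)
    return Image.fromarray(arr)

# Usage with the make_tiles() helper from the previous sketch:
# inputs = [jitter_channels(random_64_crop(t)) for t in make_tiles(img)]
```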

3. Experimental Results

  • The training uses 1.3M color images of 256×256 pixels from ImageNet. Then the model is transferred to other tasks.

3.1. ImageNet

  • The CFN weights are used to initialize all the conv layers of a standard AlexNet network. Then, the rest of the network is trained from scratch (with Gaussian noise as initial weights) for object classification on the ImageNet dataset.
  • If AlexNet is trained with labels (i.e., fully supervised), the reference maximum accuracy is 57.4%.

The proposed CFN achieves 34.6% when only the fully connected layers are trained.

There is a significant improvement (from 34.6% to 45.3%) when the conv5 layer is also trained. This shows that, after pretext training, the conv5 layer had become specialized for the Jigsaw puzzle reassembly task, so retraining it for classification gives a further boost.
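
A minimal sketch of this transfer setup (using torchvision's AlexNet as a stand-in for the authors' Caffe model; the layer indices and initialization details are assumptions):

```python
# Sketch of the transfer setup (torchvision AlexNet as a stand-in; details are assumptions).
import torch.nn as nn
from torchvision.models import alexnet

pretext = alexnet(num_classes=1000)          # stands in for the conv branch trained on the puzzle task
target = alexnet(num_classes=1000)           # network to be trained for ImageNet classification

# 1) Initialise all conv layers with the pretext-trained weights.
target.features.load_state_dict(pretext.features.state_dict())

# 2) Re-initialise the fully connected layers from scratch with Gaussian noise.
for m in target.classifier.modules():
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)
        nn.init.zeros_(m.bias)

# 3) For the "only fc layers trained" setting, freeze every conv layer; for the
#    "conv5 also trained" setting, leave only the last conv layer trainable
#    (parameter indices below refer to torchvision's AlexNet and are an assumption).
for idx, p in enumerate(target.features.parameters()):
    p.requires_grad = idx >= 8               # conv5 weight/bias only; set False everywhere to freeze all convs
```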

3.2. PASCAL VOC

Results on PASCAL VOC 2007 Detection and Classification
  • For detection, the Fast R-CNN framework is used, while for segmentation, the FCN framework is used.
  • The features trained with CFN achieve 53.2% mAP in detection using multi-scale training and testing, 67.6% in classification, and 37.6% in semantic segmentation, thus outperforming all other self-supervised methods such as Context Prediction [10] and Context Encoders [30].

Jigsaw Puzzles / CFN is closing the gap with features obtained with supervision (supervised AlexNet [25]).
