Review — Context Encoders: Feature Learning by Inpainting

Context Encoders for Inpainting & Self-Supervised Learning

Sik-Ho Tsang
6 min read · Aug 28, 2021
Semantic Inpainting results on held-out images for context encoder trained using reconstruction and adversarial loss

In this story, Context Encoders: Feature Learning by Inpainting (Context Encoders), by University of California, Berkeley, is reviewed. In this paper:

  • A CNN, called Context Encoders, is trained to generate the contents of an arbitrary image region conditioned on its surroundings.
  • Both a standard pixel-wise reconstruction loss and a joint reconstruction plus adversarial loss are tested.
  • The Context Encoders can also be used for feature learning. The learnt features can be used for pre-training on classification, detection, and segmentation tasks, which is a form of self-supervised learning.

This is a paper in 2016 CVPR with over 3000 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Context Encoders for Image Generation
  2. Loss Function
  3. Region Masks
  4. Two CNN Architectures
  5. Inpainting Results
  6. Feature Learning Results

1. Context Encoders for Image Generation

Context Encoders for Image Generation

1.1. Pipeline

  • The overall architecture is a simple encoder-decoder pipeline.
  • The encoder takes an input image with missing regions and produces a latent feature representation of that image.
  • The decoder takes this feature representation and produces the missing image content.
  • It is found to be important to connect the encoder and the decoder through a channel-wise fully-connected layer, which allows each unit in the decoder to reason about the entire image content.

1.2. Encoder

  • The encoder is derived from the AlexNet architecture.
  • Given an input image of size 227×227, the first five convolutional layers and the following pooling layer (called pool5) are used to compute an abstract 6×6×256 dimensional feature representation.
  • In contrast to AlexNet, the proposed model is not trained for ImageNet classification; rather, the network is trained for context prediction “from scratch” with randomly initialized weights.
  • The latent feature dimension is 6×6×256 = 9216 for both encoder and decoder. Fully connecting the encoder and decoder would result in an explosion in the number of parameters (over 100M!), to the extent that efficient training on current GPUs would be difficult.
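
As a concrete illustration, here is a minimal PyTorch sketch of such an AlexNet-style encoder (conv1 to pool5). The layer sizes follow the standard AlexNet convolutional stack; details such as normalization layers and grouped convolutions are omitted and should be treated as assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    # Sketch of an AlexNet-style encoder (conv1..conv5 + pool5): maps a
    # 227x227 input (with the dropped region zeroed out) to a 6x6x256 feature map.
    encoder = nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
    )

    x = torch.randn(1, 3, 227, 227)  # image with the missing region zeroed out
    print(encoder(x).shape)          # torch.Size([1, 256, 6, 6])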

1.3. Channel-Wise Fully-Connected Layer

  • If the input layer has m feature maps of size n×n, this layer will output m feature maps of dimension n×n.
  • However, unlike a fully-connected layer, it has no parameters connecting different feature maps and only propagates information within feature maps.
  • Thus, the number of parameters in this channel-wise fully-connected layer is mn⁴, compared to m²n⁴ parameters in a fully-connected layer (ignoring the bias term).
  • This is followed by a stride 1 convolution to propagate information across channels.
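
A minimal sketch of such a channel-wise fully-connected layer is given below (my own PyTorch illustration, not the authors' code). Each channel gets its own n²×n² weight matrix, so information is mixed fully within a channel but never across channels; a separate stride-1 convolution would then propagate information across channels.

    import torch
    import torch.nn as nn

    class ChannelWiseFC(nn.Module):
        """Channel-wise fully-connected layer: fully mixes information
        spatially within each channel, but never across channels, giving
        m*n^4 parameters instead of m^2*n^4 for a standard FC layer."""

        def __init__(self, m, n):
            super().__init__()
            self.m, self.n = m, n
            self.weight = nn.Parameter(torch.randn(m, n * n, n * n) * 0.01)
            self.bias = nn.Parameter(torch.zeros(m, n * n))

        def forward(self, x):                        # x: (batch, m, n, n)
            b = x.size(0)
            x = x.view(b, self.m, self.n * self.n)   # flatten each channel
            # independent matrix multiply per channel
            x = torch.einsum('bmi,mij->bmj', x, self.weight) + self.bias
            return x.view(b, self.m, self.n, self.n)

    # For the 6x6x256 bottleneck: 256 * 6^4 ≈ 0.33M parameters,
    # versus 9216 * 9216 ≈ 85M for a single fully-connected bottleneck layer.
    layer = ChannelWiseFC(m=256, n=6)
    out = layer(torch.randn(2, 256, 6, 6))
    print(out.shape)                                 # torch.Size([2, 256, 6, 6])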

1.4. Decoder

  • The channel-wise fully-connected layer is followed by a series of five up-convolutional layers, each with a rectified linear unit (ReLU) activation function.
  • An up-convolution is simply a convolution that results in a higher-resolution image. It can be understood as upsampling followed by convolution, or as convolution with fractional stride, i.e. deconvolution.
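
Below is a minimal PyTorch sketch of such a decoder: five stride-2 ConvTranspose2d (up-convolution) layers, each followed by ReLU. The channel widths and output resolution are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    # Five up-convolutional layers that expand the 6x6x256 bottleneck back to
    # an image-shaped prediction; each ConvTranspose2d doubles the resolution.
    # Handling of the final output range (e.g. clipping to [0, 1]) is omitted.
    decoder = nn.Sequential(
        nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(True),
        nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(True),
        nn.ConvTranspose2d(128, 64,  kernel_size=4, stride=2, padding=1), nn.ReLU(True),
        nn.ConvTranspose2d(64,  32,  kernel_size=4, stride=2, padding=1), nn.ReLU(True),
        nn.ConvTranspose2d(32,  3,   kernel_size=4, stride=2, padding=1), nn.ReLU(True),
    )

    z = torch.randn(1, 256, 6, 6)   # bottleneck features from the encoder
    print(decoder(z).shape)         # torch.Size([1, 3, 192, 192])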

2. Loss Function

  • There are two losses: a reconstruction loss and an adversarial loss.

2.1. Reconstruction Loss

  • The reconstruction (L2) loss is responsible for capturing the overall structure of the missing region and coherence with regard to its context, but it tends to average together the multiple modes in predictions.
  • For each ground truth image x, the proposed context encoder F produces an output F(x).
  • Let M̂ be a binary mask corresponding to the dropped image region, with a value of 1 wherever a pixel was dropped and 0 for input pixels.
  • The reconstruction loss is a normalized masked L2 distance:

    L_rec(x) = || M̂ ⊙ (x − F((1 − M̂) ⊙ x)) ||²₂

  • where ⊙ is the element-wise product operation.
  • There is no significant difference between L1 and L2 losses: both often fail to capture high-frequency details and prefer a blurry solution over highly accurate textures.
  • The L2 loss is used. Since it predicts the mean of the distribution, it minimizes the mean pixel-wise error, but this results in a blurry averaged image.
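
A minimal sketch of this normalized masked L2 loss (my own PyTorch illustration, following the formula above):

    import torch

    def reconstruction_loss(x, x_hat, mask):
        """Normalized masked L2 loss.

        x     : ground-truth image, shape (B, 3, H, W)
        x_hat : F((1 - mask) * x), the context encoder's prediction
        mask  : binary mask, 1 where pixels were dropped, 0 elsewhere
        """
        diff = mask * (x - x_hat)
        return (diff ** 2).mean()   # mean() provides the normalization

    # Usage sketch: the context encoder only ever sees the context (1 - mask) * x.
    # x_hat = context_encoder((1 - mask) * x)
    # loss  = reconstruction_loss(x, x_hat, mask)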

2.2. The Adversarial Loss

  • The adversarial loss tries to make prediction look real, and has the effect of picking a particular mode from the distribution.
  • Only the generator (not the discriminator) is conditioned on context when training with a GAN. The adversarial loss for context encoders, L_adv, is:

    L_adv = max_D E_(x∈X) [ log(D(x)) + log(1 − D(F((1 − M̂) ⊙ x))) ]
  • Both F and D are optimized jointly using alternating SGD.
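
A sketch of how this adversarial objective could be implemented, using the standard non-saturating GAN losses; this is an assumption-laden illustration, not the authors' exact training code.

    import torch

    def adversarial_losses(D, x_real, x_fake, eps=1e-8):
        """Discriminator and generator losses for the inpainting GAN.

        D      : discriminator, outputs a probability in (0, 1)
        x_real : real images (or real dropped regions)
        x_fake : context encoder predictions F((1 - mask) * x)
        """
        # Discriminator: push D(real) -> 1 and D(fake) -> 0.
        d_loss = -(torch.log(D(x_real) + eps).mean()
                   + torch.log(1.0 - D(x_fake.detach()) + eps).mean())
        # Generator: make the predictions look real to the discriminator.
        g_loss = -torch.log(D(x_fake) + eps).mean()
        return d_loss, g_loss

    # The generator F and discriminator D are then updated with alternating SGD steps.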

2.3. Joint Loss

  • The overall loss function is a weighted sum of the two terms:

    L = λ_rec · L_rec + λ_adv · L_adv

  • Currently, the adversarial loss is used only for the inpainting experiments; the AlexNet-based feature-learning model is trained with the reconstruction loss only.

3. Region Masks

  • Three different strategies are proposed: central region, random block, and random region.

3.1. Central Region

  • The simplest such shape is the central square patch in the image.
  • The network learns low-level image features that latch onto the boundary of the central mask.
  • Those low-level image features tend not to generalize well to images without masks.

3.2. Random Block

  • A number of smaller, possibly overlapping blocks, covering up to 1/4 of the image, are removed.
  • However, random block masking still has sharp boundaries that convolutional features can latch onto.

3.3. Random Region

  • Arbitrary shapes are removed from images.
  • This can prevent the network from learning low-level features corresponding to the removed mask.
  • In practice, it is found that random region and random block masks produce similarly general features, while significantly outperforming the central region features.
  • Random region dropout is used for all the feature-learning experiments.
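
As an illustration, here is a small NumPy sketch of a random block mask generator; the block size and stopping rule are my own assumptions, not the paper's exact sampling procedure.

    import numpy as np

    def random_block_mask(h, w, max_frac=0.25, block=32, rng=None):
        """Drop several possibly-overlapping square blocks until roughly
        max_frac of the image is masked. Returns a binary mask with
        1 where pixels are dropped and 0 elsewhere."""
        rng = np.random.default_rng() if rng is None else rng
        mask = np.zeros((h, w), dtype=np.float32)
        while mask.mean() < max_frac:
            y = rng.integers(0, h - block)
            x = rng.integers(0, w - block)
            mask[y:y + block, x:x + block] = 1.0
        return mask

    image = np.random.rand(3, 227, 227).astype(np.float32)  # stand-in image
    mask = random_block_mask(227, 227)
    masked_input = image * (1.0 - mask[None])  # the context that the encoder sees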

4. Two CNN Architectures

4.1. CNN for Inpainting

Context Encoder for Inpainting
  • The context encoder trained with the joint reconstruction and adversarial loss for semantic inpainting is shown above.
  • Center region dropout is used.

4.2. CNN for Feature Learning

Context Encoder for Feature Learning
  • Context encoder trained with reconstruction loss for feature learning.
  • Arbitrary region dropout is used.

5. Inpainting Results

Comparison with Content-Aware Fill (Photoshop feature based on [2]) on held-out images.
  • The proposed inpainting performs generally well, as shown in the first figure of this story.
  • If a region can be filled with low-level textures, texture synthesis methods, such as [2, 11], can often perform better.
Semantic Inpainting using different methods on held-out images
Semantic Inpainting accuracy for Paris StreetView dataset on held-out images.
  • Nearest-neighbor inpainting (NN) is compared as a baseline.
  • The proposed reconstructions are well-aligned semantically.

6. Feature Learning Results

Quantitative comparison for classification, detection and semantic segmentation
  • AlexNet trained with the reconstruction loss is used for feature learning.
  • A random initialization performs roughly 25% below an ImageNet-trained model; the context encoder narrows this gap without using any labels.
  • Context encoders are competitive with concurrent self-supervised feature learning methods [7, 39] (Context Prediction [7]) and significantly outperform autoencoders and Agrawal et al. [1].

For semantic segmentation, using the proposed context encoders for pre-training (30.0%) outperforms a randomly initialized network (19.8%) as well as a plain autoencoder (25.2%), which is trained simply to reconstruct its full input.

Reference

[2016 CVPR] [Context Encoders]
Context Encoders: Feature Learning by Inpainting

Self-Supervised Learning

2014 [Exemplar-CNN] 2015 [Context Prediction] 2016 [Context Encoders]

My Other Previous Paper Readings
