Brief Review — Self-Supervised Feature Learning by Learning to Spot Artifacts

Using a GAN to Spot Artifacts; the Pretrained Discriminator Is Used for Downstream Tasks

Sik-Ho Tsang
5 min read · Oct 12, 2022
A mixture of real images (green border) and images with synthetic artifacts (red border). Is a good object representation necessary to tell them apart?

Self-Supervised Feature Learning by Learning to Spot Artifacts
Spot Artifacts, by University of Bern
2018 CVPR, Over 100 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, GAN, Image Classification, Object Detection

  • A discriminator network is trained to distinguish real images from images with synthetic artifacts.
  • This pretrained discriminator is then used for downstream tasks.


  1. Learning GAN to Spot Artifacts
  2. Results

1. Learning GAN to Spot Artifacts

Proposed Architectures. Two Autoencoders output either real images (top row) or images with artifacts (bottom row). Discriminator C learns to localize all artifacts.

1.1. Autoencoder

  • Two autoencoder networks {E, D1, D2, D3, D4, D5}, where E is the encoder and D={D1, D2, D3, D4, D5} is the decoder, are pre-trained to reproduce high-fidelity real images x. Φ(x) denotes the output of the encoder E on x.
  • A binary spatial mask Ω of size M×N, matching the spatial size of the feature output of E, is sampled so that each spatial location of the image features is dropped at random with a given probability θ ∈ (0, 1).
  • The input to the first decoder layer D1 is defined as:
  • with
  • where u is a large uniform filter. At the later layers D2, D3, and D4, the mask is upsampled to match the resolution of the decoder.
  • The following input is provided to each layer Di:
  • where Ui-1 denotes the nearest neighbor upsampling to the spatial dimensions of the output of Di-1.
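
The damage step described above can be sketched as follows. This is a minimal sketch on a single-channel feature map, not the authors' implementation. It assumes that dropped entries are replaced by a local average computed with the uniform filter u, and it uses the convention that a mask value of 1 marks a kept entry and 0 a dropped one. Both choices, like all function names here, are assumptions of this sketch. Plain Python lists keep it dependency-free; a real implementation would use a tensor library.

```python
import random


def sample_keep_mask(rows, cols, drop_prob, rng):
    """Binary M x N mask: 0 with probability drop_prob (dropped), else 1 (kept)."""
    return [[0 if rng.random() < drop_prob else 1 for _ in range(cols)]
            for _ in range(rows)]


def uniform_filter(feature, radius):
    """Average each entry over a (2*radius+1)^2 window, clipped at the borders."""
    rows, cols = len(feature), len(feature[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [feature[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def damage(feature, keep_mask, radius):
    """Keep entries where the mask is 1; replace dropped ones by the local average."""
    smooth = uniform_filter(feature, radius)
    return [[f if m == 1 else s for f, m, s in zip(f_row, m_row, s_row)]
            for f_row, m_row, s_row in zip(feature, keep_mask, smooth)]


def upsample_nearest(mask, factor):
    """Nearest-neighbor upsampling: repeat every entry factor times per axis."""
    return [[v for v in row for _ in range(factor)]
            for row in mask for _ in range(factor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
keep_mask = sample_keep_mask(4, 4, drop_prob=0.5, rng=rng)
damaged = damage(feature, keep_mask, radius=1)
mask_2x = upsample_nearest(keep_mask, factor=2)  # mask at the next decoder resolution
```

The upsampled mask has twice the spatial size of the original, so it can be aligned with the output of a decoder layer that doubles the resolution.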

1.2. Repair Network Layer

Repair Network Layer
  • A repair network {R1, R2, R3, R4, R5} is added to the layers of the bottom decoder network. The output of each layer Ri is masked by Ω so that it affects only the masked (dropped) features.
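
One way to read the masking of the repair output is as a residual update that is zeroed wherever the feature was kept. The sketch below assumes this additive form and the convention that a mask value of 1 marks a kept entry and 0 a dropped one. `apply_masked_repair` and its arguments are hypothetical names, and the repair output is just a given list here rather than the output of a learned layer.

```python
def apply_masked_repair(decoder_feature, repair_output, keep_mask):
    """Add the repair output only at dropped positions (mask value 0)."""
    return [[d + (1 - m) * r for d, r, m in zip(d_row, r_row, m_row)]
            for d_row, r_row, m_row in zip(decoder_feature, repair_output, keep_mask)]


decoder_feature = [[1.0, 2.0], [3.0, 4.0]]
repair_output = [[10.0, 10.0], [10.0, 10.0]]
keep_mask = [[1, 0], [0, 1]]  # 1 = kept, 0 = dropped (convention of this sketch)
repaired = apply_masked_repair(decoder_feature, repair_output, keep_mask)
# repaired == [[1.0, 12.0], [13.0, 4.0]]: kept entries are left untouched
```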

1.3. Discriminator

  • A discriminator network C is trained to classify real images x as real and corrupted images x̂ as fake, and also to localize the artifacts.

1.4. Loss Functions

  • The autoencoder is first pre-trained using the reconstruction loss. The training of E and D is thus a standard optimization with the following least-squares objective:
  • Then, for the adversarial loss, the discriminator C is trained to classify x as real and x̂ as fake:
  • In addition, the discriminator C is trained to predict the mask by minimizing:
  • where:
  • And σ is the sigmoid function.
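
The three objectives can be sketched numerically as follows. The exact equations are not reproduced in the text above, so this sketch assumes textbook forms: mean squared error for the reconstruction, binary cross-entropy on the discriminator's real/fake logit, and a per-location sigmoid cross-entropy between the predicted mask logits and the mask target. Any weighting between the terms is left out, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written as sigma in the text."""
    return 1.0 / (1.0 + math.exp(-z))


def reconstruction_loss(image, reconstruction):
    """Least-squares objective: mean squared difference over all pixels."""
    return sum((a - b) ** 2 for a, b in zip(image, reconstruction)) / len(image)


def bce_with_logit(logit, target):
    """Binary cross-entropy for one logit and a target in {0, 1}."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def discriminator_loss(logit_real, logit_fake):
    """Real images are labeled 1, corrupted images 0."""
    return bce_with_logit(logit_real, 1) + bce_with_logit(logit_fake, 0)


def mask_loss(mask_logits, mask_target):
    """Average sigmoid cross-entropy over all (flattened) mask locations."""
    losses = [bce_with_logit(z, t) for z, t in zip(mask_logits, mask_target)]
    return sum(losses) / len(losses)


# A confident, correct discriminator gets a lower loss than a confident, wrong one.
good = discriminator_loss(logit_real=3.0, logit_fake=-3.0)
bad = discriminator_loss(logit_real=-3.0, logit_fake=3.0)
```

The sketch takes the logarithm of the sigmoid output directly, which is fine for moderate logits; a practical implementation would use a numerically stable log-sigmoid.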

1.5. Architecture

  • Let (64)3c2 denote a convolutional layer with 64 filters of size 3×3 with a stride of 2.
  • The architecture of the encoder E is then defined by (32)3c1-(64)2c2-(128)2c2-(256)2c2-(512)2c2.
  • The decoder network D is given by (256)3rc2-(128)3rc2-(64)3rc2-(32)3rc2-(3)3c1 where rc denotes resize-convolutions (i.e., bilinear resizing followed by standard convolution).
  • The discriminator network C is based on the standard AlexNet up to conv5, after which a single 3×3 convolutional layer is used for the mask prediction. For the classification, the second fully-connected layer is removed.
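
The layer notation above can be unpacked mechanically. The following sketch parses the encoder and decoder strings exactly as written in this post and multiplies the per-layer factors. Reading the trailing number of an rc layer as the resize factor is an assumption of this sketch, and `parse_spec` and `total_scale` are hypothetical helper names.

```python
import re

# (filters)kernel, then "c" (convolution) or "rc" (resize-convolution), then a factor
LAYER_PATTERN = re.compile(r"\((\d+)\)(\d+)(rc|c)(\d+)")


def parse_spec(spec):
    """Turn a string such as '(64)3c2-(128)3c2' into a list of layer dicts."""
    layers = []
    for token in spec.split("-"):
        match = LAYER_PATTERN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized layer token: {token!r}")
        filters, kernel, kind, factor = match.groups()
        layers.append({
            "filters": int(filters),
            "kernel": int(kernel),
            "kind": "resize-conv" if kind == "rc" else "conv",
            "factor": int(factor),  # stride for conv, assumed resize factor for rc
        })
    return layers


def total_scale(layers):
    """Product of the strides (conv) or resize factors (resize-conv)."""
    scale = 1
    for layer in layers:
        scale *= layer["factor"]
    return scale


encoder = parse_spec("(32)3c1-(64)2c2-(128)2c2-(256)2c2-(512)2c2")
decoder = parse_spec("(256)3rc2-(128)3rc2-(64)3rc2-(32)3rc2-(3)3c1")
```

Under this reading, the encoder reduces the spatial resolution by a factor of 16 and the decoder increases it by the same factor, so the reconstruction has the same size as the input.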

2. Results

2.1. Analysis & Ablation

Two examples of corrupt images obtained from our damage & repair network
  • (a): Original images and masks.
  • (b): Images corrupted with a 50% drop rate.
  • (c): The output of the decoder when the repair network is not active. In this case the artifacts are very visible and easy to detect by exploiting low-level statistics.
  • (d): The output of the decoder when the repair network is not masked. The repair network is then able to change the image more globally. This has a negative effect on the discriminator as it fails to predict the mask.
  • (e): An example where the images are fed through the damage & repair network twice. This results in even more artifacts than in (b).
Influence of different architectural choices on the classification accuracy on STL-10

The baseline performs the best.

2.2. SOTA Comparison on STL-10 and PASCAL VOC

Comparison of test-set accuracy on STL-10 with other published results.

The proposed method improves on the performance of the other methods.

Transfer learning results for classification, detection and segmentation on Pascal VOC2007 and VOC2012 compared to state-of-the-art feature learning methods.

With a mAP of 69.8%, the method achieves state-of-the-art performance on the classification task.

With a mAP of 52.5%, it achieves the second-best result on the detection task.

  • The FCN framework is used for segmentation.

A state-of-the-art mean intersection over union (mIoU) of 38.1% is achieved on the segmentation task.

2.3. Layer-wise Performance

Validation set accuracy on ImageNet with linear classifiers trained on the frozen convolutional layers after unsupervised pre-training.

On ImageNet, the proposed model outperforms all other approaches in this benchmark.

Validation set accuracy on Places with linear classifiers trained on the frozen convolutional layers after unsupervised pretraining.

On Places, the proposed method outperforms all the other methods for layers conv2-conv5. It also achieves the highest overall accuracy, 37.3%.


[2018 CVPR] [Spot Artifacts]
Self-Supervised Feature Learning by Learning to Spot Artifacts



