Review — Stacked Denoising Autoencoders (Self-Supervised Learning)

One of the Earliest Reconstruction-Based Self-Supervised Learning Approaches, Using Denoising Autoencoders/Stacked Denoising Autoencoders

Sik-Ho Tsang
4 min read · Sep 4, 2021
Stacked Autoencoder (Figure from Setting up stacked autoencoders)

In this story, Extracting and Composing Robust Features with Denoising Autoencoders (Denoising Autoencoders/Stacked Denoising Autoencoders), by Université de Montréal, is briefly reviewed. This is a paper from Prof. Yoshua Bengio’s research group. In this paper:

  • Denoising Autoencoder is designed to reconstruct a denoised image from a noisy input image.
  • By training the denoising autoencoder, feature learning is achieved without using any labels, which is then used for fine-tuning in image classification tasks.
  • This paper should be one of the early papers for self-supervised learning.

This is a paper in 2008 ICML with over 5800 citations, later published in 2010 JMLR with over 6200 citations.


  1. Denoising Autoencoder
  2. Stacked Denoising Autoencoder
  3. Fine-Tuning for Image Classification
  4. Experimental Results

1. Denoising Autoencoder

Denoising Autoencoder
  • x: Original input image.
  • ~x: Corrupted image.
  • y: Hidden representation.
  • z: Reconstructed image.
  • Autoencoder consists of an encoder and a decoder.
  • Encoder: The corrupted input ~x is first mapped to a hidden representation y.
  • Decoder: Then the cleaned input z is reconstructed from y.

The above autoencoder has only one encoding layer fθ and one decoding layer gθ.
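The single-layer denoising autoencoder above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the dimensions and weight values are made up, the corruption is the masking noise used in the paper (zeroing a random fraction of components), and the decoder uses tied weights (W transposed), which is one common choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative dimensions, not from the paper.
d, d_hidden = 8, 4
W = rng.normal(0, 0.1, size=(d_hidden, d))   # encoder weights (decoder tied: W.T)
b, b_prime = np.zeros(d_hidden), np.zeros(d)

x = rng.random(d)              # original input x
mask = rng.random(d) > 0.3     # masking noise: zero roughly 30% of components
x_tilde = x * mask             # corrupted input ~x

y = sigmoid(W @ x_tilde + b)         # encoder: y = f_theta(~x)
z = sigmoid(W.T @ y + b_prime)       # decoder: z = g_theta(y)

# Key point: reconstruction error is measured against the CLEAN x, not ~x,
# so the network must learn to undo the corruption.
loss = float(np.mean((x - z) ** 2))
```

Training would then minimize this loss by gradient descent on W, b, and b′; the crucial detail is that the target of the reconstruction is the uncorrupted input.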

2. Stacked Denoising Autoencoder

Stacking Denoising Autoencoders
  • At that time, deep autoencoders were difficult to train end-to-end, so the network is instead trained greedily, layer by layer.
  • Left: After training a first-level denoising autoencoder (as in the first figure), its learnt encoding function is applied to the clean input.
  • Middle: The resulting representation is used to train a second-level denoising autoencoder, learning a second-level encoding function f(2)θ.
  • Right: From there, the procedure can be repeated to build a deeper model.
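The greedy layer-wise procedure above can be sketched as follows. Assumptions to note: the layer sizes are illustrative, and the random weights stand in for parameters that would actually be obtained by training a denoising autoencoder at each level; only the data flow of the stacking procedure is shown.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(v, p, rng):
    """Masking noise: zero a fraction p of the components."""
    return v * (rng.random(v.shape) > p)

sizes = [8, 6, 4]          # input -> hidden 1 -> hidden 2 (illustrative)
weights, biases = [], []

h = rng.random(sizes[0])   # stands in for a clean training input
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    # In the real procedure, (W, b) would be learned by training a denoising
    # autoencoder on corrupt(h, ...); random values stand in for that here.
    W = rng.normal(0, 0.1, size=(d_out, d_in))
    b = np.zeros(d_out)
    weights.append(W)
    biases.append(b)
    # The NEXT level is trained on the CLEAN representation from this level;
    # corruption is applied only inside each level's own training.
    h = sigmoid(W @ h + b)
```

The design choice worth noticing is that each new level sees the uncorrupted output of the level below; noise is injected afresh when training each individual denoising autoencoder.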

3. Fine-Tuning for Image Classification

Fine-tuning of a deep network for classification
  • After training a stack of encoders as explained in the previous figure, an output layer is added on top of the stacked layers of the encoder part.
  • The parameters of the whole system are fine-tuned to minimize the error in predicting the supervised target (e.g., class), by performing gradient descent on a supervised cost.
Training and Fine-Tuning of an Autoencoder (Figure from Setting up stacked autoencoders)
  • The above figure shows the general steps: pre-training using autoencoders, then fine-tuning using only the encoder.
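The fine-tuning stage can be sketched as below. Caveats: the encoder weights are random stand-ins for pre-trained parameters, the layer sizes and the 3-class output are hypothetical, and for brevity only the gradient step on the new output layer is shown; real fine-tuning backpropagates the supervised error through all encoder layers as well.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Pretend these encoder weights came from layer-wise DAE pre-training
# (random values stand in for the pre-trained parameters).
W1 = rng.normal(0, 0.1, size=(6, 8)); b1 = np.zeros(6)
W2 = rng.normal(0, 0.1, size=(4, 6)); b2 = np.zeros(4)
# Newly added supervised output layer (hypothetical 3-class problem).
Wo = rng.normal(0, 0.1, size=(3, 4)); bo = np.zeros(3)

x = rng.random(8)          # clean input: no corruption at fine-tuning time
t = 1                      # hypothetical class label

h1 = sigmoid(W1 @ x + b1)  # stacked encoder, forward pass
h2 = sigmoid(W2 @ h1 + b2)
p = softmax(Wo @ h2 + bo)  # class probabilities
loss = -np.log(p[t])       # supervised cross-entropy cost

# One gradient-descent step on the output layer; full fine-tuning would
# also propagate this error back through W2 and W1.
grad = p.copy(); grad[t] -= 1.0        # d(loss)/d(logits) for softmax + CE
Wo -= 0.1 * np.outer(grad, h2)
bo -= 0.1 * grad
```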

4. Experimental Results

Data sets (Characteristics of the 10 different problems considered)
Samples form the various image classification problems
  • Different perturbations are applied to the datasets for testing:
  • rot: Rotation.
  • bg-rand: Addition of a background composed of random pixels.
  • bg-img: Addition of a background composed of patches extracted from a set of images, etc.
Comparison of stacked denoising autoencoders (SDAE-3) with other models.
  • SDAE-3: Neural networks with 3 hidden layers initialized by stacking denoising autoencoders.
  • The encoder part is fine-tuned on the classification tasks.

The SDAE-3 algorithm performs on par with or better than the best competing algorithms, including deep belief nets.

Unsupervised initialization of layers with an explicit denoising criterion helps to capture interesting structure in the input distribution.

This in turn leads to intermediate representations much better suited for subsequent learning tasks such as supervised classification.


