Review — Stacked Denoising Autoencoders (Self-Supervised Learning)

One of the Earliest Reconstruction-Based Self-Supervised Learning Approaches, Using Denoising Autoencoders/Stacked Denoising Autoencoders

Stacked Autoencoder (Figure from Setting up stacked autoencoders)

In this story, Extracting and Composing Robust Features with Denoising Autoencoders (Denoising Autoencoders/Stacked Denoising Autoencoders), by Université de Montréal, is briefly reviewed. This is a paper from Prof. Yoshua Bengio’s research group. In this paper:

  • A denoising autoencoder is designed to reconstruct the clean image from a corrupted (noisy) input image.
  • Training the denoising autoencoder learns features without using any labels. The learned features are then fine-tuned for image classification tasks.
  • This is one of the early papers on self-supervised learning.

This is a 2008 ICML paper with over 5800 citations. It was later published in 2010 JMLR, with over 6200 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Denoising Autoencoder
  2. Stacked Denoising Autoencoder
  3. Fine-Tuning for Image Classification
  4. Experimental Results

1. Denoising Autoencoder

Denoising Autoencoder
  • x: Original input image.
  • ~x: Corrupted image.
  • y: Hidden representation.
  • z: Reconstructed image.
  • Autoencoder consists of an encoder and a decoder.
  • Encoder: The corrupted input ~x is first mapped to a hidden representation y.
  • Decoder: Then the cleaned input z is reconstructed from y.

The autoencoder above has only one layer (fθ) in the encoder and one layer (gθ) in the decoder.
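The corrupt, encode, decode pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes masking noise (a random fraction nu of the input components is set to 0), sigmoid layers, and a cross-entropy reconstruction loss. The names corrupt, encode, decode, W_prime and the sizes used are chosen here for illustration only. The key point is that the loss compares the reconstruction z with the clean input x, not with the corrupted ~x.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, nu, rng):
    """Masking noise (assumed): zero each component independently with probability nu."""
    keep = rng.random(x.shape) >= nu
    return x * keep

def encode(x_tilde, W, b):
    """Encoder: map the corrupted input ~x to the hidden representation y."""
    return sigmoid(x_tilde @ W + b)

def decode(y, W_prime, b_prime):
    """Decoder: map the hidden representation y to the reconstruction z."""
    return sigmoid(y @ W_prime + b_prime)

def reconstruction_loss(x, z, eps=1e-9):
    """Cross-entropy between the CLEAN input x and the reconstruction z (batch mean)."""
    per_example = -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps), axis=1)
    return per_example.mean()

rng = np.random.default_rng(0)
d, h = 784, 100                       # illustrative sizes: 28x28 inputs, 100 hidden units
x = rng.random((8, d))                # stand-in batch of 8 "images" with values in [0, 1)
W = rng.normal(0.0, 0.01, (d, h))
b = np.zeros(h)
W_prime = rng.normal(0.0, 0.01, (h, d))
b_prime = np.zeros(d)

x_tilde = corrupt(x, nu=0.25, rng=rng)   # ~x
y = encode(x_tilde, W, b)                # y
z = decode(y, W_prime, b_prime)          # z
loss = reconstruction_loss(x, z)         # target is x, not ~x
print(f"reconstruction loss: {loss:.3f}")
```

Training would then adjust W, b, W_prime and b_prime by gradient descent on this loss.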

2. Stacked Denoising Autoencoder

Stacking Denoising Autoencoders
  • At that time, deep autoencoders were difficult to train all at once, so the autoencoder was trained layer by layer.
  • Left: After training a first-level denoising autoencoder (as in the first figure), its learnt encoding function is applied to the clean input.
  • Middle: The resulting representation is used to train a second-level denoising autoencoder to learn a second-level encoding function f(2)θ.
  • Right: From there, the procedure can be repeated to build a deeper model.
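The layer-by-layer procedure above could look roughly like the following NumPy sketch. It is written under several assumptions that are not taken from the paper's code: masking noise, sigmoid units, full-batch gradient descent, and toy sizes. The helper names train_dae and train_stack are hypothetical. Note that corruption is applied only while training each level, whereas the representation passed to the next level is computed from the uncorrupted input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(data, n_hidden, nu=0.25, lr=0.1, epochs=50, seed=0):
    """Train one denoising autoencoder; return only its encoder parameters (W, b)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    W = rng.normal(0.0, 0.1, (d, n_hidden))
    b = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))   # decoder weights, discarded after training
    b2 = np.zeros(d)
    for _ in range(epochs):
        x_tilde = data * (rng.random(data.shape) >= nu)   # fresh masking noise each epoch
        y = sigmoid(x_tilde @ W + b)
        z = sigmoid(y @ W2 + b2)
        # Gradient of the cross-entropy loss w.r.t. the decoder pre-activation is (z - x).
        dz = (z - data) / n
        dy = (dz @ W2.T) * y * (1 - y)
        W2 -= lr * (y.T @ dz)
        b2 -= lr * dz.sum(axis=0)
        W -= lr * (x_tilde.T @ dy)
        b -= lr * dy.sum(axis=0)
    return W, b

def train_stack(data, layer_sizes, **kwargs):
    """Greedy layer-wise training: each level is trained on the previous level's output."""
    encoders = []
    rep = data
    for level, n_hidden in enumerate(layer_sizes):
        W, b = train_dae(rep, n_hidden, seed=level, **kwargs)
        encoders.append((W, b))
        rep = sigmoid(rep @ W + b)   # encode the clean (uncorrupted) representation
    return encoders

rng = np.random.default_rng(42)
X = (rng.random((64, 20)) > 0.5).astype(float)   # toy binary data, not a real dataset
encoders = train_stack(X, layer_sizes=[16, 8])
print([W.shape for W, _ in encoders])
```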

3. Fine-Tuning for Image Classification

Fine-tuning of a deep network for classification
  • After training a stack of encoders as explained in the previous figure, an output layer is added on top of the stacked layers of the encoder part.
  • The parameters of the whole system are fine-tuned to minimize the error in predicting the supervised target (e.g., class), by performing gradient descent on a supervised cost.
Training and Fine-Tuning of an Autoencoder (Figure from Setting up stacked autoencoders)
  • The above figure shows the general steps: pre-training with the full autoencoder, then fine-tuning using only the encoder.
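As a rough sketch of the fine-tuning step, the code below puts a softmax output layer on top of a stack of encoder layers and updates all parameters by gradient descent on a supervised cross-entropy cost. Everything here is an assumption made for illustration: the function name fine_tune, the toy data, the learning rate, and the randomly initialised "pretrained" layers, which merely stand in for encoders obtained from denoising pre-training.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, encoders, V, c):
    """Return the activations of every layer plus the class probabilities."""
    acts = [X]
    for W, b in encoders:
        acts.append(sigmoid(acts[-1] @ W + b))
    return acts, softmax(acts[-1] @ V + c)

def fine_tune(X, labels, encoders, n_classes, lr=0.5, epochs=200, seed=0):
    """Add a softmax output layer and train the whole network on the labels."""
    rng = np.random.default_rng(seed)
    encoders = [(W.copy(), b.copy()) for W, b in encoders]   # leave the inputs untouched
    V = rng.normal(0.0, 0.1, (encoders[-1][0].shape[1], n_classes))
    c = np.zeros(n_classes)
    targets = np.eye(n_classes)[labels]
    n = X.shape[0]
    losses = []
    for _ in range(epochs):
        acts, probs = forward(X, encoders, V, c)
        losses.append(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))
        delta = (probs - targets) / n                 # gradient at the softmax input
        grad_V, grad_c = acts[-1].T @ delta, delta.sum(axis=0)
        delta = (delta @ V.T) * acts[-1] * (1 - acts[-1])
        V -= lr * grad_V
        c -= lr * grad_c
        for k in reversed(range(len(encoders))):      # backpropagate through the stack
            W, b = encoders[k]
            grad_W, grad_b = acts[k].T @ delta, delta.sum(axis=0)
            if k > 0:
                delta = (delta @ W.T) * acts[k] * (1 - acts[k])
            encoders[k] = (W - lr * grad_W, b - lr * grad_b)
    return encoders, (V, c), losses

rng = np.random.default_rng(1)
X = rng.random((100, 10))                             # toy inputs
labels = (X[:, 0] > 0.5).astype(int)                  # toy two-class target
pretrained = [(rng.normal(0.0, 0.5, (10, 8)), np.zeros(8)),   # stand-ins for pre-trained encoders
              (rng.normal(0.0, 0.5, (8, 6)), np.zeros(6))]
tuned, (V, c), losses = fine_tune(X, labels, pretrained, n_classes=2)
print(f"supervised loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The decoder layers play no role here: only the encoder stack and the new output layer are kept and updated.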

4. Experimental Results

Data sets (Characteristics of the 10 different problems considered)
Samples from the various image classification problems
  • Different corruptions are added to the datasets for testing, for example:
  • rot: Rotation.
  • bg-rand: Addition of a background composed of random pixels.
  • bg-img: Addition of a background composed of patches extracted from a set of images.
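To give a feel for the bg-rand and bg-img variations listed above, here is one plausible way to build them with NumPy. The exact construction used in the benchmark may differ. The function names, the rule of filling only zero-valued pixels, and the synthetic digit and source image are all assumptions made for this sketch.

```python
import numpy as np

def add_random_background(digit, rng):
    """bg-rand style: fill background (zero) pixels with uniform random values."""
    background = rng.random(digit.shape)
    return np.where(digit > 0, digit, background)

def add_image_background(digit, source_image, rng):
    """bg-img style: fill background pixels with a random patch cut from another image."""
    h, w = digit.shape
    top = rng.integers(0, source_image.shape[0] - h + 1)
    left = rng.integers(0, source_image.shape[1] - w + 1)
    patch = source_image[top:top + h, left:left + w]
    return np.where(digit > 0, digit, patch)

rng = np.random.default_rng(0)
digit = np.zeros((28, 28))
digit[10:18, 12:16] = 1.0            # synthetic stand-in for a digit stroke
source = rng.random((64, 64))        # synthetic stand-in for a natural image

bg_rand = add_random_background(digit, rng)
bg_img = add_image_background(digit, source, rng)
print(bg_rand.shape, bg_img.shape)
```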
Comparison of stacked denoising autoencoders (SDAE-3) with other models.
  • SDAE-3: Neural networks with 3 hidden layers initialized by stacking denoising autoencoders.
  • The encoder part is fine-tuned on the classification tasks.

The SDAE-3 algorithm performs on par with or better than the best of the other algorithms, including deep belief nets.

Unsupervised initialization of layers with an explicit denoising criterion helps to capture interesting structure in the input distribution.

This in turn leads to intermediate representations much better suited for subsequent learning tasks such as supervised classification.

PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG
