Review: Semi-Supervised Learning with Ladder Networks

Ladder Network, Γ-Model: Minimize Cost of Latent Features

  • The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions, and is built on top of the Ladder Network.

Outline

  1. Minimizing Deep Features
  2. Ladder Network & Γ-Model
  3. Experimental Results

1. Minimizing Deep Features

1.1. Denoising Autoencoder

Denoising Autoencoder (Figure from a Korean Language Presentation)
  • In a Denoising Autoencoder, noise is added to the clean input x to produce the corrupted input x̃, which is fed to the autoencoder. The autoencoder then tries to reconstruct an output x̂ that is as close as possible to x.
  • By doing so, the deep latent feature at the middle carries rich feature information that can be used for fine-tuning on other datasets.
  • To train the Denoising Autoencoder, the cost minimizes the difference between the reconstructed output x̂ and the clean input x, e.g. the squared reconstruction error ||x̂ - x||² (see the sketch after this list).
  • However, there is no cost term that minimizes the difference of the latent features at the middle.
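A minimal PyTorch sketch of this idea is given below; the layer sizes, the Gaussian noise level, and the squared-error cost are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    """Corrupt x into x~, encode to a latent z, decode back to x^."""
    def __init__(self, in_dim=784, hidden_dim=256, latent_dim=64, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, x):
        x_tilde = x + self.noise_std * torch.randn_like(x)  # corrupted input x~
        z = self.encoder(x_tilde)                           # latent feature z
        x_hat = self.decoder(z)                             # reconstruction x^
        return x_hat, z

# Training cost: only the reconstruction x^ is compared against the clean x;
# nothing in this cost constrains the latent feature z itself.
def dae_cost(model, x):
    x_hat, _ = model(x)
    return F.mse_loss(x_hat, x)
```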

1.2. Minimizing Deep Feature Difference

Minimizing Deep Features (Figure from a Korean Language Presentation)
  • One way is to directly minimize the deep feature difference of z. The cost function is identical to that used in a Denoising Autoencoder, except that the latent variables z replace the observations x: the denoised latent ẑ is compared against the clean latent z, e.g. ||ẑ - z||² (sketched below).
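A minimal sketch of this cost, assuming a corruptible encoder and a denoising head that maps the noisy latent z̃ back toward the clean z; `encoder`, `denoiser`, and the noise level are hypothetical names and values used only for illustration.

```python
import torch
import torch.nn.functional as F

def latent_denoising_cost(encoder, denoiser, x, noise_std=0.3):
    """Same form as the DAE reconstruction cost, but on the latent z instead of x."""
    z_clean = encoder(x).detach()                           # clean pass -> z, used as the target
    z_tilde = encoder(x + noise_std * torch.randn_like(x))  # corrupted pass -> z~
    z_hat = denoiser(z_tilde)                               # denoised estimate z^
    return F.mse_loss(z_hat, z_clean)                       # || z^ - z ||^2
```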

2. Ladder Network & Γ-Model

Ladder Network (Figure from a Korean Language Presentation)
  • (Here, only the conceptual idea is presented.)

2.1. Ladder Network

  • x is the input image and y is the output, which can be the label.
  • Since the cost function needs both the clean z(l) and the corrupted z̃(l), the encoder is run twice during training: a clean pass for z(l) and a corrupted pass for z̃(l).
  • g is a denoising function whose inputs come from the decoder layer above and from the corresponding layer of the corrupted path (see the sketch below).
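A conceptual PyTorch sketch of the two encoder passes and the per-layer denoising costs follows. The layer sizes, noise level, and the simple per-layer denoiser standing in for g are assumptions; in the actual Ladder Network, g also combines the vertical signal from the decoder layer above, and batch normalization is used throughout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LadderSketch(nn.Module):
    """Conceptual sketch: a clean and a corrupted encoder pass, plus per-layer
    denoisers that pull each corrupted z~(l) back toward the clean z(l)."""
    def __init__(self, dims=(784, 500, 250, 10), noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.enc = nn.ModuleList([nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:])])
        self.dec = nn.ModuleList([nn.Linear(b, b) for b in dims[1:]])  # stand-in for g(., .)

    def encode(self, x, corrupt):
        zs, h = [], x
        for i, layer in enumerate(self.enc):
            z = layer(h)
            if corrupt:
                z = z + self.noise_std * torch.randn_like(z)
            zs.append(z)
            h = torch.relu(z) if i < len(self.enc) - 1 else z  # logits at the top
        return zs, h

    def forward(self, x):
        z_clean, _ = self.encode(x, corrupt=False)       # clean pass: z(l), used as targets
        z_tilde, y_tilde = self.encode(x, corrupt=True)  # corrupted pass: z~(l)
        denoising = sum(F.mse_loss(g(zt), zc.detach())
                        for g, zt, zc in zip(self.dec, z_tilde, z_clean))
        return y_tilde, denoising

# Training combines the supervised cross-entropy on labelled data (from the
# corrupted output) with the weighted sum of per-layer denoising costs, e.g.
#   loss = F.cross_entropy(y_tilde_labelled, labels) + lambda_u * denoising
```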

2.2. Γ-Model

Γ-Model (Figure from a Korean Language Presentation)
  • The Γ-Model is a simple special case of the Ladder Network.
  • It applies a denoising cost only at the top layer, which means most of the decoder can be omitted (see the sketch below).
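Continuing the hypothetical LadderSketch above, the Γ-Model variant keeps only the top-layer denoising term; the cost weighting and names remain illustrative assumptions.

```python
import torch.nn.functional as F

def gamma_model_cost(model, x, labels, unsup_weight=1.0):
    """Γ-Model sketch: denoising cost only at the top layer, so the lower
    decoder layers of the LadderSketch above would not be needed."""
    z_clean, _ = model.encode(x, corrupt=False)       # clean pass (top-layer target)
    z_tilde, y_tilde = model.encode(x, corrupt=True)  # corrupted pass (supervised output)
    top_denoise = F.mse_loss(model.dec[-1](z_tilde[-1]), z_clean[-1].detach())
    return F.cross_entropy(y_tilde, labels) + unsup_weight * top_denoise
```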

3. Experimental Results

3.1. Fully Connected MLP on MNIST

MNIST test errors in the permutation invariant setting
  • The baseline MLP has the architecture 784–1000–500–250–250–250–10 (a minimal sketch follows this list).
  • The authors report that, encouraged by the good results, they also tested with N=50 labels and obtained a test error of 1.62%.
  • With N=100 labels, all models sometimes failed to converge properly.
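A minimal sketch of the 784–1000–500–250–250–250–10 baseline; the ReLU activations are an assumption for illustration, and the paper's other MLP details (e.g. batch normalization) are not reproduced here.

```python
import torch.nn as nn

# Baseline fully connected architecture: 784-1000-500-250-250-250-10.
dims = [784, 1000, 500, 250, 250, 250, 10]
layers = []
for a, b in zip(dims[:-1], dims[1:]):
    layers += [nn.Linear(a, b), nn.ReLU()]
baseline_mlp = nn.Sequential(*layers[:-1])  # drop the activation after the final 10-way layer
```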

3.2. CNN on MNIST

CNN results for MNIST
  • Two models, Conv-FC and Conv-Small, are used with the Γ-Model.

3.3. CNN on CIFAR-10

Test results for CNN on CIFAR-10 dataset without data augmentation

References

[2015 NIPS] [Ladder Network, Γ-Model]
Semi-Supervised Learning with Ladder Networks

[Korean Language Presentation]

Pretraining or Weakly/Semi-Supervised Learning

2013 [Pseudo-Label (PL)] 2015 [Ladder Network, Γ-Model] 2016 [Sajjadi NIPS’16] 2017 [Mean Teacher] 2018 [WSL] 2019 [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] 2020 [BiT] [Noisy Student] [SimCLRv2]
