Review: Semi-supervised Learning by Entropy Minimization

Entropy Regularization Using Unlabeled Data

Sik-Ho Tsang
3 min read · Apr 19, 2022

Semi-supervised Learning by Entropy Minimization
Entropy Minimization, by Heudiasyc, CNRS/UTC, and Université de Montréal
2004 NIPS, Over 1400 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning

  • Unlabeled data is used as a regularization term via entropy minimization. This is a paper from the research group of Prof. Bengio.

Outline

  1. Entropy Minimization
  2. Experimental Results

1. Entropy Minimization

1.1. Labeled Data

  • The dataset is denoted as Ln = {(xi, zi)}, i = 1, …, n, where zi ∈ {0, 1}^K and K is the number of classes.
  • If xi is labeled as ωk, then zik = 1 and zil = 0 for l ≠ k, i.e. zi is a one-hot vector.
  • If xi is unlabeled, then zil = 1 for l = 1, …, K.
  • The conditional log-likelihood for the labeled data is (up to an additive constant): L(θ; Ln) = Σi ln(Σk zik fk(xi)).
  • where fk(xi) can be parameterized by θ, i.e. fk(xi; θ), which can take the form of logistic regression, a neural network, etc. (A minimal code sketch of this term follows this list.)
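To make the term above concrete, here is a minimal NumPy sketch of the labeled-data log-likelihood. The function name and array shapes are my own illustration, not code from the paper.

```python
import numpy as np

def labeled_log_likelihood(probs, z):
    """Conditional log-likelihood L(theta; Ln) = sum_i ln( sum_k z_ik * f_k(x_i) ).

    probs : (n, K) array of predicted class probabilities f_k(x_i; theta).
    z     : (n, K) indicator array; one-hot rows for labeled points,
            all-ones rows for unlabeled points (as defined above).
    """
    return float(np.sum(np.log(np.sum(z * probs, axis=1) + 1e-12)))
```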

1.2. Unlabeled Data

  • Unlabeled data is used through entropy minimization, acting as a regularizer: Hemp = −Σi Σk gk(xi) ln gk(xi), summed over the unlabeled examples, and the overall criterion to maximize becomes C(θ, λ) = L(θ; Ln) − λHemp.
  • where gk is simply fk applied to the unlabeled data; θ is dropped for simplicity. It can also take other forms, such as using a temperature hyperparameter for smoother outputs. (A sketch of this term and the combined criterion follows this list.)
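Below is a small sketch, under the same assumptions as the previous snippet, of the entropy regularizer and the combined criterion C = L − λH. The helper names and the choice of λ handling are mine, not the paper's.

```python
import numpy as np

def entropy_regularizer(probs_u):
    """Empirical entropy of the predictions g_k(x_i) on the unlabeled points."""
    p = np.clip(probs_u, 1e-12, 1.0)            # avoid log(0)
    return float(-np.sum(p * np.log(p), axis=1).mean())

def minimum_entropy_criterion(probs_l, z_l, probs_u, lam):
    """C(theta, lambda) = L(theta; Ln) - lambda * H_emp, to be maximized,
    so confident (low-entropy) predictions on unlabeled data are rewarded.
    Reuses labeled_log_likelihood() from the previous sketch."""
    return labeled_log_likelihood(probs_l, z_l) - lam * entropy_regularizer(probs_u)
```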

2. Experimental Results

  • Consider two-class problems in a 50-dimensional input space. Each class is generated with equal probability from a normal distribution.
  • Class ω1 is normal with mean (a, a, …, a) and unit covariance matrix; class ω2 is normal with mean −(a, a, …, a) and unit covariance matrix. The parameter a tunes the Bayes error, which varies from 1% to 20% (1%, 2.5%, 5%, 10%, 20%). (A toy sketch of this sampling setup is given after this list.)
  • The learning sets comprise nl labeled examples (nl = 50, 100, 200) and nu unlabeled examples (nu = nl × {1, 3, 10, 30, 100}).
  • Overall, 75 different setups are evaluated, and for each one, 10 different training samples are generated. Generalization performances are estimated on a test set of size 10000.
  • Simple logistic regression is used.
  • The overall error rates (averaged over all settings) favor minimum entropy logistic regression (14.1 ± 0.3%) over plain supervised logistic regression (14.9 ± 0.3%).
  • For reference, logistic regression reaches 10.4 ± 0.1% when all examples are labeled.
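As a rough illustration of this experimental setup (not the authors' code), a toy generator for the two-Gaussian data could look like the following. The value a = 0.23 is an arbitrary placeholder, since the paper tunes a to reach each target Bayes error.

```python
import numpy as np

def make_two_gaussians(n, d=50, a=0.23, rng=None):
    """Sample n points: class omega_1 ~ N(+(a,...,a), I),
    class omega_2 ~ N(-(a,...,a), I), with equal class priors."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y = rng.integers(0, 2, size=n)                # 0 -> omega_1, 1 -> omega_2
    signs = np.where(y == 0, 1.0, -1.0)[:, None]  # +a for omega_1, -a for omega_2
    x = signs * a + rng.standard_normal((n, d))   # means +/-(a,...,a), unit covariance
    return x, y

# e.g. 100 labeled + 1000 unlabeled training points and a 10,000-point test set
x_l, y_l = make_two_gaussians(100)
x_u, _ = make_two_gaussians(1000)
x_te, y_te = make_two_gaussians(10000)
```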

(Since famous datasets such as CIFAR or ImageNet were not yet in use at that time, the model is not a modern deep network, and the experimental setting was still developing, I have only reviewed this paper very briefly. Please feel free to read the paper if interested.)

Reference

[2004 NIPS] [Entropy Minimization, EntMin]
Semi-supervised Learning by Entropy Minimization

Pretraining or Semi-Supervised Learning

2004 [Entropy Minimization, EntMin] 2013 [Pseudo-Label (PL)] 2015 [Ladder Network, Γ-Model] 2016 [Sajjadi NIPS’16] 2017 [Mean Teacher] [PATE & PATE-G] [Π-Model, Temporal Ensembling] 2018 [WSL] 2019 [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] 2020 [BiT] [Noisy Student] [SimCLRv2]

My Other Previous Paper Readings
