# Review: Semi-supervised Learning by Entropy Minimization

## Entropy Regularization Using Unlabeled Data

Semi-supervised Learning by Entropy Minimization (Entropy Minimization), by Heudiasyc, CNRS/UTC, and Université de Montréal. 2004 NIPS, Over 1400 Citations (Sik-Ho Tsang @ Medium)

Semi-Supervised Learning

**Unlabeled data** is used as a **regularization term**, by **entropy minimization**. This is a paper from the research group of Prof. Bengio.

# Outline

1. **Entropy Minimization**
2. **Experimental Results**

**1. Entropy Minimization**

## 1.1. Labeled Data

- The dataset is denoted as *Ln* = {*xi*, *zi*}, where *i* runs from 1 to *n*, *z* ∈ {0, 1}^*K*, and *K* is the number of classes.
- If *xi* is labeled *ωk*, then *zik* = 1 and *zil* = 0 for *l* ≠ *k*, i.e. a one-hot vector.
- If *xi* is unlabeled, then *zil* = 1 for *l* = 1, …, *K*.
- **Log-likelihood for labeled data** is:

- where *fk*(*xi*) can be parameterized by *θ*. That means it can become *fk*(*xi*, *θ*), which can take the form of logistic regression, a neural network, etc.
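To make the encoding of *z* concrete, here is a minimal Python sketch. It assumes the log-likelihood has the form Σ*i* log Σ*k* *zik* *fk*(*xi*), which is what the one-hot / all-ones convention suggests (any term that does not depend on *θ* is ignored). The function name `log_likelihood` and the toy probabilities are illustrative only and are not taken from the paper.

```python
import math

def log_likelihood(probs, z):
    """Sum over examples of log(sum_k z_ik * f_k(x_i)).

    probs[i][k] stands in for the model output f_k(x_i);
    z[i] is one-hot for a labeled example and all ones for an unlabeled one.
    """
    total = 0.0
    for p_i, z_i in zip(probs, z):
        total += math.log(sum(z_ik * p_ik for z_ik, p_ik in zip(z_i, p_i)))
    return total

# Toy example with K = 3 classes (values are made up for illustration).
probs = [
    [0.7, 0.2, 0.1],  # labeled as class 1
    [0.1, 0.8, 0.1],  # labeled as class 2
    [0.3, 0.3, 0.4],  # unlabeled
]
z = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],  # all ones: the inner sum is 1, so the term is log(1) = 0
]

ll = log_likelihood(probs, z)
```

Under this form, an unlabeled example contributes nothing to the log-likelihood, since its class probabilities sum to 1. This is why a separate term is needed for unlabeled data to have any effect.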

## 1.2. Unlabeled Data

**Unlabeled data** is used for **entropy minimization**, acting as **regularization**:

- where *gk* is *fk* *for unlabeled data*. *θ* is dropped for simplicity. It can also take other forms, such as using a temperature hyperparameter for a smoother output.
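The idea can be sketched in a few lines of Python. This sketch assumes the criterion to maximize is the labeled log-likelihood minus a weight times the entropy of the predictions on unlabeled points, with *gk* = *fk* there. The weight `lam`, the helper names, and the toy numbers are assumptions for illustration, not the authors' code.

```python
import math

def entropy(p):
    """Shannon entropy -sum_k p_k log p_k of one predicted distribution."""
    return -sum(p_k * math.log(p_k) for p_k in p if p_k > 0.0)

def entmin_criterion(probs, z, lam):
    """Labeled log-likelihood minus lam * entropy on unlabeled examples.

    probs[i][k] stands in for f_k(x_i); z[i] is one-hot (labeled) or
    all ones (unlabeled). Higher is better.
    """
    log_lik = 0.0
    ent = 0.0
    for p_i, z_i in zip(probs, z):
        if sum(z_i) == 1:  # labeled: only the true class contributes
            log_lik += math.log(sum(zk * pk for zk, pk in zip(z_i, p_i)))
        else:              # unlabeled: g_k = f_k, uncertain outputs are penalized
            ent += entropy(p_i)
    return log_lik - lam * ent

# One labeled and one unlabeled example, K = 2 (made-up values).
z = [[1, 0], [1, 1]]
confident = entmin_criterion([[0.9, 0.1], [0.99, 0.01]], z, lam=1.0)
uncertain = entmin_criterion([[0.9, 0.1], [0.5, 0.5]], z, lam=1.0)
```

Here `confident` scores higher than `uncertain`: with the same labeled fit, the criterion prefers a model whose predictions on the unlabeled point are far from uniform, which pushes the decision boundary away from the unlabeled data.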

**2. Experimental Results**

- Consider **two-class problems** in a 50-dimensional input space. Each class is generated with equal probability from a normal distribution.
- Class *ω*1 is normal with mean (*a* *a* … *a*) and unit covariance matrix. Class *ω*2 is normal with mean −(*a* *a* … *a*) and unit covariance matrix. Parameter *a* tunes the Bayes error, which varies from 1% to 20% (1%, 2.5%, 5%, 10%, 20%).
- The learning sets comprise *nl* labeled examples (*nl* = 50, 100, 200) and *nu* unlabeled examples (*nu* = *nl* × (1, 3, 10, 30, 100)).
- Overall, 75 different setups are evaluated, and for each one, 10 different training samples are generated. Generalization performance is estimated on a test set of size 10000.
- Simple **logistic regression** is used.
- The overall error rates (averaged over all settings) are in favor of **minimum entropy logistic regression (14.1 ± 0.3%)**, which is better than **logistic regression (14.9 ± 0.3%)**.
- For reference, logistic regression reaches **10.4 ± 0.1% when all examples are labeled.**
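The synthetic setup described above can be sketched as follows. This is a rough illustration, not the authors' code. It relies on the standard result that two equal-prior Gaussians with means ±*μ* and identity covariance have Bayes error Φ(−‖*μ*‖), so with *μ* = (*a* … *a*) in 50 dimensions, ‖*μ*‖ = *a*·√50. The function names, the random seed, and the ±1 label coding are assumptions.

```python
import itertools
import math
import random
from statistics import NormalDist

DIM = 50

def mean_offset_for_bayes_error(err, dim=DIM):
    """Return a such that classes N(+a*1, I) and N(-a*1, I) have Bayes error err.

    With equal priors the optimal rule thresholds the projection onto the
    mean direction, giving error Phi(-||mu||) with ||mu|| = a * sqrt(dim).
    """
    return -NormalDist().inv_cdf(err) / math.sqrt(dim)

def sample(n, a, rng, dim=DIM):
    """Draw n (x, y) pairs; y = +1 for class omega_1, -1 for class omega_2."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else -1
        x = [y * a + rng.gauss(0.0, 1.0) for _ in range(dim)]
        data.append((x, y))
    return data

bayes_errors = [0.01, 0.025, 0.05, 0.10, 0.20]
n_labeled = [50, 100, 200]
ratios = [1, 3, 10, 30, 100]  # n_u = n_l * ratio

# Every (Bayes error, n_l, n_u) combination: 5 * 3 * 5 setups.
setups = [
    (err, nl, nl * r)
    for err, nl, r in itertools.product(bayes_errors, n_labeled, ratios)
]

rng = random.Random(0)  # fixed seed, chosen arbitrarily
a = mean_offset_for_bayes_error(0.05)
labeled = sample(50, a, rng)
```

Enumerating the grid confirms the count of 75 setups: 5 Bayes error levels × 3 labeled set sizes × 5 unlabeled-to-labeled ratios.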

(Since at that time there were no widely used benchmark datasets such as CIFAR or ImageNet, the model was not a modern deep network, and the experimental settings were still developing, I’ve only reviewed it very briefly. Please feel free to read the paper if interested.)

## Reference

[2004 NIPS] [Entropy Minimization, EntMin]

Semi-supervised Learning by Entropy Minimization

## Pretraining or Semi-Supervised Learning

**2004** [Entropy Minimization, EntMin] **2013** [Pseudo-Label (PL)] **2015** [Ladder Network, Γ-Model] **2016** [Sajjadi NIPS’16] **2017** [Mean Teacher] [PATE & PATE-G] [Π-Model, Temporal Ensembling] **2018** [WSL] **2019** [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] **2020** [BiT] [Noisy Student] [SimCLRv2]