# Review — PATE & PATE-G: Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

## PATE — Private Aggregation of Teacher Ensembles: Training Student Using Ensemble of Teachers

Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data, by Pennsylvania State University, Google Brain, & Google

PATE & PATE-G, 2017 ICLR, Over 600 Citations (Sik-Ho Tsang @ Medium)

Semi-Supervised Learning, Teacher Student, Image Classification

- **Private Aggregation of Teacher Ensembles (PATE)** is proposed, in which **multiple teachers are trained on disjoint datasets.** Because they rely directly on sensitive data, these teachers are not published.
- **Then, the student learns to predict an output chosen by noisy voting among all of the teachers**, and cannot directly access an individual teacher or the underlying data or parameters.
- This paper is co-authored by Ian Goodfellow, who invented GAN.
- **PATE-G** is further proposed, which utilizes **GAN** for PATE.

# Outline

1. **Private Aggregation of Teacher Ensembles (PATE)**
2. **Experimental Results**

**1. Private Aggregation of Teacher Ensembles (PATE)**

## 1.1. Data Partitioning and Teachers

- Instead of training a single model to solve the task associated with **dataset (*X*, *Y*)**, where *X* denotes the set of **inputs** and *Y* the set of **labels**, the data is partitioned into *n* disjoint sets (*Xn*, *Yn*).

A teacher model is trained separately on each set, giving *n* classifiers *fi* called teachers.

- Then they are deployed as **an ensemble making predictions on unseen inputs** *x* by querying each teacher for a prediction *fi*(*x*).
- These predictions are **aggregated into a single prediction.**
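The partitioning and querying steps above can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the helper names `partition_disjoint` and `ensemble_votes` are hypothetical, and each teacher is assumed to be a plain callable that maps an input to a class label.

```python
import random

def partition_disjoint(num_examples, num_teachers, seed=0):
    """Split example indices 0..num_examples-1 into num_teachers disjoint shards."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    # Teacher i receives every num_teachers-th shuffled index, starting at offset i,
    # so no index is shared between two shards.
    return [indices[i::num_teachers] for i in range(num_teachers)]

def ensemble_votes(teachers, x):
    """Query every teacher f_i on input x and collect the predicted labels."""
    return [f(x) for f in teachers]

# Hypothetical sizes, chosen only for illustration.
shards = partition_disjoint(num_examples=1000, num_teachers=10)
```

Each shard would be used to train exactly one teacher, so any single training example can influence at most one vote.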

## 1.2. **Aggregation**

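- Let *m* be the number of classes in the task.
- The label count for a given class *j* and an input *x* is the number of teachers that assigned class *j* to *x*:

$$n_j(x) = \left|\{\, i : i \in \{1, \dots, n\},\ f_i(x) = j \,\}\right|$$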
- Yet, if we **simply apply plurality** (use the label with the largest count), **the ensemble’s decision may depend on a single teacher’s vote.**
- **If two classes have close vote counts, the disagreement** may reveal private information.

Random noise is added to the vote counts *nj* to introduce ambiguity:

$$f(x) = \arg\max_j \left\{ n_j(x) + Lap\left(\tfrac{1}{\gamma}\right) \right\}$$

- *γ* is a privacy parameter and *Lap*(*b*) is the Laplacian distribution with location 0 and scale *b*.
- The parameter *γ* influences the privacy guarantee.

Intuitively, a small *γ* (that is, a large noise scale 1/*γ*) leads to a strong privacy guarantee, but can degrade the accuracy of the labels, as the noisy maximum *f* above can differ from the true plurality.
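To make the noisy vote concrete, here is a minimal sketch, assuming the noise added to each count is drawn from *Lap*(1/*γ*). The function names and the toy vote split are hypothetical. The Laplace sample is built from the standard library as the difference of two exponential draws, which has a Laplace distribution with the same scale.

```python
import random
from collections import Counter

def label_counts(votes, num_classes):
    """n_j(x): number of teachers that voted for class j."""
    counts = Counter(votes)
    return [counts.get(j, 0) for j in range(num_classes)]

def laplace(scale, rng):
    """Sample Lap(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_aggregate(votes, num_classes, gamma, rng):
    """Return argmax_j { n_j(x) + Lap(1/gamma) }."""
    counts = label_counts(votes, num_classes)
    noisy = [c + laplace(1.0 / gamma, rng) for c in counts]
    return max(range(num_classes), key=lambda j: noisy[j])

rng = random.Random(0)
votes = [3] * 200 + [5] * 50  # 250 hypothetical teacher votes
counts = label_counts(votes, num_classes=10)
label = noisy_aggregate(votes, num_classes=10, gamma=0.05, rng=rng)
```

With a wide margin between the top two classes, as in this toy split, the noisy label almost always matches the plurality. When the counts are close, the noise decides the outcome, which is what hides any single teacher's vote.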

- The next step is to train another model, the student, using a fixed number of labels predicted by the teacher ensemble.

## 1.3. Knowledge from an Ensemble to a Student

- A student is trained on non-sensitive and unlabeled data which are labeled using the aggregation mechanism.
- This student model is the one that would be deployed.
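The labeling step for the student can be sketched as follows, assuming the same Laplace-noised vote as in the aggregation mechanism. Everything here is hypothetical and for illustration only: the helper names, the toy teachers that threshold a scalar input, and the query budget of 100. The point is that only a fixed number of public inputs are ever sent to the teachers, and the rest stay unlabeled.

```python
import random

def noisy_vote(teachers, x, num_classes, gamma, rng):
    """Noisy plurality of teacher predictions on x (Laplace noise, scale 1/gamma)."""
    counts = [0] * num_classes
    for f in teachers:
        counts[f(x)] += 1
    # Difference of two Exp(rate=gamma) draws is Laplace with scale 1/gamma.
    noisy = [c + rng.expovariate(gamma) - rng.expovariate(gamma) for c in counts]
    return max(range(num_classes), key=noisy.__getitem__)

def build_student_set(teachers, public_inputs, num_queries, num_classes, gamma, rng):
    """Label only the first num_queries public inputs; the rest stay unlabeled."""
    labeled = [(x, noisy_vote(teachers, x, num_classes, gamma, rng))
               for x in public_inputs[:num_queries]]
    unlabeled = list(public_inputs[num_queries:])
    return labeled, unlabeled

# Hypothetical toy setup: 250 teachers that all threshold a scalar input at 0.5.
rng = random.Random(0)
teachers = [lambda x: int(x > 0.5) for _ in range(250)]
public_inputs = [rng.random() for _ in range(1000)]
labeled, unlabeled = build_student_set(teachers, public_inputs, 100, 2, 0.05, rng)
```

The student would then be trained on `labeled` (and, in the semi-supervised case, also on `unlabeled`) without ever touching the teachers or their private data again.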

## 1.4. Training the student with GANs

- GAN can be used to extend PATE as **PATE-G**.
- The **generator** produces samples from the data distribution by transforming vectors sampled from a Gaussian distribution.
- The **discriminator** is trained to distinguish samples artificially produced by the generator from samples that are part of the real data distribution.
- The **discriminator** is extended from a binary classifier (data vs. generator sample) to a **multi-class classifier** (one of *k* classes of data samples, plus a class for generated samples).

This discriminator/classifier is then trained to classify labeled real samples in the correct class, unlabeled real samples in any of the *k* classes, and the generated samples in the additional class.
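The three training targets above can be written as three per-sample loss terms over *k*+1 logits. The sketch below is one common way to express them and is not taken from the paper's code. The function names are hypothetical, and the generated class is assumed to be the last logit.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_softmax(logits):
    """Log-probabilities over the k real classes plus the generated class."""
    log_z = logsumexp(logits)
    return [v - log_z for v in logits]

def labeled_loss(logits, label):
    """Labeled real sample: cross-entropy against its true class."""
    return -log_softmax(logits)[label]

def unlabeled_loss(logits):
    """Unlabeled real sample: -log of the total probability of the k real classes."""
    return -logsumexp(log_softmax(logits)[:-1])

def generated_loss(logits):
    """Generated sample: -log probability of the extra 'generated' class."""
    return -log_softmax(logits)[-1]

# Uniform logits for k = 10 real classes plus one generated class.
logits = [0.0] * 11
losses = (labeled_loss(logits, 3), unlabeled_loss(logits), generated_loss(logits))
```

In PATE-G, the labeled term would use the labels returned by the noisy teacher vote, while the unlabeled and generated terms need no teacher queries at all.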

**2. Experimental Results**

- The **MNIST model** stacks **two convolutional layers with max-pooling** and **one fully connected layer** with ReLUs. For **SVHN**, **two hidden layers are added**. (Not much information about the model architecture in the paper.)
- On MNIST and SVHN, **an ensemble of *n*=250 teachers** is trained. Their aggregated predictions are accurate despite the injection of large amounts of random noise to ensure privacy.
- The average test accuracy of **individual teachers** is **83.86% for MNIST** and **83.18% for SVHN**.
- The **aggregation mechanism** output has an accuracy of **93.18% on MNIST** and **87.79% on SVHN.**
- In the paper's results table, each row is a variant of the student model trained with GAN (**PATE-G**) in a semi-supervised way.
- The bound parameter *ε* and failure probability *δ* are for the (*ε*, *δ*) differential privacy guarantee. (I do not review this part as I want to focus on the semi-supervised learning only. But if you’re interested, please feel free to read the paper directly.)

The **MNIST student** is able to learn a **98%** accurate model with only 100 label queries, which is about 1% short of the accuracy of a model trained on the entire training set. The **SVHN student** achieves **90.66%** accuracy, which is also comparable to the 92.80% accuracy of one teacher trained on the entire training set.

## Reference

[2017 ICLR] [PATE & PATE-G]

Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

[Authors’ Presentation Slides]

## Pretraining or Weakly/Semi-Supervised Learning

**2013** [Pseudo-Label (PL)] **2015** [Ladder Network, Γ-Model] **2016** [Sajjadi NIPS’16] **2017** [Mean Teacher] [PATE & PATE-G] **2018** [WSL] **2019** [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] **2020** [BiT] [Noisy Student] [SimCLRv2]