Review — PATE & PATE-G: Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

PATE — Private Aggregation of Teacher Ensembles: Training Student Using Ensemble of Teachers

Sik-Ho Tsang
5 min read · Apr 10, 2022
PATE and PATE-G

Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
PATE & PATE-G, by Pennsylvania State University, Google Brain, & Google
2017 ICLR, Over 600 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Teacher Student, Image Classification

  • Private Aggregation of Teacher Ensembles (PATE) is proposed, in which multiple teachers are trained on disjoint subsets of the sensitive data. Because they rely directly on sensitive data, these teachers are not published.
  • Then, the student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters.
  • This is a paper from Ian Goodfellow, who invented GAN. PATE-G is further proposed, which utilizes GANs to train the student in PATE.

Outline

  1. Private Aggregation of Teacher Ensembles (PATE)
  2. Experimental Results

1. Private Aggregation of Teacher Ensembles (PATE)

Private Aggregation of Teacher Ensembles (PATE)

1.1. Data Partitioning and Teachers

Data Partitioning and Teachers
  • Instead of training a single model to solve the task associated with dataset (X, Y), where X denotes the set of inputs and Y the set of labels, the data is partitioned into n disjoint sets (X1, Y1), …, (Xn, Yn).

A teacher model is trained separately on each set (Xi, Yi), yielding n classifiers fi, called teachers (a minimal sketch appears after this list).

  • Then they are deployed as an ensemble making predictions on unseen inputs x by querying each teacher for a prediction fi(x).
  • These predictions are aggregated into a single prediction.
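
To make the partitioning and teacher-training step concrete, here is a minimal Python sketch. It assumes NumPy and scikit-learn are available and uses LogisticRegression as a stand-in for the convolutional teachers actually used in the paper; train_teachers and its arguments are illustrative names, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_teachers(X, Y, n_teachers, seed=0):
    """Partition (X, Y) into n disjoint shards and train one teacher per shard."""
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(X)), n_teachers)  # disjoint index sets
    teachers = []
    for shard in shards:
        clf = LogisticRegression(max_iter=1000)  # stand-in for a CNN teacher
        clf.fit(X[shard], Y[shard])
        teachers.append(clf)
    return teachers
```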

1.2. Aggregation

  • Let m be the number of classes in the task.
  • For an input x, the label count for class j is nj(x) = |{i : fi(x) = j}|, i.e. the number of teachers that assign class j to x.
  • Yet, if we simply apply plurality, i.e. use the label with the largest count, the ensemble’s decision may depend on a single teacher’s vote.
  • If two classes have close vote counts, the disagreement may reveal private information.

Random Laplacian noise is added to the vote counts nj to introduce ambiguity: f(x) = argmax_j { nj(x) + Lap(1/γ) }.

  • γ is a privacy parameter and Lap(b) is the Laplacian distribution with location 0 and scale b.
  • The parameter γ influences the privacy guarantee.

Intuitively, a small γ (i.e., a larger noise scale 1/γ) leads to a strong privacy guarantee, but can degrade the accuracy of the labels, as the noisy maximum f above can differ from the true plurality.
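
The noisy-max aggregation can be sketched as follows, reusing the illustrative teachers from the sketch above; noisy_aggregate is a hypothetical helper, and the Laplacian scale 1/γ matches the formula above.

```python
import numpy as np

def noisy_aggregate(teachers, x, num_classes, gamma, rng=None):
    """Return argmax_j { n_j(x) + Lap(1/gamma) } for a single input x."""
    rng = np.random.default_rng() if rng is None else rng
    votes = [int(t.predict(x.reshape(1, -1))[0]) for t in teachers]   # f_i(x)
    counts = np.bincount(votes, minlength=num_classes).astype(float)  # n_j(x)
    counts += rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))
```

Each call to this mechanism answers one label query from the student, and the total number of such queries is what the privacy analysis in the paper accounts for.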

  • The next step is to train another model, the student, using a fixed number of labels predicted by the teacher ensemble.

1.3. Knowledge from an Ensemble to a Student

  • A student is trained on non-sensitive, unlabeled data, which is labeled using the aggregation mechanism (see the sketch below).
  • This student model is the one that would be deployed.
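
Continuing the same illustrative setup, the student can be trained on a fixed budget of noisily labeled public samples; train_student, X_public, and num_queries are hypothetical names used only for this sketch, and LogisticRegression again stands in for the semi-supervised, GAN-based student described next.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_student(teachers, X_public, num_queries, num_classes, gamma, seed=0):
    """Label a fixed budget of public samples with the noisy ensemble,
    then train the student on the (input, noisy label) pairs."""
    rng = np.random.default_rng(seed)
    X_query = X_public[:num_queries]  # fixed label budget
    y_noisy = np.array([noisy_aggregate(teachers, x, num_classes, gamma, rng)
                        for x in X_query])
    student = LogisticRegression(max_iter=1000)  # the model that gets deployed
    student.fit(X_query, y_noisy)
    return student
```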

1.4. Training the student with GANs

PATE-G
  • GANs can be used to extend PATE to PATE-G.
  • The generator produces samples from the data distribution by transforming vectors sampled from a Gaussian distribution.
  • The discriminator is trained to distinguish samples artificially produced by the generator from samples drawn from the real data distribution.
  • The discriminator is extended from a binary classifier (data vs. generator sample) to a multi-class classifier (one of k classes of data samples, plus a class for generated samples).

This discriminator/classifier is then trained to classify labeled real samples in the correct class, unlabeled real samples in any of the k classes, and the generated samples in the additional class.
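
Below is a hedged PyTorch-style sketch of the three discriminator loss terms for such a (k+1)-class setup, in the spirit of the semi-supervised GAN training that PATE-G builds on; the function name and the exact loss formulation are illustrative, not taken verbatim from the paper.

```python
import torch
import torch.nn.functional as F

K = 10  # number of real classes; the discriminator outputs K + 1 logits

def discriminator_losses(logits_labeled, labels, logits_unlabeled, logits_fake):
    """Sum of the three loss terms for the (K+1)-class discriminator."""
    # Labeled real samples: classify into the correct one of the K real classes.
    loss_labeled = F.cross_entropy(logits_labeled[:, :K], labels)

    # Unlabeled real samples: push mass into *any* of the K real classes,
    # i.e. maximize log(1 - p(fake)) = log of the summed real-class probabilities.
    log_p = F.log_softmax(logits_unlabeled, dim=1)
    loss_unlabeled = -torch.logsumexp(log_p[:, :K], dim=1).mean()

    # Generated samples: classify into the additional "fake" class with index K.
    fake_targets = torch.full((logits_fake.size(0),), K,
                              dtype=torch.long, device=logits_fake.device)
    loss_fake = F.cross_entropy(logits_fake, fake_targets)

    return loss_labeled + loss_unlabeled + loss_fake
```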

2. Experimental Results

Student Accuracy on MNIST and SVHN
  • The MNIST model stacks two convolutional layers with max-pooling and one fully connected layer with ReLUs. For SVHN, two hidden layers are added. (Not much information about the model architecture is given in the paper; a hedged sketch appears after this list.)
  • On MNIST and SVHN, an ensemble of n=250 teachers is trained. Their aggregated predictions are accurate despite the injection of large amounts of random noise to ensure privacy.
  • The average test accuracy of individual teachers is 83.86% for MNIST and 83.18% for SVHN.
  • The aggregation mechanism output has an accuracy of 93.18% on MNIST and 87.79% on SVHN.
  • In the above table, each row is a variant of the student model trained with GAN (PATE-G) in a semi-supervised way.
  • The bound ε and failure probability δ refer to the (ε, δ) differential privacy guarantee. (I do not review this part as I want to focus on the semi-supervised learning only. But if you’re interested, please feel free to read the paper directly.)
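
Since the paper gives few architectural details, the following PyTorch sketch of the MNIST teacher (two convolutional layers with max-pooling and one fully connected ReLU layer) uses assumed filter and layer sizes purely for illustration.

```python
import torch.nn as nn

class MNISTTeacher(nn.Module):
    """Two conv layers with max-pooling + a fully connected ReLU layer;
    all sizes here are assumptions, the paper does not specify them."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(),       # fully connected ReLU layer
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```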

With only 100 label queries, the MNIST student learns a 98% accurate model, about 1% below the accuracy of a model learned with the entire training set. The SVHN student achieves 90.66% accuracy, comparable to the 92.80% accuracy of a model learned with the entire training set.
