Review — PATE & PATE-G: Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
PATE — Private Aggregation of Teacher Ensembles: Training Student Using Ensemble of Teachers
Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
PATE & PATE-G, by Pennsylvania State University, Google Brain, & Google
2017 ICLR, Over 600 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Teacher Student, Image Classification
- Private Aggregation of Teacher Ensembles (PATE) is proposed, in which multiple teachers are trained on disjoint datasets. Because they rely directly on sensitive data, these teachers are not published.
- Then, the student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters.
- This paper is co-authored by Ian Goodfellow, who invented the GAN. PATE-G is further proposed, which uses a GAN to train the PATE student.
Outline
- Private Aggregation of Teacher Ensembles (PATE)
- Experimental Results
1. Private Aggregation of Teacher Ensembles (PATE)
1.1. Data Partitioning and Teachers
- Instead of training a single model to solve the task associated with dataset (X, Y), where X denotes the set of inputs and Y the set of labels, the data is partitioned into n disjoint sets (Xn, Yn).
A teacher model is trained separately on each set, giving n classifiers fi called teachers.
- Then they are deployed as an ensemble making predictions on unseen inputs x by querying each teacher for a prediction fi(x).
- These predictions are aggregated into a single prediction.
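The partitioning and querying steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `partition_indices` and `teacher_votes` are hypothetical, the split is a simple random shuffle, and each teacher is assumed to be a callable that returns a class label.

```python
import numpy as np

def partition_indices(num_examples, n_teachers, seed=0):
    """Split example indices into n_teachers disjoint, near-equal subsets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_examples)
    # array_split tolerates sizes that do not divide evenly
    return np.array_split(shuffled, n_teachers)

def teacher_votes(teachers, x):
    """Query each teacher f_i (assumed callable) for its label on input x."""
    return np.array([f(x) for f in teachers])

# Hypothetical sizes: 1000 sensitive examples shared among 10 teachers
parts = partition_indices(num_examples=1000, n_teachers=10)
```

Each entry of `parts` would be used to train one teacher, so no sensitive example influences more than one teacher.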
1.2. Aggregation
- Let m be the number of classes in the task.
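- For a given class j and an input x, the label count is the number of teachers that assign class j to x: nj(x) = |{i : i ∈ {1, …, n}, fi(x) = j}|.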
- Yet, if we simply apply plurality — use the label with the largest count — the ensemble’s decision may depend on a single teacher’s vote.
- If two classes have close vote counts, the disagreement may reveal private information.
Random noise is added to the vote counts nj to introduce ambiguity: f(x) = argmax_j { nj(x) + Lap(1/γ) }.
- γ is a privacy parameter and Lap(b) is the Laplacian distribution with location 0 and scale b.
- The parameter γ influences the privacy guarantee.
Intuitively, because the noise scale is 1/γ, a small γ (more noise) leads to a stronger privacy guarantee, but can degrade the accuracy of the labels, as the noisy maximum f above can differ from the true plurality.
- The next step is to train another model, the student, using a fixed number of labels predicted by the teacher ensemble.
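A minimal sketch of the noisy aggregation described above, assuming Laplacian noise of scale 1/γ is added independently to each class count before taking the maximum. The function name `noisy_aggregate` and the example vote split are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def noisy_aggregate(votes, num_classes, gamma, rng):
    """Return the class with the largest Laplace-perturbed vote count."""
    counts = np.bincount(votes, minlength=num_classes)  # n_j(x) for each class j
    noise = rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts + noise))

rng = np.random.default_rng(0)
# Hypothetical votes from 250 teachers: 200 say class 3, 50 say class 7
votes = np.array([3] * 200 + [7] * 50)
label = noisy_aggregate(votes, num_classes=10, gamma=0.05, rng=rng)
```

With a wide margin between the top two classes, the noisy label usually matches the plurality; when the counts are close, the noise can flip the outcome, which is what hides any single teacher's vote.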
1.3. Knowledge from an Ensemble to a Student
- A student is trained on non-sensitive, unlabeled data, some of which is labeled using the aggregation mechanism.
- This student model is the one that would be deployed.
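To make the "fixed number of labels" idea concrete, here is a rough sketch of labeling part of a public pool under a query budget. It assumes the teachers' predictions on the public inputs are already collected in a matrix; `label_public_pool`, the `budget` parameter, and the use of -1 for "unlabeled" are hypothetical choices for illustration.

```python
import numpy as np

def label_public_pool(teacher_preds, num_classes, gamma, budget, seed=0):
    """Label the first `budget` public inputs by noisy voting; mark the rest -1.

    teacher_preds: int array of shape (n_teachers, n_public).
    """
    rng = np.random.default_rng(seed)
    n_public = teacher_preds.shape[1]
    labels = np.full(n_public, -1, dtype=int)  # -1 means "left unlabeled"
    for j in range(min(budget, n_public)):
        counts = np.bincount(teacher_preds[:, j], minlength=num_classes)
        noisy = counts + rng.laplace(scale=1.0 / gamma, size=num_classes)
        labels[j] = np.argmax(noisy)
    return labels
```

Only the labeled entries consume privacy budget; the remaining public inputs can still be used by the student as unlabeled data.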
1.4. Training the student with GANs
- A GAN can be used to train the student in a semi-supervised way; this variant is called PATE-G.
- The generator produces samples from the data distribution by transforming vectors sampled from a Gaussian distribution.
- The discriminator is trained to distinguish samples artificially produced by the generator from samples drawn from the real data distribution.
- The discriminator is extended from a binary classifier (data vs. generator sample) to a multi-class classifier (one of k classes of data samples, plus a class for generated samples).
This discriminator/classifier is then trained to classify labeled real samples in the correct class, unlabeled real samples in any of the k classes, and the generated samples in the additional class.
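The three-part objective for the (k+1)-class discriminator can be sketched in NumPy as below. This is an illustrative loss computation only, under the assumption that the last logit column is the "generated" class and that the three terms are summed with equal weight; the paper may weight or implement them differently.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def discriminator_loss(logits_lab, y_lab, logits_unl, logits_gen):
    """Each logits array has shape (batch, k + 1); column k is 'generated'."""
    lp_lab = log_softmax(logits_lab)
    lp_unl = log_softmax(logits_unl)
    lp_gen = log_softmax(logits_gen)
    # Labeled real samples: cross-entropy against the given class
    loss_lab = -lp_lab[np.arange(len(y_lab)), y_lab].mean()
    # Unlabeled real samples: log-probability of being in any of the k real classes
    loss_unl = -np.logaddexp.reduce(lp_unl[:, :-1], axis=1).mean()
    # Generated samples: log-probability of the extra class
    loss_gen = -lp_gen[:, -1].mean()
    return loss_lab + loss_unl + loss_gen
```

In PATE-G, the labeled real samples would be the public inputs labeled by the noisy teacher vote, and the classifier part of this discriminator serves as the student.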
2. Experimental Results
- The MNIST model stacks two convolutional layers with max-pooling and one fully connected layer with ReLUs. For SVHN, two hidden layers are added. (The paper gives little detail about the model architecture.)
- On MNIST and SVHN, an ensemble of n=250 teachers is trained. Their aggregated predictions are accurate despite the injection of large amounts of random noise to ensure privacy.
- The average test accuracy of individual teachers is 83.86% for MNIST and 83.18% for SVHN.
- The aggregation mechanism output has an accuracy of 93.18% on MNIST and 87.79% on SVHN.
- In the above table, each row is a variant of the student model trained with GAN (PATE-G) in a semi-supervised way.
- The bound ε and failure probability δ are the parameters of the (ε, δ)-differential privacy guarantee. (I do not review this part as I want to focus on the semi-supervised learning only. But if you’re interested, please feel free to read the paper directly.)
With only 100 label queries, the MNIST student reaches 98% accuracy, roughly 1% below a model trained on the entire training set. The SVHN student achieves 90.66% accuracy, which is also comparable to the 92.80% accuracy of one teacher trained on the entire training set.
Reference
[2017 ICLR] [PATE & PATE-G]
Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
[Authors’ Presentation Slides]