Review — PATE & PATE-G: Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

PATE — Private Aggregation of Teacher Ensembles: Training Student Using Ensemble of Teachers

  • Private Aggregation of Teacher Ensembles (PATE) is proposed, in which multiple teachers are trained on disjoint subsets of the sensitive data. Because they rely directly on sensitive data, these teachers are not published.
  • Then, the student learns to predict an output chosen by noisy voting among all of the teachers. It cannot directly access an individual teacher, the underlying data, or the teachers’ parameters.
  • Ian Goodfellow, who invented GANs, is one of the paper’s co-authors. PATE-G is further proposed, which uses a GAN to train the student in PATE.


  1. Private Aggregation of Teacher Ensembles (PATE)
  2. Experimental Results

1. Private Aggregation of Teacher Ensembles (PATE)

1.1. Data Partitioning and Teachers

  • Instead of training a single model to solve the task associated with dataset (X, Y), where X denotes the set of inputs and Y the set of labels, the data is partitioned into n disjoint sets (Xn, Yn), and a separate model, called a teacher, is trained on each set.
  • The n teachers are then deployed as an ensemble: for an unseen input x, each teacher i is queried for a prediction f_i(x).
  • These predictions are aggregated into a single prediction.
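
The partition-then-train step can be sketched as follows. This is only a minimal NumPy illustration, not the paper's code: `partition_indices`, `train_teachers`, `teacher_votes` and the nearest-centroid `CentroidTeacher` are hypothetical names, and the centroid classifier is just a stand-in for whatever model each teacher really is.

```python
import numpy as np

def partition_indices(num_examples, n_teachers, seed=0):
    """Shuffle example indices and split them into n disjoint, near-equal parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_examples)
    return np.array_split(shuffled, n_teachers)

class CentroidTeacher:
    """Hypothetical stand-in teacher: predicts the class of the nearest centroid."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every input to every class centroid.
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]

def train_teachers(X, y, n_teachers, seed=0):
    """Train one teacher per disjoint partition; no example is shared."""
    parts = partition_indices(len(X), n_teachers, seed)
    return [CentroidTeacher().fit(X[idx], y[idx]) for idx in parts]

def teacher_votes(teachers, X):
    """Return an (n_teachers, num_inputs) array holding each teacher's prediction."""
    return np.stack([t.predict(X) for t in teachers])
```

Because the partitions are disjoint, any single training example influences exactly one teacher, which is what makes the later vote-based privacy argument possible.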

1.2. Aggregation

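  • Let m be the number of classes in the task.
  • For a class j ∈ {1, ..., m} and an input x, the label count n_j(x) is the number of teachers that assigned class j to x: n_j(x) = |{i ∈ {1, ..., n} : f_i(x) = j}|.
  • Yet, if we simply apply plurality (use the label with the largest count), the ensemble’s decision may depend on a single teacher’s vote.
  • If two classes have close vote counts, the disagreement may reveal private information.
  • Therefore, random noise is added to the vote counts before taking the maximum: f(x) = argmax_j { n_j(x) + Lap(1/γ) }.
  • γ is a privacy parameter and Lap(b) is the Laplacian distribution with location 0 and scale b.
  • The parameter γ influences the privacy guarantee: since the noise scale is 1/γ, a smaller γ injects more noise, which strengthens privacy but makes it more likely that the noisy maximum differs from the true plurality.
  • The next step is to train another model, the student, using a fixed number of labels predicted by the teacher ensemble.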
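
The noisy vote can be sketched in a few lines of NumPy. This assumes the noisy-max form described above, i.e. independent Laplacian noise of scale 1/γ added to each class count before taking the argmax. The function names `label_counts` and `noisy_max` are mine, not the paper's.

```python
import numpy as np

def label_counts(votes, num_classes):
    """Count, for each class j, how many teachers voted for j.

    votes: 1-D integer array with one predicted label per teacher.
    """
    return np.bincount(votes, minlength=num_classes)

def noisy_max(votes, num_classes, gamma, rng):
    """Return the label with the largest count after adding Lap(1/gamma) noise."""
    counts = label_counts(votes, num_classes)
    noise = rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts + noise))

# Example: five teachers, three classes.
votes = np.array([0, 1, 1, 2, 1])
print(label_counts(votes, 3))  # [1 3 1]
rng = np.random.default_rng(0)
label = noisy_max(votes, 3, gamma=0.5, rng=rng)  # may differ from plurality
```

With a small γ the noise can easily flip the winner when counts are close, which is exactly the ambiguity that hides any single teacher's vote.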

1.3. Knowledge from an Ensemble to a Student

  • A student is trained on non-sensitive, unlabeled data, part of which is labeled using the aggregation mechanism.
  • This student model is the one that would be deployed.
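
Since every label obtained from the teachers consumes privacy budget, only a fixed number of public inputs are sent to the ensemble. A rough sketch of that labeling step, assuming the teachers' votes on the public data are already collected in an array and that the noisy vote adds Laplacian noise of scale 1/γ to each class count (`label_public_subset` and its parameters are hypothetical names):

```python
import numpy as np

def label_public_subset(public_votes, num_classes, gamma, num_queries, seed=0):
    """Label a fixed-size random subset of public inputs by noisy voting.

    public_votes: (n_teachers, num_public) integer array of teacher predictions.
    Returns the indices of the queried inputs and their noisy labels; all other
    public inputs stay unlabeled.
    """
    rng = np.random.default_rng(seed)
    num_public = public_votes.shape[1]
    queried = rng.choice(num_public, size=num_queries, replace=False)
    votes = public_votes[:, queried]  # (n_teachers, num_queries)
    # Per-query class counts, shape (num_queries, num_classes).
    counts = (votes[:, :, None] == np.arange(num_classes)).sum(axis=0)
    noisy = counts + rng.laplace(0.0, 1.0 / gamma, size=counts.shape)
    return queried, noisy.argmax(axis=1)
```

The student then trains on the returned labeled subset plus the remaining unlabeled inputs, and never sees the teachers or their training data.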

1.4. Training the student with GANs

  • A GAN can be used to train the student, which extends PATE to PATE-G.
  • The generator produces samples from the data distribution by transforming vectors sampled from a Gaussian distribution.
  • The discriminator is trained to distinguish samples artificially produced by the generator from samples drawn from the real data distribution.
  • The discriminator is extended from a binary classifier (data vs. generator sample) to a multi-class classifier (one of k classes of data samples, plus a class for generated samples). It is trained to put labeled real samples in the correct class, unlabeled real samples in any of the k classes, and generated samples in the additional class.
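
To make the (k+1)-class discriminator concrete, here is a small NumPy sketch of how its outputs can be read and how a semi-supervised loss can be assembled. It assumes the last logit is the "generated" class and uses a common cross-entropy formulation. The paper's exact objective may differ, and all function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def discriminator_outputs(logits):
    """logits: (batch, k+1); the last column is assumed to be 'generated'.

    Returns P(sample is real) and the predicted class among the k data classes.
    """
    probs = softmax(logits)
    p_real = 1.0 - probs[:, -1]
    predicted_class = probs[:, :-1].argmax(axis=1)
    return p_real, predicted_class

def discriminator_loss(labeled_logits, labels, unlabeled_logits,
                       generated_logits, eps=1e-12):
    """Assumed loss: supervised cross-entropy + real/generated terms."""
    p_lab = softmax(labeled_logits)
    # Labeled real samples should land in their correct class.
    supervised = -np.log(p_lab[np.arange(len(labels)), labels] + eps).mean()
    # Unlabeled real samples should land in any of the k data classes.
    p_real = 1.0 - softmax(unlabeled_logits)[:, -1]
    # Generated samples should land in the extra class.
    p_gen = softmax(generated_logits)[:, -1]
    unsupervised = -np.log(p_real + eps).mean() - np.log(p_gen + eps).mean()
    return supervised + unsupervised
```

At deployment only the first k outputs matter: the student's prediction is the arg max over the data classes, as returned by `discriminator_outputs`.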

2. Experimental Results

Student Accuracy on MNIST and SVHN
  • The MNIST model stacks two convolutional layers with max-pooling and one fully connected layer with ReLUs. For SVHN, two hidden layers are added. (Not much information about the model architecture in the paper.)
  • On MNIST and SVHN, an ensemble of n=250 teachers is trained. Their aggregated predictions are accurate despite the injection of large amounts of random noise to ensure privacy.
  • The average test accuracy of individual teachers is 83.86% for MNIST and 83.18% for SVHN.
  • The aggregation mechanism output has an accuracy of 93.18% on MNIST and 87.79% on SVHN.
  • In the above table, each row is a variant of the student model trained with GAN (PATE-G) in a semi-supervised way.
  • The privacy bound ε and the failure probability δ are the parameters of the (ε, δ) differential privacy guarantee. (I do not review this part as I want to focus on the semi-supervised learning only. But if you’re interested, please feel free to read the paper directly.)


