Review — Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation

BEiT-Pretrained ViT, Self-Supervised Learning by BYOL, Semi-Supervised Learning by Label Propagation (LP)

Sik-Ho Tsang
7 min read · Sep 22, 2022

Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation,
BYOL+LP, by Apple
2021 NeurIPS DCAI Workshop
Self-Supervised Learning, Semi-Supervised Learning, BEiT, BYOL, ViT

  • This paper showcases that self-supervised learning helps reduce annotation cost and increase annotation quality.
  • A unifying framework is proposed that leverages self-supervised semi-supervised learning, and it is used to construct workflows for data labeling and annotation verification tasks.


  1. Overall Workflow
  2. Self-Supervised Learning
  3. Nearest Neighbor Graph
  4. Semi-Supervised Pseudo-Labeling with Label Propagation
  5. Experimental Results

1. Overall Workflow

  • Let us assume we have a dataset of n examples X = {x1, …, xn} ∈ R^(n×d), with d being the feature dimension, and a label matrix Y = {y1, …, yn} ∈ R^(n×c), with c being the number of classes.
  • Within the label matrix Y, some rows belong to labeled data, denoted by the subscript l, and the rest belong to unlabeled data, denoted by the subscript u.

The goal is to leverage feature matrix X and known label matrix Yl to generate and improve the estimate ~Yu for the unknown label matrix Yu.

  • The general workflow consists of two parts.
  1. First, self-supervised learning is leveraged to obtain an unsupervised representation for the unlabeled data.
  2. Then, a nearest neighbor graph is constructed over data samples based on the learned representations, and soft labels are assigned based on the nearest neighbor graph.
  • Finally, the model is trained on the labeled and pseudo-labeled data, and the above process can be iterated.

2. Self-Supervised Learning

BYOL Architecture (Figure from BYOL)
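  • BYOL is used for self-supervised learning. It has an asymmetric siamese architecture, with an online encoder fθ, online projector gθ and predictor qθ on one branch, and a target encoder fξ and projector gξ on the other branch. The objective l(z, p) is the normalized mean squared error between the outputs of the two branches, which is equivalent to maximizing their cosine similarity,
  • where z is the output of the online branch (fθ followed by gθ and qθ), and p is the output of the target branch (fξ followed by gξ).
  • After training is completed, the online encoder fθ is used, with l(fθ(xi), fθ(xj)) serving as a similarity metric between xi and xj.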
  • (Please feel free to read BYOL if interested.)
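To make the BYOL objective concrete, here is a minimal NumPy sketch of the regression loss as described in the BYOL paper: the squared distance between the L2-normalized online prediction and target projection, which equals 2 − 2 × cosine similarity. The function names and the random inputs are illustrative assumptions, not code from the paper, and no network or gradient step is included.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale each row of v to unit Euclidean length."""
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

def byol_regression_loss(online_pred, target_proj):
    """Per-sample squared distance between the normalized outputs of the two branches."""
    q = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return np.sum((q - z) ** 2, axis=-1)

def cosine_similarity(a, b):
    """Row-wise cosine similarity between a and b."""
    return np.sum(l2_normalize(a) * l2_normalize(b), axis=-1)

# Stand-ins for the online predictor output and the target projector output.
rng = np.random.default_rng(0)
online_pred = rng.normal(size=(4, 8))
target_proj = rng.normal(size=(4, 8))

loss = byol_regression_loss(online_pred, target_proj)
cos = cosine_similarity(online_pred, target_proj)
# The two forms of the objective agree: loss == 2 - 2 * cos for every sample.
```

Minimizing this loss is therefore the same as maximizing cosine similarity, which is why the trained encoder's features can be compared with a cosine-style similarity afterwards.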

3. Nearest Neighbor Graph

3.1. Preliminaries of Normalized Graph Laplacian

  • (Please skip this sub-section if you know Normalized Graph Laplacian well.)
  • A graph is a node-edge representation with many applications, such as modeling social networks.
Graph (Example from Here)
  • Assume we have the above graph with 5 nodes. It has the 5×5 adjacency matrix A below:
Adjacency Matrix A (Example from Here)
  • Each 1 marks a connection/edge between 2 nodes. (Some graphs have values other than 1, which are the edge weights. This is also the case in this paper.)
  • The degree matrix D of this graph is as follows:
Degree Matrix D (Example from Here)
  • Each diagonal value is the sum of the corresponding row of A, which is the number of connections for that node.
  • Computing D−A gives the Laplacian matrix L:
Laplacian Matrix L (Example from Here)
  • Sometimes we would like to normalize it so that all the diagonal values are 1:
Normalized Graph Laplacian L (Example from Here)
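The matrices above can be reproduced in a few lines of NumPy. This is a small sketch on a hypothetical 5-node undirected graph (the edge list is made up for illustration and is not the graph from the figure), using the standard symmetric normalization D^(-1/2) L D^(-1/2).

```python
import numpy as np

# Hypothetical undirected graph with 5 nodes (not the one in the figure).
edges = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
n = 5

# Adjacency matrix A: 1 where two nodes are connected.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Degree matrix D: row sums of A on the diagonal.
deg = A.sum(axis=1)
D = np.diag(deg)

# Laplacian L = D - A.
L = D - A

# Symmetrically normalized Laplacian: D^(-1/2) L D^(-1/2).
# Requires every node to have at least one edge, which holds here.
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt
```

After normalization every diagonal entry is 1, and each off-diagonal entry for an edge (i, j) becomes −1/sqrt(deg_i · deg_j).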

In this paper, the normalization is applied not to the Laplacian matrix but to W, which plays the role of the adjacency matrix A here.

And this W is estimated using the BYOL-pretrained model.

3.2. Nearest Neighbor Graph Using Self-Supervised-Pretrained Model

  • Based on the metric l(fθ(xi), fθ(xj)), a nearest neighbor graph can be built in the form of a sparse adjacency matrix W: the entry Wij is the temperature-scaled similarity between xi and xj if j ∈ NN(i, k), and 0 otherwise,
  • where NN(i, k) denotes the index set of the k nearest neighbors of xi, and T is a temperature parameter.

The above equation generates a graph where each sample (node) i has k connections with the weight Wij while other connections are zero (no linkage).

  • The symmetrically normalized counterpart of W is given by 𝒲 = D^(-1/2) W D^(-1/2),
  • where D = diag(W 1n) is the degree matrix and 1n is an all-ones n-vector.
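As a rough illustration of this subsection, the sketch below builds a k-NN adjacency matrix from feature vectors and applies the symmetric normalization D^(-1/2) W D^(-1/2). Two things are assumptions of mine rather than details confirmed by the text: the similarity is taken to be cosine similarity, and the edge weight is taken to be exp(similarity / T), a common temperature-scaled choice. The random features stand in for encoder outputs, and a dense array is used for readability where a real implementation would use a sparse matrix.

```python
import numpy as np

def knn_adjacency(features, k, temperature):
    """W[i, j] > 0 only if j is one of the k nearest neighbors of i.

    Assumptions: cosine similarity, and weight = exp(similarity / temperature).
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor
    n = sim.shape[0]
    # Indices of the k most similar samples in each row (order within the k is arbitrary).
    nn_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nn_idx.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(sim[rows, cols] / temperature)
    return W

def symmetric_normalize(W):
    """Return D^(-1/2) W D^(-1/2) with D = diag(W 1n)."""
    degree = W.sum(axis=1)  # positive, since every row has k positive weights
    inv_sqrt = 1.0 / np.sqrt(degree)
    return inv_sqrt[:, None] * W * inv_sqrt[None, :]

# Random features standing in for encoder outputs.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 16))
W = knn_adjacency(features, k=3, temperature=0.1)
W_norm = symmetric_normalize(W)
```

Each row of W has exactly k non-zero entries, so the graph stays sparse no matter how many samples there are.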

4. Semi-Supervised Pseudo-Labeling with Label Propagation

  • It is assumed that nearby points/nodes are likely to have the same label.
  • Label Propagation (LP) can be performed on the nearest neighbor graph to propagate information from samples with known labels to samples without labels or with noisy labels as follows: ~Y(t+1) = 𝒲 Y(t),
  • where 𝒲 is the symmetrically normalized adjacency matrix just computed (the normalized counterpart of W), and t is the iteration number.

Thus, each estimated label in ~Y(t+1) is a weighted sum of the labels in Y(t), with the weights given by the nearest neighbor graph.

  • Finally, the estimated labels at iteration t+1, ~Yu(t+1), are assigned to the unlabeled data, i.e. label propagation: Yu(t+1) = ~Yu(t+1).
  • The labeled data keep their initial labels, i.e. the ground-truth labels: Yl(t+1) = Yl(0),
  • where Y(0)={y1, .., yn} is the initial label matrix,
  • and ~Y(t+1) is the soft label matrix at iteration t+1.

In brief, k-NN keeps the nearby points and zeroes out the far ones, and each soft label is assigned as the weighted sum of the labels of the k nearest points, with weights based on their similarity.
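The propagate-then-clamp loop described in this section can be sketched as follows. This is a toy illustration under my own assumptions, not the authors' code: the 6-node graph (two triangles joined by one weak edge), the edge weights, and the function name are all hypothetical, and unlabeled rows of the initial label matrix are simply set to zero.

```python
import numpy as np

def label_propagation(W_norm, Y0, labeled_mask, num_iters=20):
    """Iterate Y <- W_norm @ Y, resetting labeled rows to their known labels each step."""
    Y = Y0.astype(float)  # astype returns a copy, so Y0 is left untouched
    for _ in range(num_iters):
        Y = W_norm @ Y                      # soft labels: weighted sum of neighbor labels
        Y[labeled_mask] = Y0[labeled_mask]  # labeled samples keep their ground-truth labels
    return Y

# Hypothetical toy graph: two triangles (nodes 0-2 and 3-5) joined by one weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
d = W.sum(axis=1)
W_norm = W / np.sqrt(np.outer(d, d))  # D^(-1/2) W D^(-1/2)

# One labeled sample per class; all other rows start at zero.
Y0 = np.zeros((6, 2))
Y0[0, 0] = 1.0
Y0[5, 1] = 1.0
labeled_mask = np.array([True, False, False, False, False, True])

Y = label_propagation(W_norm, Y0, labeled_mask)
pred = Y.argmax(axis=1)  # pseudo-label for every node
```

With only two labeled nodes, every node in the first triangle ends up with class 0 and every node in the second with class 1, because the weak bridge carries very little label mass across.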

5. Experimental Results

5.1. Settings

  • A ViT encoder initialized from BEiT is used as the backbone.
  • A batch size of 64 is used, with images resized to 224×224 resolution.
  • To construct k-NN graph, k=10 and T=0.01 are used for CIFAR10, and k=15 and T=0.02 are used for CIFAR100.
  • t=20 iterations are used for LP.

5.2. k-NN Classification Performance with Learned Representations

Using BYOL+LP on a BEiT-initialized ViT, validation Top-1 accuracies of 98.45% on CIFAR10 and 89.58% on CIFAR100 are achieved.

5.3. Efficient Annotation with Active Learning

Active Learning Performance on CIFAR10 (left) and CIFAR100 (right)
  • Weighted k-NN classification based on the nearest neighbor graph is directly applied to assess the learned representation quality.
  • Red line: Results from SimSiam+AL.
  • Green line (Proposed): Using both labeled training data and the unlabeled training data with LP generated pseudo-labels to predict validation labels.
  • Simulations are performed on both datasets, starting with no training labels.
  • The human-in-the-loop annotation process is simulated by iteratively performing LP and randomly sampling data for oracle labeling.

An exponential gain in Top-1 accuracy is achieved when annotating less than 0.1% of the data in CIFAR10 and less than 1% of the data in CIFAR100.

  • Orange line: Only labeled training data to predict validation labels.
  • An ablation study is performed by performing LP only on the annotated data and the validation data.

Thus, having a reliable nearest neighbor graph allows performance to scale effectively with the amount of unlabeled data, by propagating information from labeled data across the data manifold.

5.4. Robust Classification with Noisy Labels

The Noise Level in Pseudo Label vs. Number of LP Iterations
  • Random noise is simulated in labels and then LP is performed.
  • If the proposed nearest neighbor graph faithfully captures the similarity among data, LP will aggregate and smooth out the inconsistency from noisy neighbor labels so that the neighbors with the correct label can stand out.

As illustrated above, with a noise level below 0.8, LP quickly reduces the noise level in the pseudo labels as the number of iterations increases.

At extreme noise levels such as 0.9, the consistency assumption no longer holds due to highly corrupted labels, and performing LP hurts performance.
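A deterministic toy example can show both regimes. The sketch below is my own simplification, not the paper's experiment: it uses two fully connected clusters of 10 samples instead of a k-NN graph on CIFAR, flips a fixed number of labels per cluster instead of sampling random noise, and runs LP without clamping, on the assumption that no label can be trusted when all of them may be noisy.

```python
import numpy as np

def normalized_clique_graph(num_cliques, size):
    """Block-diagonal graph of fully connected cliques, symmetrically normalized."""
    n = num_cliques * size
    W = np.zeros((n, n))
    for c in range(num_cliques):
        block = slice(c * size, (c + 1) * size)
        W[block, block] = 1.0
    np.fill_diagonal(W, 0.0)  # no self-loops
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))

def noise_after_lp(true_labels, noisy_labels, W_norm, num_classes, num_iters):
    """Fraction of wrong pseudo labels after unclamped label propagation."""
    Y = np.eye(num_classes)[noisy_labels]  # one-hot encode the noisy labels
    for _ in range(num_iters):
        Y = W_norm @ Y
    return float(np.mean(Y.argmax(axis=1) != true_labels))

size = 10
true_labels = np.repeat([0, 1], size)  # cluster 0 is class 0, cluster 1 is class 1
W_norm = normalized_clique_graph(2, size)

def flip_first(m):
    """Corrupt the first m labels of each cluster (noise level m / size)."""
    noisy = true_labels.copy()
    for c in range(2):
        noisy[c * size : c * size + m] = 1 - c
    return noisy

mild = noise_after_lp(true_labels, flip_first(3), W_norm, 2, 20)     # starts at 0.3
extreme = noise_after_lp(true_labels, flip_first(7), W_norm, 2, 20)  # starts at 0.7
```

At noise level 0.3 the correct labels form the majority in each cluster, so LP removes the noise entirely. At 0.7 the corrupted labels form the majority, and LP spreads them to every sample. This mirrors the observation above that LP helps only while the consistency assumption still holds.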

5.5. Classification under Noisy Labels

Classification Performance Under Label Noise on CIFAR10 (left) and CIFAR100 (right)
  • After obtaining smoothed pseudo labels via LP, the pseudo labels are used together with the nearest neighbor graph to perform weighted k-NN classification.
  • Weighted k-NN based on the corrupted labels is used as a baseline.

As shown in the above figure, pseudo labels obtained via LP offer more robust performance than directly using the corrupted labels on both CIFAR10 and CIFAR100.
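For reference, weighted k-NN classification can be sketched like this. The cosine similarity, the exp(similarity / T) vote weight, the function name, and the tiny 2-D dataset are all assumptions made for illustration. The label matrix may hold one-hot corrupted labels (the baseline) or soft pseudo labels produced by LP.

```python
import numpy as np

def weighted_knn_predict(train_feats, train_labels, query_feats, k, temperature):
    """Classify each query by a similarity-weighted vote of its k nearest training samples.

    train_labels has shape (n, c) and may be one-hot or soft labels.
    Assumptions: cosine similarity, vote weight = exp(similarity / temperature).
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    sim = unit(query_feats) @ unit(train_feats).T         # (m, n) similarities
    nn_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (m, k) neighbor indices
    nn_sim = np.take_along_axis(sim, nn_idx, axis=1)      # (m, k) neighbor similarities
    weights = np.exp(nn_sim / temperature)
    votes = np.einsum("mk,mkc->mc", weights, train_labels[nn_idx])
    return votes.argmax(axis=1)

# Hypothetical 2-D features: class 0 lies near the x-axis, class 1 near the y-axis.
train_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                        [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
clean_labels = np.eye(2)[[0, 0, 0, 1, 1, 1]]
corrupted_labels = np.eye(2)[[1, 0, 0, 1, 1, 1]]  # first label flipped
query_feats = np.array([[1.0, 0.05], [0.05, 1.0]])

pred_clean = weighted_knn_predict(train_feats, clean_labels, query_feats, k=3, temperature=0.1)
pred_corrupted = weighted_knn_predict(train_feats, corrupted_labels, query_feats, k=3, temperature=0.1)
```

In this toy case a single flipped label is outvoted by the other two neighbors, so both predictions agree. With heavier corruption the vote can flip, which is where smoothing the labels with LP first is meant to help.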

This paper proposes to use both self-supervised learning and semi-supervised learning to obtain high classification accuracy with a small amount (<1%) of labeled data.


[2021 NeurIPS DCAI Workshop] [BYOL+LP]
Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation

[Tutorial] [Laplacian Matrix]
