# Review — Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation

## BEiT-Pretrained ViT, Self-Supervised Learning by **BYOL**, Semi-Supervised Learning by Label Propagation (LP)

Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation, BYOL+LP, by Apple, 2021 NeurIPS DCAI Workshop (Sik-Ho Tsang @ Medium)

Self-Supervised Learning, Semi-Supervised Learning, BEiT, BYOL, ViT

- This paper showcases that self-supervised learning helps reduce annotation cost and increase annotation quality.
- **A unifying framework** is proposed by **leveraging self-supervised semi-supervised learning**, and it is used to construct workflows for data labeling and annotation verification tasks.

# Outline

1. **Overall Workflow**
2. **Self-Supervised Learning**
3. **Nearest Neighbor Graph**
4. **Semi-Supervised Pseudo-Labeling with Label Propagation**
5. **Experimental Results**

# 1. Overall Workflow

- Let's assume we have a **dataset** of *n* examples *X*={*x*1, …, *xn*} ∈ *R*^(*n*×*d*) with *d* being the **feature dimension**. And we have the **label matrix** *Y*={*y*1, …, *yn*} ∈ *R*^(*n*×*c*) with *c* being the **number of classes**.
- **Within label matrix *Y*, some** are **labeled data** denoted as *l* and **some** are **unlabeled data** denoted as *u*.

The goal is to leverage feature matrix *X* and known label matrix *Yl* to generate and improve the estimate ~*Yu* for the unknown label matrix *Yu*.
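To make the notation concrete, here is a minimal NumPy sketch of this setup. The helper name `make_label_matrix`, the convention of marking unlabeled examples with `-1`, and the toy sizes are assumptions for illustration only, not something taken from the paper.

```python
import numpy as np

def make_label_matrix(labels, num_classes):
    """Build a one-hot label matrix Y (n x c).

    Unlabeled examples are marked with -1 (an assumed convention) and get all-zero rows.
    Returns Y and a boolean mask that is True for labeled rows.
    """
    labels = np.asarray(labels)
    Y = np.zeros((labels.shape[0], num_classes))
    labeled_idx = np.flatnonzero(labels >= 0)
    Y[labeled_idx, labels[labeled_idx]] = 1.0
    labeled_mask = labels >= 0
    return Y, labeled_mask

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # n = 6 examples, d = 4 features (toy values)
labels = [0, -1, 2, -1, -1, 1]         # three labeled, three unlabeled examples
Y, labeled_mask = make_label_matrix(labels, num_classes=3)

Y_l = Y[labeled_mask]                  # known labels
Y_u = Y[~labeled_mask]                 # unknown labels, to be estimated
```

Here `Y_u` starts as all zeros; the rest of the workflow is about replacing it with a useful estimate.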

- The general workflow consists of **two parts.**
- First, **self-supervised learning** is leveraged to **obtain an unsupervised representation for the unlabeled data.**
- Then, **a nearest neighbor graph is constructed over data samples based on the learned representations, and soft labels are assigned based on the nearest neighbor graph**.
- Finally, the model is trained on the labeled + pseudo-labeled datasets, and the above process can be iterated.

# 2. Self-Supervised Learning

- **BYOL** uses an **asymmetric siamese architecture** including an **online encoder** *fθ*, **online projector** *gθ* and **predictor** *qθ* for **one branch**, and a **target encoder** *fξ* and **projector** *gξ* for the **other branch**. The objective is:

- where *p* is the output of the online branch (*fθ*, *gθ* and *qθ*), and *z* is the projection from the target branch (*fξ* and *gξ*).
- After training is completed, the online encoder *fθ* is kept, and *l*(*fθ*(*xi*), *fθ*(*xj*)) is used as a **similarity metric between *xi* and *xj***.
- (Please feel free to read BYOL if interested.)
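As a rough illustration of how such a metric behaves, below is a sketch of the standard BYOL-style loss, i.e. the squared L2 distance between unit-normalized vectors, which equals 2 − 2·cos(*p*, *z*). That this is exactly the *l* used in the paper is an assumption, and the function name `normalized_mse` is made up for this example.

```python
import numpy as np

def normalized_mse(p, z, eps=1e-12):
    """Squared L2 distance between L2-normalized vectors (assumed BYOL-style metric).

    Equals 2 - 2 * cosine_similarity(p, z), so it ranges from 0 (same direction)
    to 4 (opposite directions).
    """
    p = np.asarray(p, dtype=float)
    z = np.asarray(z, dtype=float)
    p = p / (np.linalg.norm(p, axis=-1, keepdims=True) + eps)
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    return np.sum((p - z) ** 2, axis=-1)

a = np.array([1.0, 0.0])
d_same = normalized_mse(a, a)                       # same direction
d_orthogonal = normalized_mse(a, np.array([0.0, 1.0]))
d_opposite = normalized_mse(a, -a)                  # opposite direction
```

Smaller values mean the two embeddings point in more similar directions, which is what the nearest neighbor graph in the next section relies on.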

# 3. Nearest Neighbor Graph

## 3.1. Preliminaries of Normalized Graph Laplacian

- (Please skip this sub-section if you know Normalized Graph Laplacian well.)
- A **graph** is a **node**-**edge representation**, which has many applications such as social media networks.

- Assume we have the above **graph with 5 nodes**, we will have the below **5×5 adjacency matrix** *A*:

- **Each 1** means a **connection/edge** between 2 nodes. (Some graphs have values other than 1, which are the weights of the edges. This paper also uses weighted edges.)
- The **degree matrix** *D* of this graph is as follows:

- Each value is the sum of each row, which is **the number of connections for that node.**
- By *D*−*A*, we get the **Laplacian matrix**:

- Sometimes we would like to **normalize** it such that **all the diagonal values are 1:**

- In this example, the **normalized graph Laplacian** is:

- (I also wrote a story for Normalized Graph Laplacian. Please feel free to visit if interested.)
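For readers who prefer code, here is a small NumPy sketch of these steps. The edge list is a hypothetical 5-node graph chosen for illustration; it is not the graph from the figure.

```python
import numpy as np

# Hypothetical undirected 5-node graph (NOT the one in the article's figure).
edges = [(0, 1), (0, 4), (1, 2), (1, 4), (2, 3), (3, 4)]
n = 5

# Adjacency matrix A: 1 where two nodes are connected.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Degree matrix D: row sums of A on the diagonal.
degrees = A.sum(axis=1)
D = np.diag(degrees)

# Laplacian matrix L = D - A.
L = D - A

# Symmetric normalization D^{-1/2} L D^{-1/2}, which puts 1 on every diagonal entry.
d_inv_sqrt = 1.0 / np.sqrt(degrees)
L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
```

Every row of `L` sums to zero, and the off-diagonal entries of `L_norm` become −1/√(d_i·d_j) for connected nodes *i* and *j*.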

In this paper, it is not the Laplacian matrix that gets normalized but *W*, which is similar to the adjacency matrix *A* here. And this *W* is estimated using the BYOL-pretrained model.

## 3.2. Nearest Neighbor Graph Using Self-Supervised-Pretrained Model

- Based on the metric *l*(*fθ*(*xi*), *fθ*(*xj*)), a **nearest neighbor graph can be built** in the form of a **sparse adjacency matrix** *W*:

- where *NN*(*i*, *k*) is the **index set of the *k* nearest neighbors of *xi***, and *T* is a **temperature** parameter.

The above equation generates a graph where each sample (node) *i* has *k* connections with the weight *Wij* while other connections are zero (no linkage).
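A minimal sketch of building such a graph is given below. It assumes the weight takes the common form *Wij* = exp(−*l*ij / *T*) when *j* is among the *k* nearest neighbors of *i* and 0 otherwise, and it takes *l* to be the normalized squared L2 distance between embeddings. Both of these, the exclusion of each point from its own neighbor set, and the function name `knn_graph` are assumptions for illustration.

```python
import numpy as np

def knn_graph(Z, k, T):
    """Dense array holding a sparse k-NN adjacency matrix W.

    Z: (n x d) embeddings, e.g. outputs of the pretrained encoder.
    Assumed weight: W[i, j] = exp(-dist[i, j] / T) if j is one of the k nearest
    neighbors of i, else 0.
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # Normalized squared L2 distance, 2 - 2 * cosine similarity, clipped at 0.
    dist = np.maximum(2.0 - 2.0 * (Zn @ Zn.T), 0.0)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    nn_idx = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per row
    rows = np.repeat(np.arange(Z.shape[0]), k)
    cols = nn_idx.ravel()
    W = np.zeros(dist.shape)
    W[rows, cols] = np.exp(-dist[rows, cols] / T)
    return W

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 3))     # toy embeddings
W = knn_graph(Z, k=3, T=0.1)
```

Note that `W` built this way is generally not symmetric, since *j* being a neighbor of *i* does not imply the reverse. A smaller *T* makes the weights drop off faster with distance.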

- The **symmetrically normalized counterpart of *W*** is given by:

$$\mathcal{W} = D^{-1/2} W D^{-1/2}$$

- where:

$$D = \mathrm{diag}(W \mathbf{1}_n)$$

- Here, *D* is the **degree matrix** and **1***n* is an **all-ones *n*-vector**.
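The symmetric normalization can be sketched in a few lines, assuming it is the usual *D*^(−1/2) *W* *D*^(−1/2) with *D* = diag(*W* **1***n*). The function name `sym_normalize`, the small `eps` guard and the toy matrix are illustrative assumptions.

```python
import numpy as np

def sym_normalize(W, eps=1e-12):
    """Return D^{-1/2} W D^{-1/2}, where D = diag(W 1_n) holds the row sums of W."""
    deg = W @ np.ones(W.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(deg + eps)   # eps guards against isolated nodes
    return d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Toy weighted graph: node 1 is connected to nodes 0 and 2 with weight 2.
W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
W_norm = sym_normalize(W)
```

In this toy case the degrees are 2, 4 and 2, so each nonzero entry becomes 2/√(2·4) ≈ 0.707.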

# 4. Semi-Supervised Pseudo-Labeling with Label Propagation

- It is assumed that **nearby points/nodes** are likely to have **the same label**.
- **Label Propagation (LP)** can be **performed on the nearest neighbor graph** to **propagate information from samples with known labels** to samples without labels or with noisy labels as follows:

- where *W* is the **symmetrically normalized nearest neighbor graph matrix** computed above, and *t* is the iteration number.

Thus, each estimated label ~*Y*(*t*+1) is a weighted sum of other labels from *Y*(*t*) using the nearest neighbor graph.

- Finally, **the estimated labels at iteration *t*+1, ~*Yu*(*t*+1), are assigned to unlabeled data**, i.e. label propagation.
- For **labeled data**, they are assigned as *Yl*(0), i.e. the **ground-truth labels**.

- where *Y*(0)={*y*1, .., *yn*} is the **initial label matrix**.
- **~*Y*(*t*+1)** is the **soft label matrix at iteration *t*+1**.

In brief, *k*-NN keeps the near points and zeroes out the far points, and the soft labels are assigned as the weighted sum of the *k* near points based on distances.

- (The above label propagation is different from the Label Propagation in 2019 CVPR.)
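The propagation step described in this section can be sketched as below, assuming the update is ~*Y*(*t*+1) = 𝒲 ~*Y*(*t*) followed by resetting the labeled rows to *Yl*(0). The function name `label_propagation` and the 4-node chain graph are hypothetical, chosen so the result is easy to follow by hand.

```python
import numpy as np

def label_propagation(W_norm, Y0, labeled_mask, num_iters=20):
    """Iterate Y <- W_norm @ Y, resetting labeled rows to their initial labels each step."""
    Y = Y0.astype(float).copy()
    for _ in range(num_iters):
        Y = W_norm @ Y                        # weighted sum of neighbor labels
        Y[labeled_mask] = Y0[labeled_mask]    # labeled data keep their given labels
    return Y

# Toy chain graph 0 - 1 - 2 - 3 (hypothetical), with only the two end nodes labeled.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
W_norm = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

Y0 = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)   # node 0: class 0, node 3: class 1
labeled_mask = np.array([True, False, False, True])

Y_soft = label_propagation(W_norm, Y0, labeled_mask, num_iters=20)
pseudo_labels = Y_soft.argmax(axis=1)
```

In this toy example node 1 ends up closer to class 0 and node 2 closer to class 1, since each one sits next to a different labeled end of the chain.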

# 5. Experimental Results

## 5.1. Settings

- A **ViT** **encoder initialized by** **BEiT** is used, with a **batch size of 64** and images resized to **224×224 resolution**.
- To construct the *k*-NN graph, *k*=10 and *T*=0.01 are used for **CIFAR10**, and *k*=15 and *T*=0.02 are used for **CIFAR100**. *t*=20 iterations are used for LP.

## 5.2. k-NN Classification Performance with Learned Representations

By using BYOL+LP on a BEiT-initialized ViT, 98.45% validation Top-1 accuracy on CIFAR10 and 89.58% validation Top-1 accuracy on CIFAR100 are achieved.

## 5.3. Efficient Annotation with Active Learning

- **Weighted *k*-NN classification based on the nearest neighbor graph** is directly applied to assess the learned representation quality.
- **Red line**: Results from **SimSiam+AL**.
- **Green line (Proposed)**: Using both **labeled training data** and the **unlabeled training data with LP generated pseudo-labels** to predict validation labels.
- Simulation is performed using both datasets, starting with no training labels.
- **The human-in-the-loop annotation process is simulated by iteratively performing LP** and randomly sampling data for oracle labeling.

Exponential gain is achieved in Top-1 accuracy when annotating fewer than 0.1% of the data in CIFAR10 and fewer than 1% of the data in CIFAR100.

- **Orange line: Only labeled training data** is used to predict validation labels.
- An **ablation study** is performed by running LP only on the annotated data and the validation data.

Thus, having a reliable nearest neighbor graph makes it possible to effectively scale performance with the amount of unlabeled data by propagating information from labeled data across the data manifold.
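The simulated annotation loop in this section (random oracle labeling followed by LP, repeated) might look roughly like the sketch below. The function name `simulate_annotation`, the batch size, and the toy graph made of two disconnected 5-node cliques are all assumptions for illustration. The paper's own experiments use CIFAR10 and CIFAR100, not this toy graph.

```python
import numpy as np

def simulate_annotation(W_norm, true_labels, num_classes, batch_size, rounds,
                        lp_iters=20, seed=0):
    """Start with no labels; each round, label a random batch via the oracle, then run LP."""
    rng = np.random.default_rng(seed)
    n = true_labels.shape[0]
    labeled = np.zeros(n, dtype=bool)
    Y0 = np.zeros((n, num_classes))
    history = []
    for _ in range(rounds):
        # Oracle labeling: reveal the true label of randomly sampled unlabeled points.
        pool = np.flatnonzero(~labeled)
        picked = rng.choice(pool, size=min(batch_size, pool.size), replace=False)
        labeled[picked] = True
        Y0[picked, true_labels[picked]] = 1.0
        # Label propagation with labeled rows reset to their given labels.
        Y = Y0.copy()
        for _ in range(lp_iters):
            Y = W_norm @ Y
            Y[labeled] = Y0[labeled]
        acc = float((Y.argmax(axis=1) == true_labels).mean())
        history.append((int(labeled.sum()), acc))
    return history

# Toy graph: two disconnected fully connected groups, one per class.
n_per = 5
block = np.ones((n_per, n_per)) - np.eye(n_per)
zeros = np.zeros((n_per, n_per))
W = np.block([[block, zeros], [zeros, block]])
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
W_norm = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
true_labels = np.array([0] * n_per + [1] * n_per)

history = simulate_annotation(W_norm, true_labels, num_classes=2, batch_size=2, rounds=3)
```

Each entry of `history` is a pair (number of labeled points, accuracy of the propagated labels). On a graph this clean, a single label per group is enough for LP to label the whole group, which is the intuition behind the large early gains reported in the paper.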

## 5.4. Robust Classification with Noisy Labels

- **Random noise is simulated in labels** and then LP is performed.
- If the proposed nearest neighbor graph faithfully captures the similarity among data, **LP will aggregate and smooth out the inconsistency from noisy neighbor labels** so that the neighbors with the correct label can stand out.

As illustrated above, with a noise level below 0.8, LP quickly reduces the noise level in pseudo labels with an increasing number of iterations.

At extreme noise levels such as 0.9, the consistency assumption no longer holds due to highly corrupted labels, and performing LP hurts performance.
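One simple way to simulate label noise at a given level is sketched below. The paper's exact noise model is not described in this review, so flipping a fixed fraction of labels to a different, uniformly chosen class is an assumption, as is the function name `corrupt_labels`.

```python
import numpy as np

def corrupt_labels(labels, noise_level, num_classes, seed=0):
    """Flip a fraction `noise_level` of labels to a different class (assumed noise model)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_level * labels.shape[0]))
    flip_idx = rng.choice(labels.shape[0], size=n_flip, replace=False)
    # Adding an offset in [1, num_classes - 1] modulo num_classes always changes the class.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy

clean = np.arange(100) % 10                        # 100 toy labels over 10 classes
noisy = corrupt_labels(clean, noise_level=0.3, num_classes=10)
observed_noise = float((noisy != clean).mean())    # fraction of labels that changed
```

The corrupted labels can then be one-hot encoded and passed through LP, and the fraction of wrong pseudo labels compared before and after propagation.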

## 5.5. Classification under Noisy Label

- After obtaining smoothed pseudo labels via LP, the pseudo labels and the nearest neighbor graph are used to perform **weighted *k*-NN classification**.
- Weighted *k*-NN based on the corrupted labels is used as a baseline.

As shown in the above figure, pseudo labels obtained via LP offer more robust performance than directly using the corrupted labels on both CIFAR10 and CIFAR100.
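Weighted *k*-NN classification on top of the graph can be sketched as a single matrix product. The function name `weighted_knn_predict` and the toy numbers are hypothetical. The only difference between the LP variant and the baseline is which label matrix is passed in (smoothed pseudo labels versus one-hot corrupted labels).

```python
import numpy as np

def weighted_knn_predict(W_query, Y_train):
    """Predict a class for each query point from neighbor-weighted label votes.

    W_query: (m x n) weights from m query points to n training points,
             zero for training points that are not among the k nearest neighbors.
    Y_train: (n x c) label matrix, either soft pseudo labels or one-hot labels.
    """
    scores = W_query @ Y_train      # weighted sum of neighbor labels per class
    return scores.argmax(axis=1)

# Toy example: 2 query points, 3 training points, 2 classes.
W_query = np.array([[0.9, 0.1, 0.0],
                    [0.0, 0.2, 0.7]])
Y_soft = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.2, 0.8]])
predictions = weighted_knn_predict(W_query, Y_soft)
```

The first query is dominated by a class-0 neighbor and the second by class-1 neighbors, so the predictions follow the heavier votes.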

This paper proposes to use both self-supervised learning and semi-supervised learning to get a high classification accuracy with a small amount (<1%) of labeled data.

## References

[2021 NeurIPS DCAI Workshop] [BYOL+LP]

Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation

[Tutorial] [Laplacian Matrix]

https://www.sharetechnote.com/html/Handbook_EngMath_GraphTheory_LaplacianMatrix.html
