Oct 23, 2021

# Review — FaceNet: A Unified Embedding for Face Recognition and Clustering

## Using Triplet Loss for Contrastive Learning

In this story, **FaceNet: A Unified Embedding for Face Recognition and Clustering**, by Google, is reviewed. In this paper:

**A mapping from face images to a compact Euclidean space** is learned, where distances directly correspond to a measure of face similarity.

This is a paper in **2015 CVPR** with over **8900 citations**. (Sik-Ho Tsang @ Medium) The triplet loss is a form of **contrastive learning**, which is a useful concept in **self-supervised learning**.

# Outline

1. **FaceNet Framework**
2. **Triplet Loss**
3. **Network Architecture**
4. **Experimental Results**

# 1. FaceNet Framework

- The input batch is a batch of face images. With **an image** *x*, **an embedding** *f*(*x*) in a feature space is obtained.
- The deep architecture is either a modified ZFNet or GoogLeNet / Inception-v1, which will be described in more detail in Section 3.

The network is trained end-to-end with the **triplet loss**, such that the **squared distance** between all faces of the **same identity** is **small**, independent of imaging conditions, whereas the squared distance between a pair of face images from **different identities** is **large**.
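As a rough sketch of this idea (a random projection stands in for the deep network *f*(*x*) here, purely for illustration), face verification reduces to thresholding a squared Euclidean distance between unit-norm embeddings:

```python
import numpy as np

def embed(image, dim=128):
    """Stand-in for the deep network f(x): a fixed random projection,
    L2-normalized so the embedding lives on the unit hypersphere."""
    rng = np.random.default_rng(0)                # fixed "weights" for the sketch
    W = rng.standard_normal((dim, image.size))
    e = W @ image.ravel()
    return e / np.linalg.norm(e)                  # ||f(x)||_2 = 1

def squared_distance(a, b):
    """Squared L2 distance, the similarity measure used by FaceNet."""
    return float(np.sum((a - b) ** 2))

img1 = np.random.rand(160, 160, 3)                # dummy face crops
img2 = np.random.rand(160, 160, 3)
d = squared_distance(embed(img1), embed(img2))
# Unit-norm embeddings keep d in [0, 4]; thresholding d decides same / different identity.
```

The real *f*(*x*) is one of the deep CNNs in Section 3; only the normalization and the distance computation above match the paper.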

# 2. Triplet Loss

## 2.1. Loss Function

- The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
- **The embedding** *f*(*x*) is constrained to live on the ***d*-dimensional hypersphere**, i.e. **||*f*(*x*)||₂ = 1**.
- As shown above, **an image** *xᵃᵢ* (**anchor**) of a specific person is closer to all other images *xᵖᵢ* (**positive**) of the **same person** than it is to any **image** *xⁿᵢ* (**negative**) of **any other person**. Thus, we want (eq.(1)):
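Written out, the constraint referred to as eq.(1) in the paper is:

```latex
\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha
  < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2,
\qquad \forall \left( f(x_i^a),\, f(x_i^p),\, f(x_i^n) \right) \in T
```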

- where *α* = 0.2 is a **margin** that is enforced between positive and negative pairs. *T* is the **set of all possible triplets** in the training set and has **cardinality** *N*.
- **The loss** *L* that is being minimized is:
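In LaTeX form, this hinge-style loss over all *N* triplets is:

```latex
L = \sum_{i}^{N} \Big[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2
      - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \Big]_{+}
```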

## 2.2. Triplet Selection

- Generating all possible triplets would result in **many triplets** that **do not contribute to the training** and result in **slower convergence**.
- It is crucial to **select hard triplets** that are active and can therefore **contribute to improving the model**.
- In order to ensure fast convergence, it is crucial to **select triplets that violate the triplet constraint (eq.(1))**.

These negative exemplars are called **semi-hard**, as they are **further away from the anchor than the positive exemplar**, but still hard because the squared distance is **close to the anchor-positive distance**. Those negatives **lie inside the margin**.
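A minimal numpy sketch of this selection rule (the helper names and the toy 2-D embeddings are made up for illustration; the paper applies the rule within each mini-batch):

```python
import numpy as np

ALPHA = 0.2  # margin from the paper

def sq_dist(a, b):
    return np.sum((a - b) ** 2, axis=-1)

def pick_semi_hard_negative(anchor, positive, negatives, alpha=ALPHA):
    """Return the index of a semi-hard negative: further from the anchor
    than the positive (d_ap < d_an) but still inside the margin
    (d_an < d_ap + alpha), or None if no candidate qualifies."""
    d_ap = sq_dist(anchor, positive)
    d_an = sq_dist(anchor, negatives)          # one distance per candidate
    idx = np.where((d_an > d_ap) & (d_an < d_ap + alpha))[0]
    return int(idx[0]) if idx.size else None

def triplet_loss(anchor, positive, negative, alpha=ALPHA):
    """Hinge-style triplet loss for a single triplet."""
    return float(max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + alpha, 0.0))

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])                # same identity, d_ap = 0.02
negatives = np.array([[-1.0, 0.0],             # too easy: d_an = 4.0, outside the margin
                      [ 0.7, 0.0],             # semi-hard: d_an = 0.09, inside the margin
                      [ 0.95, 0.0]])           # too hard: closer than the positive
idx = pick_semi_hard_negative(anchor, positive, negatives)   # -> 1
loss = triplet_loss(anchor, positive, negatives[idx])        # 0.02 - 0.09 + 0.2 = 0.13
```

Only the semi-hard candidate produces a triplet that is active (positive loss) yet not dominated by label noise, which is exactly why this regime converges well.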

# 3. Network Architecture

## 3.1. NN1 (Modified ZFNet)

- The above table shows the structure of the proposed ZFNet-based model with 1×1 convolutions.
- It has a total of **140 million parameters** and requires around **1.6 billion FLOPS per image**.

## 3.2. NN2, NN3, NN4, NNS1, NNS2 (GoogLeNet / Inception-v1)

- The models based on GoogLeNet / Inception-v1 have **20× fewer parameters** (around 6.6M-7.5M) and up to **5× fewer FLOPS** (between 500M-1.6B).
- **NN3** is identical in architecture but has a reduced input size of 160×160.
- **NN4** has an input size of only 96×96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2).
- **NNS1 (mini-Inception)** has 26M parameters and only requires 220M FLOPS per image.
- **NNS2 (tiny-Inception)** has 4.3M parameters and 20M FLOPS.
- (Certainly, the better the network, the better the performance. But the triplet loss is the main point of this paper.)

# 4. Experimental Results

## 4.1. Effect of CNN Model

- **GoogLeNet / Inception-v1 based models**, such as NN3, still achieve **good performance** while **significantly reducing both the FLOPS and the model size**.
- While the largest model achieves a dramatic improvement in accuracy compared to the **tiny NNS2**, the latter can be run at **30ms / image** on a mobile phone.
- The sharp drop in the ROC for FAR < 10⁻⁴ indicates noisy labels in the test data groundtruth.

## 4.2. Sensitivity to Image Quality

- The network is surprisingly robust with respect to JPEG compression and performs very well down to a JPEG quality of 20.
- **The performance drop is very small** for face thumbnails down to a size of **120×120** pixels, and even at **80×80** pixels it shows **acceptable performance**.

## 4.3. Embedding Dimensionality

- Features with **128 dimensions** are used, as they gave the best results.

## 4.4. Amount of Training Data

- Compared to training with only millions of images, the **relative reduction in error is 60%**. **A clear boost of accuracy** is observed.

## 4.5. Performance on LFW Dataset

- The above figure shows all pairs of images that were
**incorrectly classified on LFW**.

A classification accuracy of **98.87%±0.15** is achieved when **using the fixed center crop**, and a record-breaking **99.63%±0.09** standard error of the mean when **using the extra face alignment**. This reduces the error reported for DeepFace by more than a factor of 7 and the previous state-of-the-art reported for DeepId2+ by 30%.

- This is the performance of model
**NN1**, but even the much smaller NN3 achieves performance that is not statistically significantly different.

## 4.6. Performance on YouTube Faces DB Dataset

- YouTube Faces DB is a face video dataset.

A classification accuracy of **95.12%±0.39** is achieved.

- Using the **first one thousand frames** results in **95.18%**.

Compared to **DeepFace (91.4%)**, which also evaluates one hundred frames per video, FaceNet reduces the error rate by almost half. **DeepId2+** achieved **93.2%**, and FaceNet reduces this error by 30%, comparable to the improvement on LFW.

## 4.7. Face Clustering

- The compact embedding lends itself to being used to **cluster a user's personal photos into groups of people with the same identity**.
- It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.
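A toy sketch of this use case (the greedy single-linkage grouping and the 3-D stand-in embeddings below are my own simplification; the paper applies agglomerative clustering to the real 128-D embeddings):

```python
import numpy as np

def cluster_faces(embeddings, threshold=1.0):
    """Greedy single-linkage clustering via union-find: two faces join the
    same cluster when their squared embedding distance is below `threshold`.
    (A simplified stand-in for the agglomerative clustering in the paper.)"""
    n = len(embeddings)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.sum((embeddings[i] - embeddings[j]) ** 2) < threshold:
                parent[find(i)] = find(j)       # merge the two clusters
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Made-up unit embeddings: photos 0,1 of one person; photos 2,3 of another.
embs = np.array([[1.0, 0.0, 0.0],
                 [0.96, 0.28, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.96, 0.28]])
groups = cluster_faces(embs, threshold=0.5)    # -> two clusters: {0,1} and {2,3}
```

Because embedding distance directly encodes identity similarity, a single distance threshold is enough to separate the two identities here.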

## 4.8. Harmonic Embedding

- There is **a set of embeddings** that are generated by **different models v1 and v2** but are compatible in the sense that they can be compared to each other. This compatibility greatly simplifies upgrade paths.
- In a scenario where embedding v1 was computed across a large set of images and a new embedding model v2 is being rolled out, this compatibility ensures a smooth transition without the need to worry about version incompatibilities.
- The vast majority of v2 embeddings may be embedded near the corresponding v1 embedding; however, incorrectly placed v1 embeddings can be perturbed slightly such that their new location in embedding space improves verification accuracy.
- (This is just a conceptual idea in Appendix in the paper.)

The important part to me is the **triplet loss**, which is a useful concept in **contrastive learning**. Hope I can write about DeepID2 and DeepFace later on.

## Reference

[2015 CVPR] [FaceNet]

FaceNet: A Unified Embedding for Face Recognition and Clustering

## Face Recognition

**2005** [Chopra CVPR’05] **2015** [FaceNet]