Review — FaceNet: A Unified Embedding for Face Recognition and Clustering

Using Triplet Loss for Contrastive Learning

Sik-Ho Tsang
6 min read · Oct 23, 2021
Triplet Loss (Figure from https://conglang.github.io/2018/07/31/essay-facenet/)

In this story, FaceNet: A Unified Embedding for Face Recognition and Clustering, by Google, is reviewed. In this paper:

  • A mapping from face images to a compact Euclidean space is learned where distances directly correspond to a measure of face similarity.

This is a paper in 2015 CVPR with over 8900 citations. (Sik-Ho Tsang @ Medium) The triplet loss is a form of contrastive learning, which is a useful concept in self-supervised learning.

Outline

  1. FaceNet Framework
  2. Triplet Loss
  3. Network Architecture
  4. Experimental Results

1. FaceNet Framework

FaceNet: Framework
  • The input batch is a batch of face images. For each image x, an embedding f(x) in a feature space is obtained.
  • The deep architecture is either a modified ZFNet or GoogLeNet / Inception-v1, which will be discussed further in Section 3.

The triplet loss is used to train the network end-to-end, such that the squared distance between all faces of the same identity, independent of imaging conditions, is small, whereas the squared distance between a pair of face images from different identities is large. A toy sketch of such an embedding network is shown below.
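As a concrete illustration only, here is a minimal PyTorch sketch of an embedding network f(x). The tiny backbone is a placeholder of my own, not the paper's modified ZFNet or Inception architecture; the 128-d output is L2-normalized, as required by the triplet loss in Section 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Toy stand-in for the deep architecture: a CNN backbone followed by
    a projection to a 128-d, L2-normalized embedding f(x)."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        # Placeholder backbone for illustration; the paper uses a modified
        # ZFNet or GoogLeNet / Inception-v1 here instead.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embedding_dim)

    def forward(self, x):
        h = self.backbone(x).flatten(1)
        e = self.fc(h)
        # Constrain the embedding to the unit hypersphere: ||f(x)||2 = 1.
        return F.normalize(e, p=2, dim=1)
```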

2. Triplet Loss

Triplet Loss

2.1. Loss Function

  • The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
  • The embedding f(x) is constrained to live on the d-dimensional hypersphere, i.e. ||f(x)||₂ = 1.
  • As shown above, an image x_i^a (anchor) of a specific person should be closer to all other images x_i^p (positive) of the same person than it is to any image x_i^n (negative) of any other person. Thus, we want (eq.(1)):

||f(x_i^a) − f(x_i^p)||₂² + α < ||f(x_i^a) − f(x_i^n)||₂², for all (x_i^a, x_i^p, x_i^n) ∈ T

  • where α = 0.2 is a margin that is enforced between positive and negative pairs. T is the set of all possible triplets in the training set and has cardinality N.
  • The loss L that is being minimized is the sum of hinge terms over all triplets:

L = Σ_i [ ||f(x_i^a) − f(x_i^p)||₂² − ||f(x_i^a) − f(x_i^n)||₂² + α ]₊
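A minimal sketch of this loss in PyTorch (my own illustration for clarity; the function and variable names are mine, not from the paper):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge on the margin between anchor-positive and anchor-negative
    squared distances, as in eq.(1); only violating triplets contribute."""
    # anchor, positive, negative: (N, 128) L2-normalized embeddings f(x).
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # ||f(xa) - f(xp)||^2
    d_an = (anchor - negative).pow(2).sum(dim=1)  # ||f(xa) - f(xn)||^2
    return F.relu(d_ap - d_an + alpha).sum()
```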

2.2. Triplet Selection

  • Generating all possible triplets would result in many triplets that do not contribute to the training and would slow down convergence.
  • It is crucial to select hard triplets that are active and can therefore contribute to improving the model.
  • In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint (eq.(1)).

In practice, rather than picking the hardest negatives, negatives are selected within a mini-batch such that the anchor-negative squared distance is larger than the anchor-positive squared distance. These negative exemplars are called semi-hard: they are further away from the anchor than the positive exemplar, but still hard because their squared distance is close to the anchor-positive distance. Those negatives lie inside the margin α.
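A simplified sketch of how such semi-hard negatives could be mined within a mini-batch, in PyTorch (my own illustration; the paper mines triplets online over large mini-batches, and the helper name here is hypothetical):

```python
import torch

def semi_hard_triplets(embeddings, labels, alpha=0.2):
    """Return (anchor, positive, negative) index triplets from a mini-batch,
    where the negative is semi-hard: d_ap < d_an < d_ap + alpha."""
    # Pairwise squared Euclidean distances between L2-normalized embeddings.
    dist = torch.cdist(embeddings, embeddings).pow(2)
    triplets = []
    for a in range(len(labels)):
        for p in torch.where(labels == labels[a])[0]:
            if p == a:
                continue
            d_ap = dist[a, p]
            neg_mask = labels != labels[a]
            # Semi-hard: further than the positive, but inside the margin.
            semi_hard = neg_mask & (dist[a] > d_ap) & (dist[a] < d_ap + alpha)
            candidates = torch.where(semi_hard)[0]
            if len(candidates) > 0:
                n = candidates[torch.randint(len(candidates), (1,))].item()
                triplets.append((a, p.item(), n))
    return triplets
```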

Same and different persons in different pose and illumination combinations (the larger the number, the more dissimilar the faces)

3. Network Architecture

3.1. NN1 (Modified ZFNet)

NN1 (Modified ZFNet)
  • The above table shows the structure of the proposed ZFNet-based model with 1×1 convolutions.
  • It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.

3.2. NN2, NN3, NN4, NNS1, NNS2 (GoogLeNet / Inception-v1)

NN2 (GoogLeNet / Inception-v1)
  • The models based on GoogLeNet / Inception-v1 have 20× fewer parameters (around 6.6M-7.5M) and up to 5× fewer FLOPS (between 500M-1.6B).
  • NN3 is identical in architecture but has a reduced input size of 160×160.
  • NN4 has an input size of only 96×96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2).
  • NNS1 (mini-Inception) has 26M parameters and only requires 220M FLOPS per image.
  • NNS2 (tiny-Inception) has 4.3M parameters and 20M FLOPS.
  • (Certainly, the better the network, the better the performance. But the triplet loss is the main point of this paper.)

4. Experimental Results

4.1. Effect of CNN Model

ROC for the four different models
Mean validation rate VAL at 10^-3 false accept rate (FAR)
  • GoogLeNet / Inception-v1 based models, such as NN3, still achieve good performance while significantly reducing both the FLOPS and the model size.
  • While the largest model achieves a dramatic improvement in accuracy compared to the tiny NNS2, the latter can run at 30 ms per image on a mobile phone.
  • The sharp drop in the ROC for FAR < 10^-4 indicates noisy labels in the test data groundtruth.
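For reference, the VAL/FAR numbers above are computed over face pairs at a fixed squared-distance threshold d: VAL is the fraction of same-identity pairs accepted (true accepts), and FAR is the fraction of different-identity pairs wrongly accepted (false accepts). A tiny sketch, assuming precomputed pair distances and same/different labels (names are mine):

```python
import numpy as np

def val_far(sq_distances, is_same, threshold):
    """VAL(d): fraction of same-identity pairs with squared distance <= d.
       FAR(d): fraction of different-identity pairs with squared distance <= d."""
    sq_distances = np.asarray(sq_distances)
    is_same = np.asarray(is_same, dtype=bool)
    accept = sq_distances <= threshold
    val = accept[is_same].mean()    # true accept rate over same-identity pairs
    far = accept[~is_same].mean()   # false accept rate over different-identity pairs
    return val, far
```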

4.2. Sensitivity to Image Quality

Effect of Image Quality and Image Size
  • The network is surprisingly robust with respect to JPEG compression and performs very well down to a JPEG quality of 20.
  • The performance drop is very small for face thumbnails down to a size of 120×120 pixels and even at 80×80 pixels it shows acceptable performance.

4.3. Embedding Dimensionality

NN1 on the hold-out set
  • Features with 128 dimensions are used, as they gave the best results.

4.4. Amount of Training Data

  • Using tens of millions of training exemplars gives a clear boost of accuracy; compared to only millions of images, the relative reduction in error is 60%.

4.5. Performance on LFW Dataset

Failure Cases on LFW
  • The above figure shows all pairs of images that were incorrectly classified on LFW.

A classification accuracy of 98.87% ± 0.15 is achieved when using the fixed center crop, and a record-breaking 99.63% ± 0.09 (standard error of the mean) when using the extra face alignment.

This reduces the error reported for DeepFace by more than a factor of 7 and the previous state-of-the-art reported for DeepId2+ by 30%.

  • This is the performance of model NN1, but even the much smaller NN3 achieves performance that is not statistically significantly different.

4.6. Performance on YouTube Faces DB Dataset

  • YouTube Faces DB is a face video dataset.

A classification accuracy of 95.12% ± 0.39 is achieved.

  • Using the first one thousand frames results in 95.18%.

Compared to DeepFace's 91.4%, which also evaluates one hundred frames per video, FaceNet reduces the error rate by almost half.

DeepId2+ achieved 93.2%, and FaceNet reduces this error by 30%, comparable to the improvement on LFW.

4.7. Face Clustering

One Example of Cluster for Face Clustering
  • The compact embedding lends itself to clustering a user's personal photos into groups of people with the same identity (a small sketch follows this list).
  • It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.
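The paper uses agglomerative clustering on the embeddings; the sketch below shows what this could look like with scikit-learn, where the specific linkage and distance threshold are my own assumptions rather than values from the paper:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_faces(embeddings, distance_threshold=1.0):
    """Group L2-normalized face embeddings by identity: faces whose
    embedding distance stays below the threshold end up in one cluster."""
    clustering = AgglomerativeClustering(
        n_clusters=None,                       # let the distance threshold decide
        distance_threshold=distance_threshold, # assumed value, tune per embedding
        linkage="average",                     # average Euclidean distance between clusters
    )
    return clustering.fit_predict(np.asarray(embeddings))
```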

4.8. Harmonic Embedding

Harmonic Embedding Space
  • There is a set of embeddings that are generated by different models v1 and v2 but are compatible in the sense that they can be compared to each other. This compatibility greatly simplifies upgrade paths.
  • In a scenario where embedding v1 was computed across a large set of images and a new embedding model v2 is being rolled out, this compatibility ensures a smooth transition without the need to worry about version incompatibilities.
  • The vast majority of v2 embeddings may be embedded near the corresponding v1 embeddings; however, incorrectly placed v1 embeddings can be perturbed slightly such that their new locations in embedding space improve verification accuracy.
  • (This is just a conceptual idea in Appendix in the paper.)

The important part to me is the triplet loss, which is useful for learning about contrastive learning. I hope to write about DeepID2 and DeepFace later on.

Reference

[2015 CVPR] [FaceNet]
FaceNet: A Unified Embedding for Face Recognition and Clustering

Face Recognition

2005 [Chopra CVPR’05] 2015 [FaceNet]
