Review — DeepFace: Closing the Gap to Human-Level Performance in Face Verification

DeepFace for Face Verification After Face Alignment

In this story, DeepFace: Closing the Gap to Human-Level Performance in Face Verification (DeepFace), by Facebook AI Research and Tel Aviv University, is briefly reviewed. In this paper:

  • The face image undergoes 2D & 3D face alignment.
  • Then the aligned face is input to the DeepFace network for face verification.

This is a paper in 2014 CVPR with over 5900 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Face Alignment
  2. DeepFace: Network Architecture
  3. Experimental Results

1. Face Alignment

Alignment Pipeline from (a) to (h)
  • (a): The detected face, with 6 initial fiducial points.
  • (b): The induced 2D-aligned crop.
  • (c): 67 fiducial points on the 2D-aligned crop with their corresponding Delaunay triangulation; triangles are added on the contour to avoid discontinuities.
  • (d): The reference 3D shape transformed to the 2D-aligned crop image-plane.
  • (e): Triangle visibility w.r.t. the fitted 3D-2D camera; darker triangles are less visible.
  • (f): The 67 fiducial points induced by the 3D model that are used to direct the piece-wise affine warping.
  • (g): The final frontalized crop.
  • (h): A new view generated by the 3D model (not used in this paper).

Analytical 3D modeling of the face based on fiducial points is proposed, and it is used to warp a detected facial crop into a frontalized 3D view.
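The frontalization warp is piece-wise affine: each Delaunay triangle of fiducial points is mapped to its counterpart on the reference shape by its own affine transform. Here is a minimal NumPy sketch of solving one such per-triangle transform (illustrative code, not the paper's implementation):

```python
import numpy as np

def affine_from_triangles(src_tri, dst_tri):
    """Solve for the 2x3 affine transform A mapping src_tri -> dst_tri.

    src_tri, dst_tri: (3, 2) arrays of triangle vertex coordinates.
    Each vertex correspondence contributes two linear equations in the
    six unknowns of A, so three vertices determine A exactly.
    """
    # Design matrix: homogeneous coordinates [x, y, 1] per source vertex.
    X = np.hstack([src_tri, np.ones((3, 1))])         # (3, 3)
    # Solve X @ A.T = dst_tri for the (3, 2) matrix A.T.
    A_T, *_ = np.linalg.lstsq(X, dst_tri, rcond=None)
    return A_T.T                                       # (2, 3) affine matrix

def warp_points(A, pts):
    """Apply a 2x3 affine transform to (N, 2) points."""
    return pts @ A[:, :2].T + A[:, 2]

# Toy example: map the unit triangle onto a shifted, scaled copy.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = np.array([[2.0, 1.0], [4.0, 1.0], [2.0, 3.0]])
A = affine_from_triangles(src, dst)
print(np.allclose(warp_points(A, src), dst))  # True
```

In the full pipeline, every pixel inside a triangle is warped by that triangle's transform, which is why triangles on the contour are added: they prevent discontinuities at the face boundary.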

2. DeepFace: Network Architecture

DeepFace: Network Architecture

2.1. Network

  • Input: 3D-aligned 3-channel (RGB) face image of size 152×152 pixels.
  • C1: A convolutional layer with 32 filters of size 11×11×3.
  • M2: A 3×3 max pooling layer with a stride of 2.
  • C3: A convolutional layer with 16 filters of size 9×9×16.
  • The purpose of these three layers is to extract low-level features, like simple edges and texture.
  • L4, L5, L6: These subsequent layers are locally connected: like a convolutional layer they apply a filter bank, but every location in the feature map learns a different set of filters, since different regions of an aligned image have different local statistics.
  • For example, areas between the eyes and the eyebrows exhibit very different appearance and have much higher discrimination ability compared to areas between the nose and the mouth.
  • In other words, the architecture of the DNN is customized by leveraging the fact that the input images are aligned.
  • F7, F8: Finally, these top two layers are fully connected. These layers are able to capture correlations between features captured in distant parts of the face images, e.g., position and shape of eyes and position and shape of mouth.
  • ReLU is used.
  • Dropout is applied only to the first fully-connected layer.
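To make the locally connected idea behind L4-L6 concrete, here is a minimal NumPy sketch (the sizes are illustrative only, not the paper's layer shapes): it behaves like a convolution, except that each output location owns a separate filter bank, so no weights are shared across the spatial map.

```python
import numpy as np

def locally_connected_2d(x, weights):
    """Locally connected layer: like a convolution, but with untied weights.

    x:       input feature map, shape (H, W, C_in)
    weights: one filter bank PER output location,
             shape (H_out, W_out, k, k, C_in, C_out)
    Returns an (H_out, W_out, C_out) feature map (stride 1, no padding).
    """
    H_out, W_out, k, _, C_in, C_out = weights.shape
    out = np.zeros((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i:i + k, j:j + k, :]        # local receptive field
            # Each (i, j) applies its OWN filters, unlike a conv layer.
            out[i, j] = np.tensordot(patch, weights[i, j], axes=3)
    return out

# Tiny example: 8x8 3-channel input, untied 3x3 filters, 4 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
w = rng.standard_normal((6, 6, 3, 3, 3, 4))
y = locally_connected_2d(x, w)
print(y.shape)  # (6, 6, 4)
```

This is why alignment matters for the architecture: untied filters only pay off when a given spatial location (eyes, nose, mouth) always holds the same facial region. The compute cost matches a convolution, but the parameter count is multiplied by the number of output locations.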

2.2. Training Loss

  • The output of the last fully-connected layer is fed to a K-way softmax (where K is the number of classes) which produces a distribution over the class labels.
  • Thus, the probability assigned to the k-th class is the output of the softmax function: p_k = exp(o_k) / Σ_h exp(o_h), where o_k is the k-th output of the last fully-connected layer.
  • The goal of training is to maximize the probability of the correct class (face id). Thus, cross-entropy loss is used.
  • Given an image I, the representation G(I) is then computed using the described feed-forward network. This G(I) is then normalized.
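The K-way softmax and cross-entropy loss described above can be sketched in a few lines of NumPy (toy logits, not the paper's code):

```python
import numpy as np

def softmax(o):
    """K-way softmax over the outputs o of the last fully-connected layer."""
    e = np.exp(o - o.max())          # subtract max for numerical stability
    return e / e.sum()               # p_k = exp(o_k) / sum_h exp(o_h)

def cross_entropy(o, k):
    """Loss for true class k: maximizing p_k <=> minimizing -log p_k."""
    return -np.log(softmax(o)[k])

logits = np.array([2.0, 1.0, 0.1])   # toy outputs for K = 3 identities
p = softmax(logits)
print(p.sum())                                              # 1.0
print(cross_entropy(logits, 0) < cross_entropy(logits, 2))  # True
```

The loss is lower when the logit of the correct identity dominates, which is exactly what "maximize the probability of the correct class" means in practice.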

3. Experimental Results

3.1. LFW Dataset

Comparison with the state-of-the-art on the LFW dataset
The ROC curves on the LFW dataset
  • DeepFace-single obtains 95.92% to 97.00% accuracy. (Here, unsupervised means training is performed on another dataset, with no fine-tuning on LFW.)
  • With 4 DeepFace-single models ensembled, DeepFace-ensemble obtains 97.15% to 97.35% accuracy, outperforming other SOTA approaches.
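The paper combines the distances produced by the individual networks with a non-linear SVM; as a much simpler stand-in, one can average per-model similarity scores and threshold the mean. The sketch below uses cosine similarity and an arbitrary threshold, both illustrative assumptions rather than the paper's procedure:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two face representations."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ensemble_verify(feats1, feats2, threshold=0.5):
    """Average per-model similarities and threshold the mean.

    feats1, feats2: lists of feature vectors, one per ensemble member,
    for the two face images being compared.
    """
    scores = [cosine_sim(f1, f2) for f1, f2 in zip(feats1, feats2)]
    return np.mean(scores) >= threshold, scores

# Toy check: identical representations across 4 models => "same person".
f = [np.ones(8) for _ in range(4)]
same, scores = ensemble_verify(f, f)
print(same)  # True
```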

3.2. YTF Dataset

Comparison with the state-of-the-art on the YTF dataset
The ROC curves on the YTF dataset
  • An accuracy of 91.4% is obtained, which reduces the error of the previous best methods by more than 50%.
  • Note that about 100 wrong labels for video pairs were recently announced on the YTF webpage. After these are corrected, DeepFace-single actually reaches 92.5% accuracy.

3.3. Computational Efficiency

  • Using a single-core Intel 2.2 GHz CPU, feature extraction from the raw input pixels takes 0.18 seconds.
  • Efficient warping techniques were implemented for alignment; alignment alone takes about 0.05 seconds.
  • Overall, DeepFace runs at 0.33 seconds per image.
