Review — Learning Face Representation from Scratch

CASIANet is designed, CASIA-WebFace Dataset is Constructed, Outperforms DeepFace & DeepID2

Sik-Ho Tsang
5 min read · Dec 22, 2021
Face Recognition (Image from Freepik)

Learning Face Representation from Scratch
CASIANet, by Institute of Automation, Chinese Academy of Sciences (CASIA)
2014 arXiv, over 1500 citations (Sik-Ho Tsang @ Medium)
Face Recognition, Representation Learning, Contrastive Learning

  • Data matters more than the algorithm, yet the large face datasets behind the best results were private at the time. To address this, the paper proposes a semi-automatic way to collect face images from the Internet.
  • A large scale dataset, called CASIA-WebFace, is built, which contains about 10,000 subjects and 500,000 images.
  • An 11-layer CNN is also constructed for face recognition.


  1. CASIA-WebFace Dataset
  2. CASIANet: Network Architecture
  3. Experimental Results

1. CASIA-WebFace Dataset

1.1. Face Image Source

A sample page of David Fincher on IMDb
  • IMDb is a well-structured website containing rich information about celebrities, such as name, gender, birthday and photos.
  • Each celebrity has an independent page on the website. A sample page is shown above, in which only the “name”, “main photo” and “photo gallery” contents are used.
  • Excluding celebrities without a “main photo”, there are 38,423 subjects and 903,304 images in total.
  • All images are then processed by a multi-view face detector; 844,126 images remain in the dataset, and 1,556,352 faces are detected.
  • Because many images appear in the “photo gallery” of several celebrities simultaneously, the actual numbers of unique images and faces are smaller than the figures above.

1.2. Identity Assignment

Two sample photos of Ben Affleck containing multiple faces
  • The goal is to assign an identity to each face and to divide the faces into groups according to their identities.
  1. Extract the feature template of each face with a pretrained face recognition engine [29].
  2. Use the “main photo” of each celebrity as its seed.
  3. Use the images containing only 1 face to augment each celebrity’s seed images.
  4. For the remaining images in the “photo gallery”, find the correspondence between faces and celebrities, constrained by similarity and name tag.
  5. Crop the faces from the images and save them into a separate folder for each celebrity. Manually check the dataset and delete the falsely grouped face images.
  6. Remove the subjects having fewer than 15 face images.
  7. Check for duplicate subjects based on the edit distance between the names in CASIA-WebFace and LFW.
  8. Remove from CASIA-WebFace the 1,043 subjects whose names also appear in LFW.
  • CASIA-WebFace can now be treated as an independent training set for LFW. It contains 10,575 subjects and 494,414 face images.
  • 493,456 detected faces are mirrored to augment the dataset, yielding 986,912 training samples.
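
Steps 7 and 8 depend on comparing subject names across the two datasets. The following is a minimal sketch of such a check, not the authors' code. It assumes a plain Levenshtein distance, a simple name normalization (lowercasing, underscores to spaces), and a hypothetical threshold `max_dist`; the review does not state which threshold or normalization was actually used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def overlapping_subjects(webface_names, lfw_names, max_dist=0):
    """Return the WebFace names within max_dist edits of any LFW name.

    max_dist is a hypothetical parameter: 0 keeps only exact matches
    after normalization, larger values also flag near-duplicates.
    """
    def norm(name):
        # Assumed normalization, chosen for this illustration only.
        return name.lower().replace("_", " ").strip()

    lfw = [norm(n) for n in lfw_names]
    return [n for n in webface_names
            if any(edit_distance(norm(n), m) <= max_dist for m in lfw)]


# Toy names for illustration only.
webface = ["Ben Affleck", "David Fincher", "Jon Smith"]
lfw = ["Ben_Affleck", "John_Smith"]
exact = overlapping_subjects(webface, lfw)
near = overlapping_subjects(webface, lfw, max_dist=1)
```

A threshold above zero trades missed duplicates for false alarms, which is one reason a manual check of flagged names is a sensible follow-up.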

2. CASIANet: Network Architecture

CASIANet: Network Architecture

2.1. Network

  • The input layer is 100×100×1, i.e., a grayscale image.
  • The proposed network includes 10 convolutional layers, 5 pooling layers and 1 fully connected layer.
  • All filters in the network are 3×3.
  • The first four pooling layers use the max operator, and the last pooling layer uses the average.
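
To see how a 100×100 input shrinks through the network, here is a rough sketch that traces only the spatial size. Several details are assumptions not stated in this review: two convolutions per stage (suggested by the 10 conv / 5 pool counts and the "Conv52" naming), stride 1 with padding 1 for the 3×3 convolutions, 2×2 max pooling with stride 2 that rounds odd sizes up, and a final average pooling that covers the whole remaining feature map.

```python
import math


def conv3x3_same(size: int) -> int:
    """3x3 convolution with stride 1 and padding 1 (assumed): size unchanged."""
    kernel, stride, padding = 3, 1, 1
    return (size + 2 * padding - kernel) // stride + 1


def pool2x2(size: int) -> int:
    """2x2 pooling with stride 2, rounding up (assumed) for odd sizes."""
    return math.ceil(size / 2)


def trace_spatial_size(input_size: int = 100, num_stages: int = 5):
    """Return the feature-map side length after each pooling layer."""
    size = input_size
    sizes = []
    for stage in range(1, num_stages + 1):
        size = conv3x3_same(conv3x3_same(size))  # two conv layers per stage
        if stage < num_stages:
            size = pool2x2(size)  # max pooling in the first four stages
        else:
            size = 1              # average pooling over the whole map (assumed)
        sizes.append(size)
    return sizes


sizes = trace_spatial_size()  # side lengths after Pool1 .. Pool5
```

Under these assumptions the side length goes 100, 50, 25, 13, 7 and finally 1, so the last pooling layer leaves one value per channel.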

2.2. Face-Related Stuffs

Softmax (identification) and Contrastive (verification) costs
  • The Pool5 layer is used as the face representation; its dimension equals the number of channels of Conv52, 320.
  • To distinguish the large number of subjects in the training set (10,575), this low-dimensional representation should fully distill the discriminative information from face images.
  • As in DeepID2, Softmax (identification) and Contrastive (verification) costs are combined to construct the objective function.
  • ReLU is used after all convolutional layers except Conv52. Because the Conv52 outputs are averaged to generate the low-dimensional face representation, they should be dense and compact.
  • In the training stage, Pool5 is used as the input of the Contrastive cost function, and Fc6 is used as the input of the Softmax cost function.
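
The combined objective can be sketched in plain Python as follows. This is an illustration under assumptions rather than the paper's exact formulation: it uses the common margin-based contrastive loss, and the `margin` and `weight` values are hypothetical hyperparameters that this review does not specify.

```python
import math


def softmax_cross_entropy(logits, label):
    """Identification cost on Fc6 outputs: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def contrastive_loss(f_i, f_j, same_identity, margin=1.0):
    """Verification cost on a pair of Pool5 features.

    Same identity: pull the features together (0.5 * d^2).
    Different identities: push them at least `margin` apart.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if same_identity:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2


def joint_loss(logits_i, label_i, logits_j, label_j, f_i, f_j,
               margin=1.0, weight=1.0):
    """Softmax cost of both samples plus the weighted contrastive cost."""
    ident = (softmax_cross_entropy(logits_i, label_i)
             + softmax_cross_entropy(logits_j, label_j))
    verif = contrastive_loss(f_i, f_j, label_i == label_j, margin)
    return ident + weight * verif
```

The identification term separates the 10,575 classes, while the verification term shapes distances between pairs directly in the 320-dimensional Pool5 space.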

3. Experimental Results

3.1. LFW

The performance of our baseline deep networks and compared methods on LFW View2
  • LFW includes 5,749 subjects and 13,233 face images. There are three main protocols for performance reporting: unsupervised, restricted and unrestricted protocol.
  • The unsupervised protocol is used to evaluate the baseline performance of the face representation, and the other two protocols are usually used to evaluate the performance of metric learning or of the whole method.
  • For all protocols, the test set is fixed, which includes 6000 face pairs in 10 splits.
  • Mean accuracy and standard error of the mean are reported.
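
The reporting described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that a pair is predicted as "same person" when its score reaches a threshold, and that the standard error uses the sample standard deviation across splits.

```python
import math
import statistics


def fold_accuracy(scores, same_labels, threshold):
    """Fraction of pairs classified correctly in one split.

    A pair is predicted 'same person' when its score >= threshold.
    """
    correct = sum((s >= threshold) == same
                  for s, same in zip(scores, same_labels))
    return correct / len(scores)


def mean_and_standard_error(fold_accuracies):
    """Mean accuracy over the splits and the standard error of that mean."""
    n = len(fold_accuracies)
    mean = statistics.mean(fold_accuracies)
    # Sample standard deviation (n - 1 in the denominator) divided by sqrt(n).
    sem = statistics.stdev(fold_accuracies) / math.sqrt(n)
    return mean, sem
```

With 6000 pairs in 10 splits, each call to `fold_accuracy` would see 600 pairs, and `mean_and_standard_error` would receive a list of 10 per-split accuracies.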

The proposed base representation is better than DeepFace, 96.13% vs. 95.92%. After tuning on LFW with PCA, the accuracy improves slightly to 96.33%.

Schemes A to E
  • Under the unrestricted protocol, the proposed single-network scheme E, using Joint Bayes on the LFW training set, achieves 97.73%, which is better than DeepFace’s 7-network ensemble (97.35%) and comparable to DeepID2’s 4-network ensemble (97.75%).
  • Note that SFC, the training set of DeepFace, is about 10× larger than CASIA-WebFace.

3.2. YTF

The performance of our methods and DeepFace on Youtube Faces (YTF).
  • Due to motion blur and a high compression ratio, the quality of images in YTF is much worse than that of web photos. For each video in YTF, 15 frames are randomly selected and their representations are extracted by the proposed deep network (DR).
  • In the training stage, the 15 frames are treated as 15 samples with the same identity.
  • In the testing stage, the similarity score of a video pair is the mean value over 15×15=225 frame pairs.
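
The test-time scoring of a video pair can be sketched like this. The sketch assumes cosine similarity as the frame-level score, which matches the direct-matching baseline but not the later PCA or Joint Bayes variants; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def video_pair_similarity(frames_a, frames_b):
    """Mean similarity over every cross-video frame pair.

    With 15 sampled frames per video this averages 15 x 15 = 225 scores.
    """
    scores = [cosine(fa, fb) for fa in frames_a for fb in frames_b]
    return sum(scores) / len(scores)
```

Averaging over all frame pairs smooths out individual blurry or badly compressed frames, at the cost of 225 comparisons per video pair.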

With direct matching by cosine similarity, the base representation achieves 88.00% accuracy on YTF.

  • After transforming the representation with PCA on YTF, the accuracy improves markedly to 90.60%.
  • When the representation is further tuned with Joint Bayes, the proposed method slightly outperforms DeepFace.


[2014 arXiv] [CASIANet]
Learning Face Representation from Scratch

Face Recognition

2005 [Chopra CVPR’05] 2014 [DeepFace] [DeepID2] [CASIANet] 2015 [FaceNet] 2016 [N-pair-mc Loss]
