Review: Improved Deep Metric Learning with Multi-class N-pair Loss Objective (N-pair-mc Loss)

Faster Convergence Using Multi-class N-pair Loss

Sik-Ho Tsang
9 min readNov 14, 2021

In this story, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, by NEC Laboratories America, Inc., is reviewed. In this paper:

  • Triplet loss is generalized by allowing joint comparison among more than one negative examples.
  • An efficient batch construction strategy using only N pairs of examples, instead of (N+1)N.

This is a paper in 2016 NeurIPS with over 900 citations. (Sik-Ho Tsang @ Medium)


  1. Contrastive Loss & Triplet Loss
  2. (N+1)-Tuplet Loss for Multiple Negative Examples
  3. N-pair Loss as Efficient Batch Construction Method
  4. Hard Negative Class Mining & Regularization
  5. Experimental Results

1. Contrastive Loss & Triplet Loss

Deep Metric Learning with Triplet Loss
  • Let xX be an input data and y ∈ {1, …, L} be its output label.
  • x+ and x- denotes positive and negative examples of x, meaning that x and x+ are from the same class and x- is from different class to x.

1.1. Contrastive Loss

  • Contrastive loss takes pairs of examples as input and trains a network to predict whether two inputs are from the same class or not.
  • where m is a margin parameter imposing the distance between examples from different classes to be larger than m.

1.2. Triplet Loss

  • Triplet loss shares a similar spirit to contrastive loss, but is composed of triplets, each consisting of a query, a positive example (to the query), and a negative example:
  • Compared to contrastive loss, triplet loss only requires the difference of (dis-)similarities between positive and negative examples to the query point to be larger than a margin m.
  • Triplet loss pulls a positive example f+ away while pushing one negative example f- at a time.
  • Contrastive loss or triplet loss has been used for numerous applications such as face recognition and image retrieval such as Chopra CVPR’05, DrLIM, DeepFace, DeepID2, FaceNet.

Such frameworks often suffer from slow convergence and poor local optima, partially due to that the loss function employs only one negative example while not interacting with the other negative classes per each update.

  • Hard negative data mining could alleviate the problem, but it is expensive to evaluate embedding vectors in deep learning framework during hard negative example search.

2. (N+1)-Tuplet Loss for Multiple Negative Examples

Deep Metric Learning with (N+1)-Tuplet Loss
  • As shown above, (N+1)-tuplet loss pushes N-1 negative examples all at once, based on their similarity to the input example.
  • Consider an (N+1)-tuplet of training examples {x, x+, x1, …, xN-1}.
  • x+ is a positive example to x and {x1, …, xN-1} are negative. The (N+1)-tuplet loss is:
  • When N=2, the corresponding (2+1)-tuplet loss highly resembles the triplet loss as there is only one negative example for each pair of input and positive examples:
  • When N>2, It is further argued the advantages of (N+1)-tuplet loss over triplet loss. (N+1)-tuplet loss is compared with the triplet loss in terms of partition function estimation of an ideal (L+1)-tuplet loss, where an (L+1)-tuplet loss coupled with a single example per negative class can be written as follows:
  • Recall that L is the total number of classes.
  • Now, the above equation is similar to the multi-class logistic loss (i.e., softmax loss).

The partition function corresponding to the (N+1)-tuplet approximates that of (L+1)-tuplet, and larger the value of N, more accurate the approximation. Therefore, it naturally follows that (N+1)-tuplet loss is a better approximation than the triplet loss to an ideal (L+1)-tuplet loss.

  • However, it quickly becomes intractable when scaling up since the number of examples to evaluate in each batch grows in quadratic to the number of tuplets and their length N.
  • To overcome this, an efficient batch construction method is proposed that only requires 2N examples instead of (N+1)N to build N tuplets of length N+1.

3. N-pair Loss as Efficient Batch Construction Method

Triplet loss, (N+1)-tuplet loss, and multi-class N-pair loss with training batch construction
  • (a) Triplet Loss: For one f, there is one f+ and one f-. Batch size of N, one batch needs N of f, there is N of f+ and N of f-.
  • (b) (N+1)-Tuplet Loss: For one f, there is one f+ and N-1 f-. In total, N+1 examples. When the batch size of SGD is N, there are N(N+1) examples to be passed through f at one update.
  • Since the number of examples to evaluate for each batch grows in quadratic way, it again becomes impractical to scale the training for a very deep convolutional networks.
  • (c) N-pair-mc Loss: The multi-class N-pair loss (N-pair-mc), can be formulated as:
  • The proposed N-pair-mc loss is a novel framework consisting of two indispensable components: the (N+1)-tuplet loss, as the building block loss function, and the N-pair construction, as the key to enable highly scalable training.

That means each positive f+ for each f would becomes f- for other f, as shown in the above figure (c).

  • The tuplet batch construction is not specific to the (N+1)-tuplet loss, it can form the N-pair loss. For example, when integrated into the standard triplet loss, we obtain the following one-vs-one N-pair loss (N-pair-ovo):

4. Hard Negative Class Mining & Regularization

  • The hard negative data mining is considered as an essential component to many triplet-based distance metric learning algorithms.
  • Here, negative “class” mining is proposed , as opposed to negative “instance” mining, which greedily selects negative classes in a relatively efficient manner.
  • The negative class mining for N-pair loss can be executed as follows:
  1. Evaluate Embedding Vectors: choose randomly a large number of output classes C; for each class, randomly pass a few (one or two) examples to extract their embedding vectors.
  2. Select Negative Classes: select one class randomly from C classes from step 1. Next, greedily add a new class that violates triplet constraint the most w.r.t. the selected classes till we reach N classes. When a tie appears, we randomly pick one of tied classes.
  3. Finalize N-pair: draw two examples from each selected class from step 2.
  • Besides, L2-norm regularization is used to regularize the L2 norm of the embedding vectors to be small.

5. Experimental Results

  • For all our experiments except for the face verification, we use ImageNet pretrained GoogLeNet, for network initialization.
  • For face verification, CasiaNet is used but trained from scratch without the last fully-connected layer for softmax classification.

5.1. Fine-Grained Visual Object Recognition and Verification

  • Car-333 dataset is composed of 164,863 images of cars from 333 model categories collected from the internet. The dataset is split into 157,023 images for training and 7,840 for testing.
  • Flower-610 dataset contains 61,771 images of flowers from 610 different flower species and among all collected, 58,721 images are used for training and 3,050 for testing.
Mean recognition and verification accuracy with standard error on the test set of Car-333 and Flower-610 datasets
  • The recognition accuracy of all models are evaluated using kNN classifier while for models using softmax, the recognition accuracy are evaluated using softmax classifier(+).
  • The verification accuracy (VRF) is evaluated at different numbers of negative examples.

We observe consistent improvement of 72-pair loss models over triplet loss models. Although the negative data mining could bring substantial improvement to the baseline models, the performance is not as competitive as 72-pair loss models

  • Between N-pair loss models, the multi-class loss (72-pair-mc) shows better performance than the one-vs-one loss (72-pair-ovo).
  • When it compares to the softmax loss, the recognition performance of the 72-pair-mc loss models are competitive, showing slightly worse on Car-333 but better on Flower-610 datasets.
  • However, the performance of softmax loss model breaks down severely on the verification task. It is argued that the representation of the model trained with classification loss is not optimal for verification tasks.

5.2. Distance Metric Learning for Unseen Object Recognition

  • 3 datasets are used where the object categories between train and test sets are disjoint.
  • Stanford Online Product dataset is composed of 120,053 images from 22,634 online product categories, and is partitioned into 59,551 images of 11,318 categories for training and 60,502 images of 11,316 categories for testing.
  • Stanford Car-196 dataset is composed of 16,185 images of cars from 196 model categories. The first 98 model categories are used for training and the rest for testing.
  • Caltech-UCSD Birds (CUB-200) dataset is composed of 11,788 images of birds from 200 different species. Similarly, the first 100 categories are used for training.
F1, NMI, and recall@K scores on the test set of online product, Car-196, and CUB-200 datasets
  • The triplet loss model performs the worst among all losses considered. Negative data mining can alleviate the model to escape from the local optimum.
  • But the N-pair loss models outperforms even without additional computational cost for negative data mining.

5.3. Face Verification and Identification

  • Face verification and identification are the problems that determines whether two face images are the same identities (verification) and a problem that identifies the face image of the same identity from the gallery with many negative examples (identification).
  • Networks are trained on the WebFace database, which is composed of 494,414 images from 10,575 identities, and the quality of embedding networks trained with different metric learning objectives are evaluated on Labeled Faces in the Wild (LFW) database.
Mean Verification Accuracy (MRF), Rank-1 Accuracy, and DIR@FAR=1% rate of open-set identification on LFW dataset

and DIR@FAR=1% rate of open-set identification [1] on LFW dataset.

  • The triplet loss model shows 95.88% verification accuracy, but the performance breaks down on identification tasks.

The N-pair-mc loss model improves the performance by a significant margin. Furthermore, it is observed additional improvement by increasing N to 320, obtaining 98.33% for verification, 90.17% for closed-set and 71.76% for open-set identification accuracy.

  • The N-pair-ovo loss model performs much worse than the N-pair-mc loss on this problem.
Training curve of triplet, 192-pair-ovo, and 192-pair-mc loss models onWebFace database
  • It is observed that significantly faster training progress using 192-pair-mc loss than triplet loss; it only takes 15k iterations to reach at the loss at convergence of triplet loss model (240k iteration).

There are still experimental results not yet presented here, please feel free to read the paper. And this kind of loss for contrastive learning is somehow quite related to self-supervised learning. Deep metric learning is a big topic, but I put it at the category of face recognition first.



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.