# Review: Improved Deep Metric Learning with Multi-class N-pair Loss Objective (N-pair-mc Loss)

**Faster Convergence Using Multi-class N-pair Loss**

--

In this story, **Improved Deep Metric Learning with Multi-class N-pair Loss Objective**, by NEC Laboratories America, Inc., is reviewed. In this paper:

**Triplet loss is generalized**by allowing joint comparison among**more than one negative examples**.**An efficient batch construction strategy**using**only**, instead of (*N*pairs of examples*N*+1)*N*.

This is a paper in **2016 NeurIPS** with over **900 citations**. (Sik-Ho Tsang @ Medium)

# Outline

**Contrastive Loss & Triplet Loss****(N+1)-Tuplet Loss for Multiple Negative Examples****N-pair Loss as Efficient Batch Construction Method****Hard Negative Class Mining & Regularization****Experimental Results**

**1. Contrastive Loss & Triplet Loss**

- Let
*x*∈*X***input data**andbe its*y*∈ {1, …,*L*}**output label**. , meaning that*x*+ and*x*- denotes positive and negative examples of*x**x*and*x*+ are from the same class and*x*- is from different class to*x*.

## 1.1. **Contrastive Loss**

**Contrastive loss**takes pairs of examples as input and trains a network to**predict whether two inputs are from the same class or not**.

- where
is a*m***margin**parameter imposing the distance between examples from different classes to be larger than*m*.

## 1.2. Triplet Loss

**Triplet loss**shares a similar spirit to contrastive loss, but is composed of triplets,**each consisting of a query, a positive example (to the query), and a negative example:**

- Compared to contrastive loss, triplet loss only requires the difference of (dis-)similarities between positive and negative examples to the query point to be larger than a margin
*m*. **Triplet loss****pulls a positive example**while*f*+ away**pushing one negative example***f-***Contrastive loss**or**triplet loss**has been used for numerous applications such as face recognition and image retrieval such as Chopra CVPR’05, DrLIM, DeepFace, DeepID2, FaceNet.

Such frameworks often suffer from

slow convergenceandpoor local optima, partially due to that the loss function employsonly one negative examplewhile not interacting with the other negative classes per each update.

**Hard negative data mining**could alleviate the problem, but it is**expensive**to evaluate embedding vectors in deep learning framework during**hard negative example search**.

**2. (N+1)-Tuplet Loss for **Multiple Negative Examples

- As shown above,
**(**, based on their similarity to the input example.*N*+1)-tuplet loss pushes*N*-1 negative examples all at once - Consider an (
*N*+1)-tuplet of training examples {*x*,*x*+,*x*1, …,*xN-*1}. is a*x*+**positive**example to*x*and**{**are*x*1, …,*xN-*1}**negative**. The (*N*+1)-tuplet loss is:

- When
*N*=2, the corresponding (2+1)-tuplet loss highly resembles the triplet loss as there is only one negative example for each pair of input and positive examples:

- When
*N*>2, It is further argued the advantages of (*N*+1)-tuplet loss over triplet loss. (*N*+1)-tuplet loss is compared with the triplet loss in terms of partition function estimation of an ideal (*L*+1)-tuplet loss, where an**(**coupled with*L*+1)-tuplet loss**a single example per negative class**can be written as follows:

- Recall that
*L*is the total number of classes. - Now, the above equation is
**similar to the multi-class logistic loss**(i.e., softmax loss).

The partition function corresponding to the (, andN+1)-tuplet approximates that of (L+1)-tupletlarger the value of. Therefore, it naturally follows thatN, more accurate the approximation(N+1)-tuplet loss is a better approximation than the triplet loss to an ideal (L+1)-tuplet loss.

- However, it quickly becomes
**intractable when scaling up**since the number of examples to evaluate in each batch grows in quadratic to the number of tuplets and their length*N*. - To overcome this, an
**efficient batch construction method**is proposed that only requires 2*N*examples instead of (*N*+1)*N*to build*N*tuplets of length*N*+1.

# 3. N-pair Loss as Efficient B**atch Construction Method**

**(a) Triplet Loss**: For one*f*, there is one*f*+ and one*f*-. Batch size of*N*, one batch needs*N*of*f*, there is*N*of*f*+ and*N*of*f*-.**(b) (**: For one*N*+1)-Tuplet Loss*f*, there is one*f*+ and*N*-1*f*-. In total,*N*+1 examples. When the batch size of SGD is*N*,**there are**.*N*(*N*+1) examples to be passed through*f*at one update- Since the number of examples to evaluate for each batch grows in quadratic way, it again becomes
**impractical**to scale the training for a very deep convolutional networks. **(c) N-pair-mc Loss**: The multi-class N-pair loss (N-pair-mc), can be formulated as:

- The proposed
*N*-pair-mc loss is a novel framework consisting of two indispensable components: the**(**, as the building block loss function, and the*N*+1)-tuplet loss, as the key to enable highly scalable training.*N*-pair construction

That means

each positivef+ for eachfwould becomesf- for otherf, as shown in the above figure (c).

- The tuplet batch construction is not specific to the (
*N*+1)-tuplet loss, it can form the. For example,*N*-pair loss**when integrated into the standard triplet loss**, we obtain the following**one-vs-one**:*N*-pair loss (*N*-pair-ovo)

**4. Hard Negative Class Mining & Regularization**

- The hard negative data mining is considered as an essential component to many triplet-based distance metric learning algorithms.
- Here,
**negative “class” mining is proposed**, as opposed to negative “instance” mining, which greedily selects negative classes in a relatively efficient manner. **The negative class mining for**can be executed as follows:*N*-pair loss

**Evaluate Embedding Vectors**: choose randomly a large number of output classes*C*; for each class, randomly pass a few (one or two) examples to extract their embedding vectors.**Select Negative Classes**: select one class randomly from C classes from step 1. Next, greedily add a new class that violates triplet constraint the most w.r.t. the selected classes till we reach*N*classes. When a tie appears, we randomly pick one of tied classes.**Finalize**: draw two examples from each selected class from step 2.*N*-pair

- Besides,
**L2-norm regularization**is used to regularize the L2 norm of the embedding vectors to be small.

**5. Experimental Results**

- For all our experiments except for the face verification, we use ImageNet pretrained GoogLeNet, for network initialization.
- For face verification, CasiaNet is used but trained from scratch without the last fully-connected layer for softmax classification.

## 5.1. Fine-Grained Visual Object Recognition and Verification

**Car-333**dataset is composed of 164,863 images of cars from 333 model categories collected from the internet. The dataset is split into 157,023 images for training and 7,840 for testing.**Flower-610**dataset contains 61,771 images of flowers from 610 different flower species and among all collected, 58,721 images are used for training and 3,050 for testing.

- The
**recognition accuracy**of all models are evaluated using**kNN classifier**while for models using softmax, the recognition accuracy**softmax**classifier(+). - The
**verification accuracy (VRF)**is evaluated at**different numbers of negative examples**.

We observe

consistent improvement of 72-pair loss models over triplet loss models.Although the negative data mining could bring substantial improvement to the baseline models, the performance is not as competitive as 72-pair loss models

- Between
*N*-pair loss models,**the multi-class loss (72-pair-mc) shows better performance than the one-vs-one loss (72-pair-ovo)**. **When it compares to the softmax loss**, the recognition performance of the**72-pair-mc loss models are competitive**, showing slightly worse on Car-333 but better on Flower-610 datasets.- However,
**the performance of softmax loss model breaks down severely on the verification task.**It is argued that the representation of the model trained with classification loss is not optimal for verification tasks.

## 5.2. Distance Metric Learning for Unseen Object Recognition

- 3 datasets are used where
**the object categories between train and test sets are disjoint**. **Stanford Online Product dataset**is composed of**120,053 images from 22,634 online product categories**, and is partitioned into 59,551 images of 11,318 categories for training and 60,502 images of 11,316 categories for testing.**Stanford Car-196 dataset**is composed of**16,185 images of cars from 196 model categories.**The first 98 model categories are used for training and the rest for testing.**Caltech-UCSD Birds (CUB-200) dataset**is composed of**11,788 images of birds from 200 different species.**Similarly, the first 100 categories are used for training.

- The triplet loss model performs the worst among all losses considered. Negative data mining can alleviate the model to escape from the local optimum.
- But the
.*N*-pair loss models outperforms even without additional computational cost for negative data mining

## 5.3. Face Verification and Identification

- Face verification and identification are the problems that determines
**whether two face images are the same identities (verification)**and a problem that**identifies the face image of the same identity from the gallery with many negative examples (identification)**. - Networks are
**trained on the WebFace database**, which is composed of 494,414 images from 10,575 identities, and the quality of embedding networks trained with different metric learning objectives are**evaluated on Labeled Faces in the Wild (LFW) database**.

and DIR@FAR=1% rate of open-set identification [1] on LFW dataset.

**The triplet loss**model shows 95.88% verification accuracy, but the performance**breaks down on identification tasks.**

TheFurthermore, it is observed additional improvement by increasingN-pair-mc loss model improves the performance by a significant margin.Nto 320, obtaining 98.33% for verification, 90.17% for closed-set and 71.76% for open-set identification accuracy.

**The**on this problem.*N*-pair-ovo loss model performs much worse than the N-pair-mc loss

- It is observed that
**significantly faster training progress using 192-pair-mc loss than triplet loss**; it only takes 15k iterations to reach at the loss at convergence of triplet loss model (240k iteration).

There are still experimental results not yet presented here, please feel free to read the paper. And this kind of loss for **contrastive learning **is somehow quite related to** self-supervised learning**. Deep metric learning is a big topic, but I put it at the category of face recognition first.

## Reference

[2016 NeurIPS] [N-pair-mc Loss]

Improved Deep Metric Learning with Multi-class N-pair Loss Objective

## Face Recognition

**2005** [Chopra CVPR’05] **2014** [DeepFace] [DeepID2] **2015 **[FaceNet] **2016 **[N-pair-mc Loss]